You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Ben Kietzman <be...@rstudio.com> on 2019/07/10 18:40:15 UTC

[Discuss] Are Union.typeIds worth keeping?

The Union.typeIds property is confusing and its utility is unclear. I'd
like to remove it (or at least document it better). Unless anyone knows a
real advantage for keeping it I plan to assemble a PR to drop it from the
format and the C++ implementation.

ARROW-257 ( resolved by pull request
https://github.com/apache/arrow/pull/143 ) extended Unions with an optional
typeIds property (in the C++ implementation, this is
UnionType::type_codes). Prior to that pull request each element (int8) in
the type_ids (second) buffer of a union array was the index of a child
array. Thus a type_ids buffer beginning with 5 indicated that the union
array began with a value from child_data[5]. After that change to interpret
a type_id of 5 one must look through the typeIds property and the index at
which a 5 is found is the index of the corresponding child array.

The change was made to allow unused child arrays to be dropped; for example
if a union type were predefined with 64 members then an array of nearly
identical type containing only int32 and utf8 values would only be required
to have two child arrays. Note: the union types are not exactly identical
even though they contain identical members; their typeIds properties will
differ.

However unused child arrays can be replaced by null arrays (which are
almost equally lightweight as they require no heap allocation). I'm also
unaware of a use case for predefined type_ids; if they are application
specific then I think it's out of scope for arrow to maintain a child_index
<-> type_id mapping. It seems that the optimization has questionable merit
and does not warrant the added complexity.

Re: [Discuss] Are Union.typeIds worth keeping?

Posted by Ben Kietzman <be...@rstudio.com>.
Thanks all, this is helpful and I've added
https://issues.apache.org/jira/browse/ARROW-5933 to improve the
documentation for future developers.

On Wed, Jul 10, 2019 at 11:09 PM Jacques Nadeau <ja...@apache.org> wrote:

> I was also supportive of this pattern. We definitely have used it before to
> optimize in certain cases.
>
> On Wed, Jul 10, 2019, 2:40 PM Wes McKinney <we...@gmail.com> wrote:
>
> > On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman <be...@rstudio.com>
> > wrote:
> > >
> > > In this scenario option A (include child arrays for each child type,
> even
> > > if that type is not observed) seems like the clearly correct choice to
> > me.
> > > It yiedls a more intuitive layout for the union array and incurs no
> > runtime
> > > overhead (since the absent children are empty/null arrays).
> >
> > I am not sure this is right. The child arrays still occupy memory in
> > the Sparse Union case (where all child arrays have the same length).
> > In order to satisfy the requirements of the IPC protocol, the child
> > arrays need to be of the same type as the types in the union. In the
> > Dense Union case, the not-present children will have length 0.
> >
> > >
> > > > why not allow them to be flexible in this regard?
> > >
> > > I would say that if code doesn't add anything except cognitive overhead
> > > then it's worthwhile to remove it.
> >
> > The cognitive overhead comes for the Arrow library implementer --
> > users of the libraries aren't required to deal with this detail
> > necessarily. The type ids are optional, after all. Even if it is
> > removed, you still have ids, so whether it's
> >
> > type 0, id=0
> > type 1, id=1
> > type 2, id=2
> >
> > or
> >
> > type 0, id=3
> > type 1, id=7
> > type 2, id=10
> >
> > the difference is in the second case, you have to look up the code
> > corresponding to each type rather than assuming that the type's
> > position and its code are the same.
> >
> > In processing, branching should occur at the Type level, so a function
> > to process a child looks like
> >
> > ProcessChild(child, child_id, ...)
> >
> > In either case you have to match a child with its id that appears in the
> > data.
> >
> > Anyway, since Julien and I are responsible for introducing this
> > concept in the early stages of the project I'm interested to hear more
> > from others. Note that this doesn't serve to resolve the
> > Union-of-Nested-Types problem that has prevented the development of
> > integration tests between Java and C++.
> >
> > >
> > > On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > hi Ben,
> > > >
> > > > Some applications use static type ids for various data types. Let's
> > > > consider one possibility:
> > > >
> > > > BOOLEAN: 0
> > > > INT32: 1
> > > > DOUBLE: 2
> > > > STRING (UTF8): 3
> > > >
> > > > If you were parsing JSON and constructing unions while parsing, you
> > > > might encounter some types, but not all. So if we _don't_ have the
> > > > option of having type ids in the metadata then we are left with some
> > > > unsatisfactory options:
> > > >
> > > > A: Include all types in the resulting union, even if they are
> > unobserved,
> > > > or
> > > > B: Assign type id dynamically to types when they are observed
> > > >
> > > > Option B: is potentially bad because it does not parallelize across
> > > > threads or nodes.
> > > >
> > > > So I do think the feature is useful. It does make the implementations
> > > > of unions more complex, though, so it does not come without cost. But
> > > > unions are already the most complex tool we have in our nested data
> > > > toolbox, so why not allow them to be flexible in this regard?
> > > >
> > > > In any case I'm -0 on making changes, but would be interested in
> > > > feedback of others if there is strong sentiment about deprecating the
> > > > feature.
> > > >
> > > > - Wes
> > > >
> > > > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <
> ben.kietzman@rstudio.com
> > >
> > > > wrote:
> > > > >
> > > > > The Union.typeIds property is confusing and its utility is unclear.
> > I'd
> > > > > like to remove it (or at least document it better). Unless anyone
> > knows a
> > > > > real advantage for keeping it I plan to assemble a PR to drop it
> > from the
> > > > > format and the C++ implementation.
> > > > >
> > > > > ARROW-257 ( resolved by pull request
> > > > > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> > > > optional
> > > > > typeIds property (in the C++ implementation, this is
> > > > > UnionType::type_codes). Prior to that pull request each element
> > (int8) in
> > > > > the type_ids (second) buffer of a union array was the index of a
> > child
> > > > > array. Thus a type_ids buffer beginning with 5 indicated that the
> > union
> > > > > array began with a value from child_data[5]. After that change to
> > > > interpret
> > > > > a type_id of 5 one must look through the typeIds property and the
> > index
> > > > at
> > > > > which a 5 is found is the index of the corresponding child array.
> > > > >
> > > > > The change was made to allow unused child arrays to be dropped; for
> > > > example
> > > > > if a union type were predefined with 64 members then an array of
> > nearly
> > > > > identical type containing only int32 and utf8 values would only be
> > > > required
> > > > > to have two child arrays. Note: the union types are not exactly
> > identical
> > > > > even though they contain identical members; their typeIds
> properties
> > will
> > > > > differ.
> > > > >
> > > > > However unused child arrays can be replaced by null arrays (which
> are
> > > > > almost equally lightweight as they require no heap allocation). I'm
> > also
> > > > > unaware of a use case for predefined type_ids; if they are
> > application
> > > > > specific then I think it's out of scope for arrow to maintain a
> > > > child_index
> > > > > <-> type_id mapping. It seems that the optimization has
> questionable
> > > > merit
> > > > > and does not warrant the added complexity.
> > > >
> >
>

Re: [Discuss] Are Union.typeIds worth keeping?

Posted by Jacques Nadeau <ja...@apache.org>.
I was also supportive of this pattern. We definitely have used it before to
optimize in certain cases.

On Wed, Jul 10, 2019, 2:40 PM Wes McKinney <we...@gmail.com> wrote:

> On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman <be...@rstudio.com>
> wrote:
> >
> > In this scenario option A (include child arrays for each child type, even
> > if that type is not observed) seems like the clearly correct choice to
> me.
> > It yiedls a more intuitive layout for the union array and incurs no
> runtime
> > overhead (since the absent children are empty/null arrays).
>
> I am not sure this is right. The child arrays still occupy memory in
> the Sparse Union case (where all child arrays have the same length).
> In order to satisfy the requirements of the IPC protocol, the child
> arrays need to be of the same type as the types in the union. In the
> Dense Union case, the not-present children will have length 0.
>
> >
> > > why not allow them to be flexible in this regard?
> >
> > I would say that if code doesn't add anything except cognitive overhead
> > then it's worthwhile to remove it.
>
> The cognitive overhead comes for the Arrow library implementer --
> users of the libraries aren't required to deal with this detail
> necessarily. The type ids are optional, after all. Even if it is
> removed, you still have ids, so whether it's
>
> type 0, id=0
> type 1, id=1
> type 2, id=2
>
> or
>
> type 0, id=3
> type 1, id=7
> type 2, id=10
>
> the difference is in the second case, you have to look up the code
> corresponding to each type rather than assuming that the type's
> position and its code are the same.
>
> In processing, branching should occur at the Type level, so a function
> to process a child looks like
>
> ProcessChild(child, child_id, ...)
>
> In either case you have to match a child with its id that appears in the
> data.
>
> Anyway, since Julien and I are responsible for introducing this
> concept in the early stages of the project I'm interested to hear more
> from others. Note that this doesn't serve to resolve the
> Union-of-Nested-Types problem that has prevented the development of
> integration tests between Java and C++.
>
> >
> > On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > hi Ben,
> > >
> > > Some applications use static type ids for various data types. Let's
> > > consider one possibility:
> > >
> > > BOOLEAN: 0
> > > INT32: 1
> > > DOUBLE: 2
> > > STRING (UTF8): 3
> > >
> > > If you were parsing JSON and constructing unions while parsing, you
> > > might encounter some types, but not all. So if we _don't_ have the
> > > option of having type ids in the metadata then we are left with some
> > > unsatisfactory options:
> > >
> > > A: Include all types in the resulting union, even if they are
> unobserved,
> > > or
> > > B: Assign type id dynamically to types when they are observed
> > >
> > > Option B: is potentially bad because it does not parallelize across
> > > threads or nodes.
> > >
> > > So I do think the feature is useful. It does make the implementations
> > > of unions more complex, though, so it does not come without cost. But
> > > unions are already the most complex tool we have in our nested data
> > > toolbox, so why not allow them to be flexible in this regard?
> > >
> > > In any case I'm -0 on making changes, but would be interested in
> > > feedback of others if there is strong sentiment about deprecating the
> > > feature.
> > >
> > > - Wes
> > >
> > > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <ben.kietzman@rstudio.com
> >
> > > wrote:
> > > >
> > > > The Union.typeIds property is confusing and its utility is unclear.
> I'd
> > > > like to remove it (or at least document it better). Unless anyone
> knows a
> > > > real advantage for keeping it I plan to assemble a PR to drop it
> from the
> > > > format and the C++ implementation.
> > > >
> > > > ARROW-257 ( resolved by pull request
> > > > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> > > optional
> > > > typeIds property (in the C++ implementation, this is
> > > > UnionType::type_codes). Prior to that pull request each element
> (int8) in
> > > > the type_ids (second) buffer of a union array was the index of a
> child
> > > > array. Thus a type_ids buffer beginning with 5 indicated that the
> union
> > > > array began with a value from child_data[5]. After that change to
> > > interpret
> > > > a type_id of 5 one must look through the typeIds property and the
> index
> > > at
> > > > which a 5 is found is the index of the corresponding child array.
> > > >
> > > > The change was made to allow unused child arrays to be dropped; for
> > > example
> > > > if a union type were predefined with 64 members then an array of
> nearly
> > > > identical type containing only int32 and utf8 values would only be
> > > required
> > > > to have two child arrays. Note: the union types are not exactly
> identical
> > > > even though they contain identical members; their typeIds properties
> will
> > > > differ.
> > > >
> > > > However unused child arrays can be replaced by null arrays (which are
> > > > almost equally lightweight as they require no heap allocation). I'm
> also
> > > > unaware of a use case for predefined type_ids; if they are
> application
> > > > specific then I think it's out of scope for arrow to maintain a
> > > child_index
> > > > <-> type_id mapping. It seems that the optimization has questionable
> > > merit
> > > > and does not warrant the added complexity.
> > >
>

Re: [Discuss] Are Union.typeIds worth keeping?

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Jul 10, 2019 at 3:57 PM Ben Kietzman <be...@rstudio.com> wrote:
>
> In this scenario option A (include child arrays for each child type, even
> if that type is not observed) seems like the clearly correct choice to me.
> It yiedls a more intuitive layout for the union array and incurs no runtime
> overhead (since the absent children are empty/null arrays).

I am not sure this is right. The child arrays still occupy memory in
the Sparse Union case (where all child arrays have the same length).
In order to satisfy the requirements of the IPC protocol, the child
arrays need to be of the same type as the types in the union. In the
Dense Union case, the not-present children will have length 0.

>
> > why not allow them to be flexible in this regard?
>
> I would say that if code doesn't add anything except cognitive overhead
> then it's worthwhile to remove it.

The cognitive overhead comes for the Arrow library implementer --
users of the libraries aren't required to deal with this detail
necessarily. The type ids are optional, after all. Even if it is
removed, you still have ids, so whether it's

type 0, id=0
type 1, id=1
type 2, id=2

or

type 0, id=3
type 1, id=7
type 2, id=10

the difference is in the second case, you have to look up the code
corresponding to each type rather than assuming that the type's
position and its code are the same.

In processing, branching should occur at the Type level, so a function
to process a child looks like

ProcessChild(child, child_id, ...)

In either case you have to match a child with its id that appears in the data.

Anyway, since Julien and I are responsible for introducing this
concept in the early stages of the project I'm interested to hear more
from others. Note that this doesn't serve to resolve the
Union-of-Nested-Types problem that has prevented the development of
integration tests between Java and C++.

>
> On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Ben,
> >
> > Some applications use static type ids for various data types. Let's
> > consider one possibility:
> >
> > BOOLEAN: 0
> > INT32: 1
> > DOUBLE: 2
> > STRING (UTF8): 3
> >
> > If you were parsing JSON and constructing unions while parsing, you
> > might encounter some types, but not all. So if we _don't_ have the
> > option of having type ids in the metadata then we are left with some
> > unsatisfactory options:
> >
> > A: Include all types in the resulting union, even if they are unobserved,
> > or
> > B: Assign type id dynamically to types when they are observed
> >
> > Option B: is potentially bad because it does not parallelize across
> > threads or nodes.
> >
> > So I do think the feature is useful. It does make the implementations
> > of unions more complex, though, so it does not come without cost. But
> > unions are already the most complex tool we have in our nested data
> > toolbox, so why not allow them to be flexible in this regard?
> >
> > In any case I'm -0 on making changes, but would be interested in
> > feedback of others if there is strong sentiment about deprecating the
> > feature.
> >
> > - Wes
> >
> > On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <be...@rstudio.com>
> > wrote:
> > >
> > > The Union.typeIds property is confusing and its utility is unclear. I'd
> > > like to remove it (or at least document it better). Unless anyone knows a
> > > real advantage for keeping it I plan to assemble a PR to drop it from the
> > > format and the C++ implementation.
> > >
> > > ARROW-257 ( resolved by pull request
> > > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> > optional
> > > typeIds property (in the C++ implementation, this is
> > > UnionType::type_codes). Prior to that pull request each element (int8) in
> > > the type_ids (second) buffer of a union array was the index of a child
> > > array. Thus a type_ids buffer beginning with 5 indicated that the union
> > > array began with a value from child_data[5]. After that change to
> > interpret
> > > a type_id of 5 one must look through the typeIds property and the index
> > at
> > > which a 5 is found is the index of the corresponding child array.
> > >
> > > The change was made to allow unused child arrays to be dropped; for
> > example
> > > if a union type were predefined with 64 members then an array of nearly
> > > identical type containing only int32 and utf8 values would only be
> > required
> > > to have two child arrays. Note: the union types are not exactly identical
> > > even though they contain identical members; their typeIds properties will
> > > differ.
> > >
> > > However unused child arrays can be replaced by null arrays (which are
> > > almost equally lightweight as they require no heap allocation). I'm also
> > > unaware of a use case for predefined type_ids; if they are application
> > > specific then I think it's out of scope for arrow to maintain a
> > child_index
> > > <-> type_id mapping. It seems that the optimization has questionable
> > merit
> > > and does not warrant the added complexity.
> >

Re: [Discuss] Are Union.typeIds worth keeping?

Posted by Ben Kietzman <be...@rstudio.com>.
In this scenario option A (include child arrays for each child type, even
if that type is not observed) seems like the clearly correct choice to me.
It yiedls a more intuitive layout for the union array and incurs no runtime
overhead (since the absent children are empty/null arrays).

> why not allow them to be flexible in this regard?

I would say that if code doesn't add anything except cognitive overhead
then it's worthwhile to remove it.

On Wed, Jul 10, 2019 at 2:51 PM Wes McKinney <we...@gmail.com> wrote:

> hi Ben,
>
> Some applications use static type ids for various data types. Let's
> consider one possibility:
>
> BOOLEAN: 0
> INT32: 1
> DOUBLE: 2
> STRING (UTF8): 3
>
> If you were parsing JSON and constructing unions while parsing, you
> might encounter some types, but not all. So if we _don't_ have the
> option of having type ids in the metadata then we are left with some
> unsatisfactory options:
>
> A: Include all types in the resulting union, even if they are unobserved,
> or
> B: Assign type id dynamically to types when they are observed
>
> Option B: is potentially bad because it does not parallelize across
> threads or nodes.
>
> So I do think the feature is useful. It does make the implementations
> of unions more complex, though, so it does not come without cost. But
> unions are already the most complex tool we have in our nested data
> toolbox, so why not allow them to be flexible in this regard?
>
> In any case I'm -0 on making changes, but would be interested in
> feedback of others if there is strong sentiment about deprecating the
> feature.
>
> - Wes
>
> On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <be...@rstudio.com>
> wrote:
> >
> > The Union.typeIds property is confusing and its utility is unclear. I'd
> > like to remove it (or at least document it better). Unless anyone knows a
> > real advantage for keeping it I plan to assemble a PR to drop it from the
> > format and the C++ implementation.
> >
> > ARROW-257 ( resolved by pull request
> > https://github.com/apache/arrow/pull/143 ) extended Unions with an
> optional
> > typeIds property (in the C++ implementation, this is
> > UnionType::type_codes). Prior to that pull request each element (int8) in
> > the type_ids (second) buffer of a union array was the index of a child
> > array. Thus a type_ids buffer beginning with 5 indicated that the union
> > array began with a value from child_data[5]. After that change to
> interpret
> > a type_id of 5 one must look through the typeIds property and the index
> at
> > which a 5 is found is the index of the corresponding child array.
> >
> > The change was made to allow unused child arrays to be dropped; for
> example
> > if a union type were predefined with 64 members then an array of nearly
> > identical type containing only int32 and utf8 values would only be
> required
> > to have two child arrays. Note: the union types are not exactly identical
> > even though they contain identical members; their typeIds properties will
> > differ.
> >
> > However unused child arrays can be replaced by null arrays (which are
> > almost equally lightweight as they require no heap allocation). I'm also
> > unaware of a use case for predefined type_ids; if they are application
> > specific then I think it's out of scope for arrow to maintain a
> child_index
> > <-> type_id mapping. It seems that the optimization has questionable
> merit
> > and does not warrant the added complexity.
>

Re: [Discuss] Are Union.typeIds worth keeping?

Posted by Wes McKinney <we...@gmail.com>.
hi Ben,

Some applications use static type ids for various data types. Let's
consider one possibility:

BOOLEAN: 0
INT32: 1
DOUBLE: 2
STRING (UTF8): 3

If you were parsing JSON and constructing unions while parsing, you
might encounter some types, but not all. So if we _don't_ have the
option of having type ids in the metadata then we are left with some
unsatisfactory options:

A: Include all types in the resulting union, even if they are unobserved, or
B: Assign type id dynamically to types when they are observed

Option B: is potentially bad because it does not parallelize across
threads or nodes.

So I do think the feature is useful. It does make the implementations
of unions more complex, though, so it does not come without cost. But
unions are already the most complex tool we have in our nested data
toolbox, so why not allow them to be flexible in this regard?

In any case I'm -0 on making changes, but would be interested in
feedback of others if there is strong sentiment about deprecating the
feature.

- Wes

On Wed, Jul 10, 2019 at 1:40 PM Ben Kietzman <be...@rstudio.com> wrote:
>
> The Union.typeIds property is confusing and its utility is unclear. I'd
> like to remove it (or at least document it better). Unless anyone knows a
> real advantage for keeping it I plan to assemble a PR to drop it from the
> format and the C++ implementation.
>
> ARROW-257 ( resolved by pull request
> https://github.com/apache/arrow/pull/143 ) extended Unions with an optional
> typeIds property (in the C++ implementation, this is
> UnionType::type_codes). Prior to that pull request each element (int8) in
> the type_ids (second) buffer of a union array was the index of a child
> array. Thus a type_ids buffer beginning with 5 indicated that the union
> array began with a value from child_data[5]. After that change to interpret
> a type_id of 5 one must look through the typeIds property and the index at
> which a 5 is found is the index of the corresponding child array.
>
> The change was made to allow unused child arrays to be dropped; for example
> if a union type were predefined with 64 members then an array of nearly
> identical type containing only int32 and utf8 values would only be required
> to have two child arrays. Note: the union types are not exactly identical
> even though they contain identical members; their typeIds properties will
> differ.
>
> However unused child arrays can be replaced by null arrays (which are
> almost equally lightweight as they require no heap allocation). I'm also
> unaware of a use case for predefined type_ids; if they are application
> specific then I think it's out of scope for arrow to maintain a child_index
> <-> type_id mapping. It seems that the optimization has questionable merit
> and does not warrant the added complexity.