Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2019/11/25 14:52:17 UTC

Union type ids - signed or unsigned?

Hello,

The spec has the following language about union type ids:
"""
Types buffer: A buffer of 8-bit signed integers. Each type in the union
has a corresponding type id whose values are found in this buffer. A
union with more than 127 possible types can be modeled as a union of unions.
"""
https://arrow.apache.org/docs/format/Columnar.html#union-layout

However, in several places the C++ code assumes type ids are unsigned.
Java doesn't seem to implement type ids (and there is no integration
task for union types).

In the flatbuffers description, the type ids array is modeled as an
array of signed 32-bit integers.

Moreover, according to the language above, type ids should be restricted
to the [0, 127] interval?  Which one should it be?
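
To make the ambiguity concrete, here is a minimal C++ sketch (not taken from the Arrow code base; both function names are hypothetical) of how one raw byte from a types buffer reads under the signed interpretation the spec describes. Bytes 0 to 127 mean the same thing either way. A byte such as 200 is a valid id for unsigned code but becomes negative when read as signed (typically -56; the conversion was implementation-defined before C++20).

```cpp
#include <cstdint>

// Hypothetical helper: read a raw types-buffer byte as the signed 8-bit
// type id that the spec text describes.
constexpr std::int8_t AsSignedTypeId(std::uint8_t raw_byte) {
  return static_cast<std::int8_t>(raw_byte);
}

// Hypothetical helper: true when the signed and unsigned readings of the
// byte agree, i.e. the value lies in [0, 127].
constexpr bool IsUnambiguousTypeId(std::uint8_t raw_byte) {
  return raw_byte <= 127;
}
```

Restricting ids to [0, 127] would be one way to make the two readings agree for every valid buffer.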

Regards

Antoine.

Re: Union type ids - signed or unsigned?

Posted by Antoine Pitrou <an...@python.org>.
Thanks for all the answers.  The assumptions about union types in C++
code are fixed in https://github.com/apache/arrow/pull/5892

Regards

Antoine.


On 25/11/2019 at 16:41, Wes McKinney wrote:
> On Mon, Nov 25, 2019 at 9:25 AM Antoine Pitrou <so...@pitrou.net> wrote:
>>
>> On Mon, 25 Nov 2019 09:12:21 -0600
>> Wes McKinney <we...@gmail.com> wrote:
>>> On Mon, Nov 25, 2019 at 8:52 AM Antoine Pitrou <an...@python.org> wrote:
>>>>
>>>>
>>>> Hello,
>>>>
>>>> The spec has the following language about union type ids:
>>>> """
>>>> Types buffer: A buffer of 8-bit signed integers. Each type in the union
>>>> has a corresponding type id whose values are found in this buffer. A
>>>> union with more than 127 possible types can be modeled as a union of unions.
>>>> """
>>>> https://arrow.apache.org/docs/format/Columnar.html#union-layout
>>>>
>>>> However, in several places the C++ code assumes type ids are unsigned.
>>>> Java doesn't seem to implement type ids (and there is no integration
>>>> task for union types).
>>>>
>>>> In the flatbuffers description, the type ids array is modeled as an
>>>> array of signed 32-bit integers.
>>>>
>>>> Moreover, according to the language above, type ids should be restricted
>>>> to the [0, 127] interval?  Which one should it be?
>>>
>>> The (optional) type ids in the metadata provide a correspondence
>>> between the union types / children and the values found in the types
>>> buffer (data). As stated in the spec, the types buffer holds 8-bit
>>> signed integers. As I recall, the reason that we used [ Int ] in the
>>> metadata was that the Int type is thought to be easier for languages
>>> to work with in general when serializing/deserializing the metadata.
>>
>> Ok, but is there a reason the C++ code uses `std::vector<uint8_t>` for
>> the type codes?
> 
> Oversight on my part. I suggest we change it to int8_t.
> 
>> Regards
>>
>> Antoine.
>>
>>

Re: Union type ids - signed or unsigned?

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Nov 25, 2019 at 9:25 AM Antoine Pitrou <so...@pitrou.net> wrote:
>
> On Mon, 25 Nov 2019 09:12:21 -0600
> Wes McKinney <we...@gmail.com> wrote:
> > On Mon, Nov 25, 2019 at 8:52 AM Antoine Pitrou <an...@python.org> wrote:
> > >
> > >
> > > Hello,
> > >
> > > The spec has the following language about union type ids:
> > > """
> > > Types buffer: A buffer of 8-bit signed integers. Each type in the union
> > > has a corresponding type id whose values are found in this buffer. A
> > > union with more than 127 possible types can be modeled as a union of unions.
> > > """
> > > https://arrow.apache.org/docs/format/Columnar.html#union-layout
> > >
> > > However, in several places the C++ code assumes type ids are unsigned.
> > > Java doesn't seem to implement type ids (and there is no integration
> > > task for union types).
> > >
> > > In the flatbuffers description, the type ids array is modeled as an
> > > array of signed 32-bit integers.
> > >
> > > Moreover, according to the language above, type ids should be restricted
> > > to the [0, 127] interval?  Which one should it be?
> >
> > The (optional) type ids in the metadata provide a correspondence
> > between the union types / children and the values found in the types
> > buffer (data). As stated in the spec, the types buffer holds 8-bit
> > signed integers. As I recall, the reason that we used [ Int ] in the
> > metadata was that the Int type is thought to be easier for languages
> > to work with in general when serializing/deserializing the metadata.
>
> Ok, but is there a reason the C++ code uses `std::vector<uint8_t>` for
> the type codes?

Oversight on my part. I suggest we change it to int8_t.
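
As a rough illustration of what such a change involves, the sketch below converts type codes held in a `std::vector<uint8_t>` to `std::vector<int8_t>`. It is not the actual Arrow patch; the function name `ToSignedTypeCodes` is hypothetical, and rejecting codes above 127 is an assumption based on the signed 8-bit wording in the spec.

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical migration helper: convert unsigned type codes to signed ones.
// Returns std::nullopt if any code cannot be represented as a non-negative
// int8_t (assumption: valid codes lie in [0, 127]).
std::optional<std::vector<std::int8_t>> ToSignedTypeCodes(
    const std::vector<std::uint8_t>& unsigned_codes) {
  std::vector<std::int8_t> signed_codes;
  signed_codes.reserve(unsigned_codes.size());
  for (const std::uint8_t code : unsigned_codes) {
    if (code > 127) {
      return std::nullopt;  // would turn negative when stored as int8_t
    }
    signed_codes.push_back(static_cast<std::int8_t>(code));
  }
  return signed_codes;
}
```

Failing loudly on codes above 127, rather than letting them wrap silently, keeps any mismatch between the old and new representations visible.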

> Regards
>
> Antoine.
>
>

Re: Union type ids - signed or unsigned?

Posted by Antoine Pitrou <so...@pitrou.net>.
On Mon, 25 Nov 2019 09:12:21 -0600
Wes McKinney <we...@gmail.com> wrote:
> On Mon, Nov 25, 2019 at 8:52 AM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Hello,
> >
> > The spec has the following language about union type ids:
> > """
> > Types buffer: A buffer of 8-bit signed integers. Each type in the union
> > has a corresponding type id whose values are found in this buffer. A
> > union with more than 127 possible types can be modeled as a union of unions.
> > """
> > https://arrow.apache.org/docs/format/Columnar.html#union-layout
> >
> > However, in several places the C++ code assumes type ids are unsigned.
> > Java doesn't seem to implement type ids (and there is no integration
> > task for union types).
> >
> > In the flatbuffers description, the type ids array is modeled as an
> > array of signed 32-bit integers.
> >
> > Moreover, according to the language above, type ids should be restricted
> > to the [0, 127] interval?  Which one should it be?  
> 
> The (optional) type ids in the metadata provide a correspondence
> between the union types / children and the values found in the types
> buffer (data). As stated in the spec, the types buffer holds 8-bit
> signed integers. As I recall, the reason that we used [ Int ] in the
> metadata was that the Int type is thought to be easier for languages
> to work with in general when serializing/deserializing the metadata.

Ok, but is there a reason the C++ code uses `std::vector<uint8_t>` for
the type codes?

Regards

Antoine.



Re: Union type ids - signed or unsigned?

Posted by Wes McKinney <we...@gmail.com>.
On Mon, Nov 25, 2019 at 8:52 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hello,
>
> The spec has the following language about union type ids:
> """
> Types buffer: A buffer of 8-bit signed integers. Each type in the union
> has a corresponding type id whose values are found in this buffer. A
> union with more than 127 possible types can be modeled as a union of unions.
> """
> https://arrow.apache.org/docs/format/Columnar.html#union-layout
>
> However, in several places the C++ code assumes type ids are unsigned.
> Java doesn't seem to implement type ids (and there is no integration
> task for union types).
>
> In the flatbuffers description, the type ids array is modeled as an
> array of signed 32-bit integers.
>
> Moreover, according to the language above, type ids should be restricted
> to the [0, 127] interval?  Which one should it be?

The (optional) type ids in the metadata provide a correspondence
between the union types / children and the values found in the types
buffer (data). As stated in the spec, the types buffer holds 8-bit
signed integers. As I recall, the reason that we used [ Int ] in the
metadata was that the Int type is thought to be easier for languages
to work with in general when serializing/deserializing the metadata.

Functionally, these values are limited to the range [0, 127], so we
should probably add some comments about this in Schema.fbs.
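
The correspondence described above can be sketched in C++ as follows. This is only an illustration, not Arrow's implementation: `BuildChildLookup`, `ChildIndexFor`, and `kMaxTypeId` are hypothetical names, and rejecting duplicate ids is an extra assumption. The metadata ids arrive as 32-bit integers, are checked against [0, 127], and are then used to resolve the 8-bit signed values found in the types buffer to child indices.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

constexpr int kMaxTypeId = 127;  // upper bound of the functional range

// Lookup table indexed by type id; -1 marks an id that no child uses.
using ChildLookup = std::array<int, kMaxTypeId + 1>;

// Build the table from the metadata type ids, where entry i is the id
// assigned to child i. Returns std::nullopt for ids outside [0, 127] or
// (assumption) for an id that appears twice.
std::optional<ChildLookup> BuildChildLookup(
    const std::vector<std::int32_t>& metadata_type_ids) {
  ChildLookup lookup;
  lookup.fill(-1);
  for (std::size_t child = 0; child < metadata_type_ids.size(); ++child) {
    const std::int32_t id = metadata_type_ids[child];
    if (id < 0 || id > kMaxTypeId) {
      return std::nullopt;
    }
    const auto slot = static_cast<std::size_t>(id);
    if (lookup[slot] != -1) {
      return std::nullopt;
    }
    lookup[slot] = static_cast<int>(child);
  }
  return lookup;
}

// Resolve one value read from the types buffer to a child index,
// or -1 if the value is negative or not assigned to any child.
int ChildIndexFor(const ChildLookup& lookup, std::int8_t type_id) {
  if (type_id < 0) {
    return -1;
  }
  return lookup[static_cast<std::size_t>(type_id)];
}
```

For example, with metadata ids {5, 10, 7}, a types-buffer value of 10 would resolve to child 1, while a value of 3 would resolve to no child.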

> Regards
>
> Antoine.