Posted to dev@arrow.apache.org by Ryan Murray <ry...@dremio.com> on 2020/05/19 11:43:55 UTC

Sparse Union format

Hey All,

While working on https://issues.apache.org/jira/browse/ARROW-1692 I noticed
that C++ and Java handle Sparse Unions differently. I haven't found anything
in the format spec that says which one is correct, so I wanted to check with
the wider community.

C++ (and the integration tests) see sparse unions as:
name
count
VALIDITY[]
TYPE_ID[]
children[]

and Java as:
name
count
TYPE[]
children[]

The precise names may only matter for JSON reading/writing in the
integration tests, so I will ignore TYPE/TYPE_ID for now. However, the big
difference is that Java doesn't have a validity buffer and C++ does. My
understanding is that technically the validity buffer is redundant (0 type
== NULL), so I can see why Java would omit it. My question is then: which
language is 'correct'?

I suppose the actual language implementation is not entirely relevant here;
instead, 'correct' refers to what the canonical IPC schema for a sparse union
should be.
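
To make the difference concrete, here is a minimal Python sketch of the two
layouts as plain dicts. Only the field names come from the lists above; the
values, the variable names and the layout_diff helper are invented for
illustration and are not taken from either implementation.

```python
# Field names are copied from the two lists above; every value is made up.
cpp_sparse_union = {
    "name": "u",
    "count": 3,
    "VALIDITY": [1, 0, 1],
    "TYPE_ID": [1, 2, 1],
    "children": [],
}
java_sparse_union = {
    "name": "u",
    "count": 3,
    "TYPE": [1, 2, 1],
    "children": [],
}


def layout_diff(a, b):
    """Return (keys only in a, keys only in b), each sorted."""
    return sorted(a.keys() - b.keys()), sorted(b.keys() - a.keys())


only_cpp, only_java = layout_diff(cpp_sparse_union, java_sparse_union)
print(only_cpp)   # ['TYPE_ID', 'VALIDITY']
print(only_java)  # ['TYPE']
```

Setting the TYPE/TYPE_ID naming aside, VALIDITY is the only field without a
counterpart on the Java side.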

Best,
Ryan

Re: Sparse Union format

Posted by Micah Kornfield <em...@gmail.com>.

Hi Ryan,
In addition to the limitations mentioned above, another one is that only one
column of each type can participate in the union.

There are some old threads on these differences on the mailing list that
should be searchable.

Thanks,
Micah

Re: Sparse Union format

Posted by Antoine Pitrou <an...@python.org>.

Also, you may want to run the integration tests and inspect the
generated JSON file for union data; it will probably be informative
(look for type ids).
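
One way to do that inspection is sketched below in Python. It walks any
parsed JSON document and reports where the TYPE_ID or TYPE keys occur. The
find_keys helper and the sample document are hypothetical; the real
generated file will have its own structure, so load it with json.load
instead of the inline sample.

```python
import json


def find_keys(node, wanted, path="$"):
    """Yield (path, value) for every dict key in `wanted`, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}"
            if key in wanted:
                yield child, value
            yield from find_keys(value, wanted, child)
    elif isinstance(node, list):
        for i, item in enumerate(node):
            yield from find_keys(item, wanted, f"{path}[{i}]")


# Hypothetical stand-in for a generated file; the real layout may differ.
sample = json.loads(
    '{"batches": [{"columns": [{"name": "u", "count": 2,'
    ' "TYPE_ID": [5, 0], "children": []}]}]}'
)

hits = list(find_keys(sample, {"TYPE_ID", "TYPE"}))
for where, ids in hits:
    print(where, ids)
```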

Regards

Antoine.


Re: Sparse Union format

Posted by Ryan Murray <ry...@dremio.com>.

Thanks for the clarification! Next time I will read the whole document ;-)

Re: Sparse Union format

Posted by Antoine Pitrou <an...@python.org>.

As explained in the comment below:
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L91

Regards

Antoine.


Re: Sparse Union format

Posted by Ryan Murray <ry...@dremio.com>.

Thanks Antoine,

Can you just clarify what you mean by 'type ids are logical'? In my mind
type ids are strongly coupled to the types and their order in Schema.fbs
[1]. Do you mean that the order there is only a convention and we can't
assume that 0 === Null?

Best,
Ryan

[1] https://github.com/apache/arrow/blob/master/format/Schema.fbs#L235

Re: Sparse Union format

Posted by Antoine Pitrou <an...@python.org>.

On 19/05/2020 at 13:43, Ryan Murray wrote:
> Hey All,
>
> While working on https://issues.apache.org/jira/browse/ARROW-1692 I noticed
> that C++ and Java handle Sparse Unions differently. I haven't found anything
> in the format spec that says which one is correct, so I wanted to check with
> the wider community.
>
> C++ (and the integration tests) see sparse unions as:
> name
> count
> VALIDITY[]
> TYPE_ID[]
> children[]
>
> and Java as:
> name
> count
> TYPE[]
> children[]
>
> The precise names may only matter for JSON reading/writing in the
> integration tests, so I will ignore TYPE/TYPE_ID for now. However, the big
> difference is that Java doesn't have a validity buffer and C++ does. My
> understanding is that technically the validity buffer is redundant (0 type
> == NULL), so I can see why Java would omit it. My question is then: which
> language is 'correct'?

Union type ids are logical, so 0 could very well be a valid type id.
You can't assume that type 0 means a null entry.
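
A minimal Python sketch of what "logical" means here, under stated
assumptions: the union below is hypothetical, the ids 0 and 5 are chosen
arbitrarily, and names such as type_ids_for_children and value_at are
invented for illustration rather than taken from any Arrow API.

```python
# Hypothetical sparse union with two children.
type_ids_for_children = [0, 5]   # logical id assigned to child 0 and child 1
children = [
    ["a", "b", "c"],             # child 0 (say, strings)
    [10, 20, 30],                # child 1 (say, ints)
]
type_ids = [0, 5, 0]             # one logical id per slot of the union

# Map each logical id back to the position of its child.
id_to_child = {tid: idx for idx, tid in enumerate(type_ids_for_children)}


def value_at(i):
    # In a sparse union every child has the full length,
    # so slot i is read at index i of the selected child.
    return children[id_to_child[type_ids[i]]][i]


values = [value_at(i) for i in range(len(type_ids))]
print(values)  # ['a', 20, 'c']
```

Slots 0 and 2 carry type id 0 and still resolve to real values, so a type id
of 0 cannot double as a null marker.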

Regards

Antoine.