You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/06/24 14:57:28 UTC

[DISCUSS] Removing top-level validity bitmap from Union type

hi folks,

As discussed on the recent GitHub PR [1], as a means of reconciling
the long-standing cross-implementation incompatibilities with Union
types, it's been proposed to remove the top-level validity bitmap from
the Union data layout and let validity be determined exclusively by
the child arrays of the union. So the only additional data needed to
form a union are the type ids (and for the dense union, the offsets).

I do not think this change meaningfully alters the semantics of Union
types and I think it also simplifies their construction, so I would be
in favor of making it for 1.0.0.

I can create a PR with the relevant alterations but wanted to raise
the issue now so if there is consensus about doing this, that we can
act quickly to implement it.

Thanks,
Wes

[1]: https://github.com/apache/arrow/pull/7290

Re: [DISCUSS] Removing top-level validity bitmap from Union type

Posted by Wes McKinney <we...@gmail.com>.
I'm going to start a vote about this tomorrow if there are not more
comments about this.

On Thu, Jun 25, 2020 at 5:41 PM Wes McKinney <we...@gmail.com> wrote:
>
> I updated the PR to fix some issues with my edits that Antoine pointed
> out. I can start working on a C++ patch to implement the C++ changes
> in the next few days if that helps. Given the time urgency of deciding
> what to do on this if anyone else could express opinions it would be
> helpful.
>
> I see one of the major benefits of this that when forming a sparse
> union (the one without the offsets buffer) the child bitmaps do not
> have to be spliced together to create the top-level bitmap with the
> assistance of the type ids buffer, which when you think about it is a
> rather complicated operation.
>
> On Wed, Jun 24, 2020 at 5:34 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > I drafted the specification changes that would be associated with the
> > union changes
> >
> > https://github.com/apache/arrow/pull/7535
> >
> > I'll start a separate discussion about incrementing the
> > MetadataVersion since that must be discussed independently.
> >
> > Please take a look
> >
> > On Wed, Jun 24, 2020 at 3:50 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > I should also add that we could (with some effort) use the
> > > MetadataVersion V4/V5 indicator to offer backward compatibility for
> > > old serialized union data
> > >
> > > In any case, if there is consensus about this, we would need to have a
> > > vote and get busy with implementing and testing the changes. I could
> > > assist with the C++ changes
> > >
> > > On Wed, Jun 24, 2020 at 1:14 PM Wes McKinney <we...@gmail.com> wrote:
> > > >
> > > > On Wed, Jun 24, 2020 at 1:07 PM Francois Saint-Jacques
> > > > <fs...@gmail.com> wrote:
> > > > >
> > > > > OTOH,
> > > > >
> > > > > how do we handle NullType -> UnionType<T...> cast conversion? Do we
> > > > > require some convention like the first children ArrayData null bitmap
> > > > > to be set and all tags set to 0?
> > > >
> > > > Sure, that sounds like a reasonable implementation should this
> > > > operation actually be required.
> > > >
> > > > > François
> > > > >
> > > > > On Wed, Jun 24, 2020 at 1:09 PM Antoine Pitrou <an...@python.org> wrote:
> > > > > >
> > > > > >
> > > > > > Le 24/06/2020 à 18:34, Wes McKinney a écrit :
> > > > > > > On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou <an...@python.org> wrote:
> > > > > > >>
> > > > > > >>
> > > > > > >> Le 24/06/2020 à 16:57, Wes McKinney a écrit :
> > > > > > >>> hi folks,
> > > > > > >>>
> > > > > > >>> As discussed on the recent GitHub PR [1], as a means of reconciling
> > > > > > >>> the long-standing cross-implementation incompatibilities with Union
> > > > > > >>> types, it's been proposed to remove the top-level validity bitmap from
> > > > > > >>> the Union data layout and let validity be determined exclusively by
> > > > > > >>> the child arrays of the union. So the only additional data needed to
> > > > > > >>> form a union are the type ids (and for the dense union, the offsets).
> > > > > > >>>
> > > > > > >>> I do not think this change meaningfully alters the semantics of Union
> > > > > > >>> types and I think it also simplifies their construction, so I would be
> > > > > > >>> in favor of making it for 1.0.0.
> > > > > > >>
> > > > > > >> So it sounds like this may break compatibility with existing only uses
> > > > > > >> of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, Red
> > > > > > >> Arrow); not only on the API side, but on the data side.
> > > > > > >
> > > > > > > Right. However, I don't think these changes will be very disruptive,
> > > > > > > and we always knew that this disruption was possible because of the
> > > > > > > hitherto unreconciled issues with Unions. The applications that I'm
> > > > > > > aware of that use Union serialization (e.g. Ray) use it only for
> > > > > > > ephemeral serialization.
> > > > > >
> > > > > > Ok, that's a convincing argument.
> > > > > >
> > > > > > > In general, I think that we should be bumping the metadata version [1]
> > > > > > > for 1.0.0 to create a forcing function for upgrade to the
> > > > > > > format-stable line of libraries. The C++/Python libraries could have a
> > > > > > > "compatibility mode" (like the "write_legacy_ipc_format" options) that
> > > > > > > writes MetadataVersion::V4 (v0.8.0 -> v0.17.1) with certain features
> > > > > > > (like unions -- which are not needed for Spark for example) disabled.
> > > > > >
> > > > > > Hmm, I hope we can keep the negotiation minimal.  We should take from
> > > > > > the Jon Postel principle - be liberal in what you accept, strict in what
> > > > > > you emit.
> > > > > >
> > > > > > So the IPC reader can have a simple detection that goes this way:
> > > > > >
> > > > > >   * if we receive 1 buffer for sparse union or 2 buffers for dense union
> > > > > > => it's the new-style format, there's nothing to do
> > > > > >
> > > > > >   * if we receive 2 (non-null) buffers for sparse union or 3 (non-null)
> > > > > > buffers for dense union
> > > > > > => it's the old format, we should AND the parent bitmap into each of the
> > > > > > child bitmaps
> > > > > >
> > > > > > We can also add a flag to IpcOptions to enable/disable compatibility tricks.
> > > > > >
> > > > > > Regards
> > > > > >
> > > > > > Antoine.

Re: [DISCUSS] Removing top-level validity bitmap from Union type

Posted by Wes McKinney <we...@gmail.com>.
I updated the PR to fix some issues with my edits that Antoine pointed
out. I can start working on a C++ patch to implement the C++ changes
in the next few days if that helps. Given the time urgency of deciding
what to do on this if anyone else could express opinions it would be
helpful.

I see one of the major benefits of this that when forming a sparse
union (the one without the offsets buffer) the child bitmaps do not
have to be spliced together to create the top-level bitmap with the
assistance of the type ids buffer, which when you think about it is a
rather complicated operation.

On Wed, Jun 24, 2020 at 5:34 PM Wes McKinney <we...@gmail.com> wrote:
>
> I drafted the specification changes that would be associated with the
> union changes
>
> https://github.com/apache/arrow/pull/7535
>
> I'll start a separate discussion about incrementing the
> MetadataVersion since that must be discussed independently.
>
> Please take a look
>
> On Wed, Jun 24, 2020 at 3:50 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > I should also add that we could (with some effort) use the
> > MetadataVersion V4/V5 indicator to offer backward compatibility for
> > old serialized union data
> >
> > In any case, if there is consensus about this, we would need to have a
> > vote and get busy with implementing and testing the changes. I could
> > assist with the C++ changes
> >
> > On Wed, Jun 24, 2020 at 1:14 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > On Wed, Jun 24, 2020 at 1:07 PM Francois Saint-Jacques
> > > <fs...@gmail.com> wrote:
> > > >
> > > > OTOH,
> > > >
> > > > how do we handle NullType -> UnionType<T...> cast conversion? Do we
> > > > require some convention like the first children ArrayData null bitmap
> > > > to be set and all tags set to 0?
> > >
> > > Sure, that sounds like a reasonable implementation should this
> > > operation actually be required.
> > >
> > > > François
> > > >
> > > > On Wed, Jun 24, 2020 at 1:09 PM Antoine Pitrou <an...@python.org> wrote:
> > > > >
> > > > >
> > > > > Le 24/06/2020 à 18:34, Wes McKinney a écrit :
> > > > > > On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou <an...@python.org> wrote:
> > > > > >>
> > > > > >>
> > > > > >> Le 24/06/2020 à 16:57, Wes McKinney a écrit :
> > > > > >>> hi folks,
> > > > > >>>
> > > > > >>> As discussed on the recent GitHub PR [1], as a means of reconciling
> > > > > >>> the long-standing cross-implementation incompatibilities with Union
> > > > > >>> types, it's been proposed to remove the top-level validity bitmap from
> > > > > >>> the Union data layout and let validity be determined exclusively by
> > > > > >>> the child arrays of the union. So the only additional data needed to
> > > > > >>> form a union are the type ids (and for the dense union, the offsets).
> > > > > >>>
> > > > > >>> I do not think this change meaningfully alters the semantics of Union
> > > > > >>> types and I think it also simplifies their construction, so I would be
> > > > > >>> in favor of making it for 1.0.0.
> > > > > >>
> > > > > >> So it sounds like this may break compatibility with existing only uses
> > > > > >> of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, Red
> > > > > >> Arrow); not only on the API side, but on the data side.
> > > > > >
> > > > > > Right. However, I don't think these changes will be very disruptive,
> > > > > > and we always knew that this disruption was possible because of the
> > > > > > hitherto unreconciled issues with Unions. The applications that I'm
> > > > > > aware of that use Union serialization (e.g. Ray) use it only for
> > > > > > ephemeral serialization.
> > > > >
> > > > > Ok, that's a convincing argument.
> > > > >
> > > > > > In general, I think that we should be bumping the metadata version [1]
> > > > > > for 1.0.0 to create a forcing function for upgrade to the
> > > > > > format-stable line of libraries. The C++/Python libraries could have a
> > > > > > "compatibility mode" (like the "write_legacy_ipc_format" options) that
> > > > > > writes MetadataVersion::V4 (v0.8.0 -> v0.17.1) with certain features
> > > > > > (like unions -- which are not needed for Spark for example) disabled.
> > > > >
> > > > > Hmm, I hope we can keep the negotiation minimal.  We should take from
> > > > > the Jon Postel principle - be liberal in what you accept, strict in what
> > > > > you emit.
> > > > >
> > > > > So the IPC reader can have a simple detection that goes this way:
> > > > >
> > > > >   * if we receive 1 buffer for sparse union or 2 buffers for dense union
> > > > > => it's the new-style format, there's nothing to do
> > > > >
> > > > >   * if we receive 2 (non-null) buffers for sparse union or 3 (non-null)
> > > > > buffers for dense union
> > > > > => it's the old format, we should AND the parent bitmap into each of the
> > > > > child bitmaps
> > > > >
> > > > > We can also add a flag to IpcOptions to enable/disable compatibility tricks.
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.

Re: [DISCUSS] Removing top-level validity bitmap from Union type

Posted by Wes McKinney <we...@gmail.com>.
I drafted the specification changes that would be associated with the
union changes

https://github.com/apache/arrow/pull/7535

I'll start a separate discussion about incrementing the
MetadataVersion since that must be discussed independently.

Please take a look

On Wed, Jun 24, 2020 at 3:50 PM Wes McKinney <we...@gmail.com> wrote:
>
> I should also add that we could (with some effort) use the
> MetadataVersion V4/V5 indicator to offer backward compatibility for
> old serialized union data
>
> In any case, if there is consensus about this, we would need to have a
> vote and get busy with implementing and testing the changes. I could
> assist with the C++ changes
>
> On Wed, Jun 24, 2020 at 1:14 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > On Wed, Jun 24, 2020 at 1:07 PM Francois Saint-Jacques
> > <fs...@gmail.com> wrote:
> > >
> > > OTOH,
> > >
> > > how do we handle NullType -> UnionType<T...> cast conversion? Do we
> > > require some convention like the first children ArrayData null bitmap
> > > to be set and all tags set to 0?
> >
> > Sure, that sounds like a reasonable implementation should this
> > operation actually be required.
> >
> > > François
> > >
> > > On Wed, Jun 24, 2020 at 1:09 PM Antoine Pitrou <an...@python.org> wrote:
> > > >
> > > >
> > > > Le 24/06/2020 à 18:34, Wes McKinney a écrit :
> > > > > On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou <an...@python.org> wrote:
> > > > >>
> > > > >>
> > > > >> Le 24/06/2020 à 16:57, Wes McKinney a écrit :
> > > > >>> hi folks,
> > > > >>>
> > > > >>> As discussed on the recent GitHub PR [1], as a means of reconciling
> > > > >>> the long-standing cross-implementation incompatibilities with Union
> > > > >>> types, it's been proposed to remove the top-level validity bitmap from
> > > > >>> the Union data layout and let validity be determined exclusively by
> > > > >>> the child arrays of the union. So the only additional data needed to
> > > > >>> form a union are the type ids (and for the dense union, the offsets).
> > > > >>>
> > > > >>> I do not think this change meaningfully alters the semantics of Union
> > > > >>> types and I think it also simplifies their construction, so I would be
> > > > >>> in favor of making it for 1.0.0.
> > > > >>
> > > > >> So it sounds like this may break compatibility with existing only uses
> > > > >> of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, Red
> > > > >> Arrow); not only on the API side, but on the data side.
> > > > >
> > > > > Right. However, I don't think these changes will be very disruptive,
> > > > > and we always knew that this disruption was possible because of the
> > > > > hitherto unreconciled issues with Unions. The applications that I'm
> > > > > aware of that use Union serialization (e.g. Ray) use it only for
> > > > > ephemeral serialization.
> > > >
> > > > Ok, that's a convincing argument.
> > > >
> > > > > In general, I think that we should be bumping the metadata version [1]
> > > > > for 1.0.0 to create a forcing function for upgrade to the
> > > > > format-stable line of libraries. The C++/Python libraries could have a
> > > > > "compatibility mode" (like the "write_legacy_ipc_format" options) that
> > > > > writes MetadataVersion::V4 (v0.8.0 -> v0.17.1) with certain features
> > > > > (like unions -- which are not needed for Spark for example) disabled.
> > > >
> > > > Hmm, I hope we can keep the negotiation minimal.  We should take from
> > > > the Jon Postel principle - be liberal in what you accept, strict in what
> > > > you emit.
> > > >
> > > > So the IPC reader can have a simple detection that goes this way:
> > > >
> > > >   * if we receive 1 buffer for sparse union or 2 buffers for dense union
> > > > => it's the new-style format, there's nothing to do
> > > >
> > > >   * if we receive 2 (non-null) buffers for sparse union or 3 (non-null)
> > > > buffers for dense union
> > > > => it's the old format, we should AND the parent bitmap into each of the
> > > > child bitmaps
> > > >
> > > > We can also add a flag to IpcOptions to enable/disable compatibility tricks.
> > > >
> > > > Regards
> > > >
> > > > Antoine.

Re: [DISCUSS] Removing top-level validity bitmap from Union type

Posted by Wes McKinney <we...@gmail.com>.
I should also add that we could (with some effort) use the
MetadataVersion V4/V5 indicator to offer backward compatibility for
old serialized union data

In any case, if there is consensus about this, we would need to have a
vote and get busy with implementing and testing the changes. I could
assist with the C++ changes

On Wed, Jun 24, 2020 at 1:14 PM Wes McKinney <we...@gmail.com> wrote:
>
> On Wed, Jun 24, 2020 at 1:07 PM Francois Saint-Jacques
> <fs...@gmail.com> wrote:
> >
> > OTOH,
> >
> > how do we handle NullType -> UnionType<T...> cast conversion? Do we
> > require some convention like the first children ArrayData null bitmap
> > to be set and all tags set to 0?
>
> Sure, that sounds like a reasonable implementation should this
> operation actually be required.
>
> > François
> >
> > On Wed, Jun 24, 2020 at 1:09 PM Antoine Pitrou <an...@python.org> wrote:
> > >
> > >
> > > Le 24/06/2020 à 18:34, Wes McKinney a écrit :
> > > > On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou <an...@python.org> wrote:
> > > >>
> > > >>
> > > >> Le 24/06/2020 à 16:57, Wes McKinney a écrit :
> > > >>> hi folks,
> > > >>>
> > > >>> As discussed on the recent GitHub PR [1], as a means of reconciling
> > > >>> the long-standing cross-implementation incompatibilities with Union
> > > >>> types, it's been proposed to remove the top-level validity bitmap from
> > > >>> the Union data layout and let validity be determined exclusively by
> > > >>> the child arrays of the union. So the only additional data needed to
> > > >>> form a union are the type ids (and for the dense union, the offsets).
> > > >>>
> > > >>> I do not think this change meaningfully alters the semantics of Union
> > > >>> types and I think it also simplifies their construction, so I would be
> > > >>> in favor of making it for 1.0.0.
> > > >>
> > > >> So it sounds like this may break compatibility with existing only uses
> > > >> of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, Red
> > > >> Arrow); not only on the API side, but on the data side.
> > > >
> > > > Right. However, I don't think these changes will be very disruptive,
> > > > and we always knew that this disruption was possible because of the
> > > > hitherto unreconciled issues with Unions. The applications that I'm
> > > > aware of that use Union serialization (e.g. Ray) use it only for
> > > > ephemeral serialization.
> > >
> > > Ok, that's a convincing argument.
> > >
> > > > In general, I think that we should be bumping the metadata version [1]
> > > > for 1.0.0 to create a forcing function for upgrade to the
> > > > format-stable line of libraries. The C++/Python libraries could have a
> > > > "compatibility mode" (like the "write_legacy_ipc_format" options) that
> > > > writes MetadataVersion::V4 (v0.8.0 -> v0.17.1) with certain features
> > > > (like unions -- which are not needed for Spark for example) disabled.
> > >
> > > Hmm, I hope we can keep the negotiation minimal.  We should take from
> > > the Jon Postel principle - be liberal in what you accept, strict in what
> > > you emit.
> > >
> > > So the IPC reader can have a simple detection that goes this way:
> > >
> > >   * if we receive 1 buffer for sparse union or 2 buffers for dense union
> > > => it's the new-style format, there's nothing to do
> > >
> > >   * if we receive 2 (non-null) buffers for sparse union or 3 (non-null)
> > > buffers for dense union
> > > => it's the old format, we should AND the parent bitmap into each of the
> > > child bitmaps
> > >
> > > We can also add a flag to IpcOptions to enable/disable compatibility tricks.
> > >
> > > Regards
> > >
> > > Antoine.

Re: [DISCUSS] Removing top-level validity bitmap from Union type

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Jun 24, 2020 at 1:07 PM Francois Saint-Jacques
<fs...@gmail.com> wrote:
>
> OTOH,
>
> how do we handle NullType -> UnionType<T...> cast conversion? Do we
> require some convention like the first children ArrayData null bitmap
> to be set and all tags set to 0?

Sure, that sounds like a reasonable implementation should this
operation actually be required.

> François
>
> On Wed, Jun 24, 2020 at 1:09 PM Antoine Pitrou <an...@python.org> wrote:
> >
> >
> > Le 24/06/2020 à 18:34, Wes McKinney a écrit :
> > > On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou <an...@python.org> wrote:
> > >>
> > >>
> > >> Le 24/06/2020 à 16:57, Wes McKinney a écrit :
> > >>> hi folks,
> > >>>
> > >>> As discussed on the recent GitHub PR [1], as a means of reconciling
> > >>> the long-standing cross-implementation incompatibilities with Union
> > >>> types, it's been proposed to remove the top-level validity bitmap from
> > >>> the Union data layout and let validity be determined exclusively by
> > >>> the child arrays of the union. So the only additional data needed to
> > >>> form a union are the type ids (and for the dense union, the offsets).
> > >>>
> > >>> I do not think this change meaningfully alters the semantics of Union
> > >>> types and I think it also simplifies their construction, so I would be
> > >>> in favor of making it for 1.0.0.
> > >>
> > >> So it sounds like this may break compatibility with existing only uses
> > >> of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, Red
> > >> Arrow); not only on the API side, but on the data side.
> > >
> > > Right. However, I don't think these changes will be very disruptive,
> > > and we always knew that this disruption was possible because of the
> > > hitherto unreconciled issues with Unions. The applications that I'm
> > > aware of that use Union serialization (e.g. Ray) use it only for
> > > ephemeral serialization.
> >
> > Ok, that's a convincing argument.
> >
> > > In general, I think that we should be bumping the metadata version [1]
> > > for 1.0.0 to create a forcing function for upgrade to the
> > > format-stable line of libraries. The C++/Python libraries could have a
> > > "compatibility mode" (like the "write_legacy_ipc_format" options) that
> > > writes MetadataVersion::V4 (v0.8.0 -> v0.17.1) with certain features
> > > (like unions -- which are not needed for Spark for example) disabled.
> >
> > Hmm, I hope we can keep the negotiation minimal.  We should take from
> > the Jon Postel principle - be liberal in what you accept, strict in what
> > you emit.
> >
> > So the IPC reader can have a simple detection that goes this way:
> >
> >   * if we receive 1 buffer for sparse union or 2 buffers for dense union
> > => it's the new-style format, there's nothing to do
> >
> >   * if we receive 2 (non-null) buffers for sparse union or 3 (non-null)
> > buffers for dense union
> > => it's the old format, we should AND the parent bitmap into each of the
> > child bitmaps
> >
> > We can also add a flag to IpcOptions to enable/disable compatibility tricks.
> >
> > Regards
> >
> > Antoine.

Re: [DISCUSS] Removing top-level validity bitmap from Union type

Posted by Francois Saint-Jacques <fs...@gmail.com>.
OTOH,

how do we handle NullType -> UnionType<T...> cast conversion? Do we
require some convention like the first children ArrayData null bitmap
to be set and all tags set to 0?

François

On Wed, Jun 24, 2020 at 1:09 PM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 24/06/2020 à 18:34, Wes McKinney a écrit :
> > On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou <an...@python.org> wrote:
> >>
> >>
> >> Le 24/06/2020 à 16:57, Wes McKinney a écrit :
> >>> hi folks,
> >>>
> >>> As discussed on the recent GitHub PR [1], as a means of reconciling
> >>> the long-standing cross-implementation incompatibilities with Union
> >>> types, it's been proposed to remove the top-level validity bitmap from
> >>> the Union data layout and let validity be determined exclusively by
> >>> the child arrays of the union. So the only additional data needed to
> >>> form a union are the type ids (and for the dense union, the offsets).
> >>>
> >>> I do not think this change meaningfully alters the semantics of Union
> >>> types and I think it also simplifies their construction, so I would be
> >>> in favor of making it for 1.0.0.
> >>
> >> So it sounds like this may break compatibility with existing only uses
> >> of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, Red
> >> Arrow); not only on the API side, but on the data side.
> >
> > Right. However, I don't think these changes will be very disruptive,
> > and we always knew that this disruption was possible because of the
> > hitherto unreconciled issues with Unions. The applications that I'm
> > aware of that use Union serialization (e.g. Ray) use it only for
> > ephemeral serialization.
>
> Ok, that's a convincing argument.
>
> > In general, I think that we should be bumping the metadata version [1]
> > for 1.0.0 to create a forcing function for upgrade to the
> > format-stable line of libraries. The C++/Python libraries could have a
> > "compatibility mode" (like the "write_legacy_ipc_format" options) that
> > writes MetadataVersion::V4 (v0.8.0 -> v0.17.1) with certain features
> > (like unions -- which are not needed for Spark for example) disabled.
>
> Hmm, I hope we can keep the negotiation minimal.  We should take from
> the Jon Postel principle - be liberal in what you accept, strict in what
> you emit.
>
> So the IPC reader can have a simple detection that goes this way:
>
>   * if we receive 1 buffer for sparse union or 2 buffers for dense union
> => it's the new-style format, there's nothing to do
>
>   * if we receive 2 (non-null) buffers for sparse union or 3 (non-null)
> buffers for dense union
> => it's the old format, we should AND the parent bitmap into each of the
> child bitmaps
>
> We can also add a flag to IpcOptions to enable/disable compatibility tricks.
>
> Regards
>
> Antoine.

Re: [DISCUSS] Removing top-level validity bitmap from Union type

Posted by Antoine Pitrou <an...@python.org>.
Le 24/06/2020 à 18:34, Wes McKinney a écrit :
> On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou <an...@python.org> wrote:
>>
>>
>> Le 24/06/2020 à 16:57, Wes McKinney a écrit :
>>> hi folks,
>>>
>>> As discussed on the recent GitHub PR [1], as a means of reconciling
>>> the long-standing cross-implementation incompatibilities with Union
>>> types, it's been proposed to remove the top-level validity bitmap from
>>> the Union data layout and let validity be determined exclusively by
>>> the child arrays of the union. So the only additional data needed to
>>> form a union are the type ids (and for the dense union, the offsets).
>>>
>>> I do not think this change meaningfully alters the semantics of Union
>>> types and I think it also simplifies their construction, so I would be
>>> in favor of making it for 1.0.0.
>>
>> So it sounds like this may break compatibility with existing only uses
>> of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, Red
>> Arrow); not only on the API side, but on the data side.
> 
> Right. However, I don't think these changes will be very disruptive,
> and we always knew that this disruption was possible because of the
> hitherto unreconciled issues with Unions. The applications that I'm
> aware of that use Union serialization (e.g. Ray) use it only for
> ephemeral serialization.

Ok, that's a convincing argument.

> In general, I think that we should be bumping the metadata version [1]
> for 1.0.0 to create a forcing function for upgrade to the
> format-stable line of libraries. The C++/Python libraries could have a
> "compatibility mode" (like the "write_legacy_ipc_format" options) that
> writes MetadataVersion::V4 (v0.8.0 -> v0.17.1) with certain features
> (like unions -- which are not needed for Spark for example) disabled.

Hmm, I hope we can keep the negotiation minimal.  We should take from
the Jon Postel principle - be liberal in what you accept, strict in what
you emit.

So the IPC reader can have a simple detection that goes this way:

  * if we receive 1 buffer for sparse union or 2 buffers for dense union
=> it's the new-style format, there's nothing to do

  * if we receive 2 (non-null) buffers for sparse union or 3 (non-null)
buffers for dense union
=> it's the old format, we should AND the parent bitmap into each of the
child bitmaps

We can also add a flag to IpcOptions to enable/disable compatibility tricks.

Regards

Antoine.

Re: [DISCUSS] Removing top-level validity bitmap from Union type

Posted by Wes McKinney <we...@gmail.com>.
On Wed, Jun 24, 2020 at 11:08 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 24/06/2020 à 16:57, Wes McKinney a écrit :
> > hi folks,
> >
> > As discussed on the recent GitHub PR [1], as a means of reconciling
> > the long-standing cross-implementation incompatibilities with Union
> > types, it's been proposed to remove the top-level validity bitmap from
> > the Union data layout and let validity be determined exclusively by
> > the child arrays of the union. So the only additional data needed to
> > form a union are the type ids (and for the dense union, the offsets).
> >
> > I do not think this change meaningfully alters the semantics of Union
> > types and I think it also simplifies their construction, so I would be
> > in favor of making it for 1.0.0.
>
> So it sounds like this may break compatibility with existing only uses
> of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, Red
> Arrow); not only on the API side, but on the data side.

Right. However, I don't think these changes will be very disruptive,
and we always knew that this disruption was possible because of the
hitherto unreconciled issues with Unions. The applications that I'm
aware of that use Union serialization (e.g. Ray) use it only for
ephemeral serialization.

In general, I think that we should be bumping the metadata version [1]
for 1.0.0 to create a forcing function for upgrade to the
format-stable line of libraries. The C++/Python libraries could have a
"compatibility mode" (like the "write_legacy_ipc_format" options) that
writes MetadataVersion::V4 (v0.8.0 -> v0.17.1) with certain features
(like unions -- which are not needed for Spark for example) disabled.

[1]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22

> Regards
>
> Antoine.

Re: [DISCUSS] Removing top-level validity bitmap from Union type

Posted by Antoine Pitrou <an...@python.org>.
Le 24/06/2020 à 16:57, Wes McKinney a écrit :
> hi folks,
> 
> As discussed on the recent GitHub PR [1], as a means of reconciling
> the long-standing cross-implementation incompatibilities with Union
> types, it's been proposed to remove the top-level validity bitmap from
> the Union data layout and let validity be determined exclusively by
> the child arrays of the union. So the only additional data needed to
> form a union are the type ids (and for the dense union, the offsets).
> 
> I do not think this change meaningfully alters the semantics of Union
> types and I think it also simplifies their construction, so I would be
> in favor of making it for 1.0.0.

So it sounds like this may break compatibility with existing only uses
of Arrow C++ (and the relevant bindings: PyArrow, Arrow C/GLib, Red
Arrow); not only on the API side, but on the data side.

Regards

Antoine.

Re: [DISCUSS] Removing top-level validity bitmap from Union type

Posted by Jacques Nadeau <ja...@apache.org>.
Per my comments on the pr, I also think this is preferred. I believe we
will avoid the potential for validity inconsistency and simplify
construction of union data in most cases.

On Wed, Jun 24, 2020, 7:58 AM Wes McKinney <we...@gmail.com> wrote:

> hi folks,
>
> As discussed on the recent GitHub PR [1], as a means of reconciling
> the long-standing cross-implementation incompatibilities with Union
> types, it's been proposed to remove the top-level validity bitmap from
> the Union data layout and let validity be determined exclusively by
> the child arrays of the union. So the only additional data needed to
> form a union are the type ids (and for the dense union, the offsets).
>
> I do not think this change meaningfully alters the semantics of Union
> types and I think it also simplifies their construction, so I would be
> in favor of making it for 1.0.0.
>
> I can create a PR with the relevant alterations but wanted to raise
> the issue now so if there is consensus about doing this, that we can
> act quickly to implement it.
>
> Thanks,
> Wes
>
> [1]: https://github.com/apache/arrow/pull/7290
>