You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2019/06/30 08:00:36 UTC

[Discuss] IPC Specification, flatbuffers and unaligned memory accesses

While working on trying to fix undefined behavior for unaligned memory
accesses [1], I ran into an issue with the IPC specification [2] which
prevents us from ever achieving zero-copy memory mapping and having aligned
accesses (i.e. clean UBSan runs).

Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.

In the IPC format we align each message to 8-byte boundaries.  We then
write a int32_t integer to to denote the size of flat buffer metadata,
followed immediately  by the flatbuffer metadata.  This means the
flatbuffer metadata will never be 8 byte aligned.

Do people care?  A simple fix  would be to use int64_t instead of int32_t
for length.  However, any fix essentially breaks all previous client
library versions or incurs a memory copy.

[1] https://github.com/apache/arrow/pull/4757
[2] https://arrow.apache.org/docs/ipc.html

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

I'm going to review Micah's Format PR and propose a vote. I would like
to take on the C++ implementation of this in the near future.

On Mon, Jul 29, 2019 at 11:34 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 29/07/2019 à 18:32, Wes McKinney a écrit :
> >
> > We have a version number in the metadata already
> > https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22. I
> > don't think we need to add other signaling beyond the 0xFFFFFFFF
> > (stream continues / non-null message) / 0x00000000 (stream end / null
> > message) marker at the beginning of the IPC payload
> >
> > Unless you envision a situation where the metadata cannot be obtained.
> > I would hope that this can never occur.
>
> Fair enough.  I hope so as well. :-)
>
> Regards
>
> Antoine.

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Antoine Pitrou <an...@python.org>.

Le 29/07/2019 à 18:32, Wes McKinney a écrit :
> 
> We have a version number in the metadata already
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22. I
> don't think we need to add other signaling beyond the 0xFFFFFFFF
> (stream continues / non-null message) / 0x00000000 (stream end / null
> message) marker at the beginning of the IPC payload
> 
> Unless you envision a situation where the metadata cannot be obtained.
> I would hope that this can never occur.

Fair enough.  I hope so as well. :-)

Regards

Antoine.

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

On Mon, Jul 29, 2019 at 10:25 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 26/07/2019 à 07:27, Micah Kornfield a écrit :
> > I started PR https://github.com/apache/arrow/pull/4951 to capture the
> > proposed change.
> >
> > I would like to verify that with this change all alignment issues are
> > solved.
> >
> > It seems like there is mostly consensus on having another minor release to
> > get this resolved.
>
> You mean 0.15.0, right?
>
> > I think the open questions I have:
> > 1.  Do we want to provide backwards compatibility for writers?
>
> IMHO, yes.  How many APIs would we have to add a flag to?
>

To reduce the burden of maintaining such flags I would suggest
introducing an IpcOptions struct in C++ to APIs that can produce
messages. There are some other kinds of IPC configuration that might
be useful in the future, such as switching between 8-byte (current)
and 64-byte padding.

> Or do we need a more general "provide_compatibility_with_version" option
> on writers, for future format changes?
>
> > 2.  Similar to the file format, should there be some leading "magic"
> > indicator in the stream format to indicate a version, so we can do
> > something a little less ugly if we encounter more issues?
>
> At least a version number.  Perhaps also a magic number, I don't know.
>

We have a version number in the metadata already
https://github.com/apache/arrow/blob/master/format/Schema.fbs#L22. I
don't think we need to add other signaling beyond the 0xFFFFFFFF
(stream continues / non-null message) / 0x00000000 (stream end / null
message) marker at the beginning of the IPC payload

Unless you envision a situation where the metadata cannot be obtained.
I would hope that this can never occur.

> Regards
>
> Antoine.

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Antoine Pitrou <an...@python.org>.

Le 26/07/2019 à 07:27, Micah Kornfield a écrit :
> I started PR https://github.com/apache/arrow/pull/4951 to capture the
> proposed change.
> 
> I would like to verify that with this change all alignment issues are
> solved.
> 
> It seems like there is mostly consensus on having another minor release to
> get this resolved.

You mean 0.15.0, right?

> I think the open questions I have:
> 1.  Do we want to provide backwards compatibility for writers?

IMHO, yes.  How many APIs would we have to add a flag to?

Or do we need a more general "provide_compatibility_with_version" option
on writers, for future format changes?

> 2.  Similar to the file format, should there be some leading "magic"
> indicator in the stream format to indicate a version, so we can do
> something a little less ugly if we encounter more issues?

At least a version number.  Perhaps also a magic number, I don't know.

Regards

Antoine.

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Micah Kornfield <em...@gmail.com>.

I started PR https://github.com/apache/arrow/pull/4951 to capture the
proposed change.

I would like to verify that with this change all alignment issues are
solved.

It seems like there is mostly consensus on having another minor release to
get this resolved.  I think the open questions I have:
1.  Do we want to provide backwards compatibility for writers?
2.  Similar to the file format, should there be some leading "magic"
indicator in the stream format to indicate a version, so we can do
something a little less ugly if we encounter more issues?

On Thu, Jul 25, 2019 at 11:05 AM Wes McKinney <we...@gmail.com> wrote:

> We probably need to have a vote here. Either Micah or I can propose
> changes to
>
>
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
>
> to indicate the new binary protocol and the backwards/forward
> compatibility strategy
>
> Any other thoughts about this before a vote is proposed (along with a
> corresponding PR to change the format document)?
>
> AFAIK there are a few implementations of the protocol that will need
> development work around this:
>
> - C++
> - C#
> - Java
> - JavaScript
> - Go
>
> (has Rust implemented this yet?)
>
> The scope of changes in each case should be relatively isolated. In
> the case of C++, if we want to have the ability to emit backwards
> compatible messages (for old readers, e.g. <= 0.14.0, this is a good
> idea anyway for unit testing) then a bit more refactoring will be
> required to introduce the appropriate option. We could also expand the
> integration tests around this issue as needed
>
> On Fri, Jul 19, 2019 at 8:53 AM David Li <li...@gmail.com> wrote:
> >
> > I agree with Bryan and Micah - a gradual transition as part of 1.0 (or
> > 0.15.0) would be much less painful for us than staying on pre-1.0
> > until we can upgrade everything using Arrow at once. It is kind of a
> > 'have your cake and eat it too' situation, and it would be a
> > maintenance burden, but something like what Micah proposes would be
> > ideal.
> >
> > Thanks,
> > David
> >
> > On 7/19/19, Micah Kornfield <em...@gmail.com> wrote:
> > > I'm trying to work out the exact steps in my mind for a migration. It
> seems
> > > like one approach is:
> > >
> > > 1.  Add a code change which throws a clear exception it encounters -1
> for
> > > size.  In java the reasonable place seems to be at [1] (there might be
> > > more?).   The exception should state that the current stream reader
> isn't
> > > compatible with version 1.0.0 streams (we should have similar ones in
> each
> > > language).  We can add a note about the environment variable in 2 if we
> > > decide to do it.  Release this change as 0.15.0 or 0.14.2 and ensure at
> > > least Spark upgrades to this version.
> > >
> > > 2.  Change the reader implementation to support reading both 1.0.0
> streams
> > > and be backwards compatible with pre-1.0.0 streams.  Change the writer
> > > implementation to default to writing 1.0.0 streams but have an
> environment
> > > variable that make it write backwards compatible streams (writer
> > > compatibility seems like it should be optional).  Release this as 1.0.0
> > >
> > > 3. If provided, remove the environment variable switch in a later
> release.
> > >
> > > Thanks,
> > > Micah
> > >
> > > [1]
> > >
> https://github.com/apache/arrow/blob/9fe728c86caaf9ceb1827159eb172ff81fb98550/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageChannelReader.java#L67
> > >
> > > On Thu, Jul 18, 2019 at 8:58 PM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > >> To be clear, we could make a patch 0.14.x release that includes the
> > >> necessary compatibility changes. I presume Spark will be able to
> upgrade
> > >> to
> > >> a new patch release (I'd be surprised if not, otherwise how can you
> get
> > >> security fixes)?
> > >>
> > >> On Thu, Jul 18, 2019, 10:52 PM Bryan Cutler <cu...@gmail.com>
> wrote:
> > >>
> > >> > Hey Wes,
> > >> > I understand we don't want to burden 1.0 by maintaining
> compatibility
> > >> > and
> > >> > that is fine with me. I'm just try to figure out how to best handle
> > >> > this
> > >> > situation so Spark users won't get a cryptic error message. It
> sounds
> > >> like
> > >> > it will need to be handled on the Spark side to not allow mixing 1.0
> > >> > and
> > >> > pre-1.0 versions. I'm not too sure how much a 0.15.0 release with
> > >> > compatibility would help, it might depend on when things get
> released
> > >> > but
> > >> > we can discuss that in another thread.
> > >> >
> > >> > On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney <we...@gmail.com>
> > >> wrote:
> > >> >
> > >> > > hi Bryan -- well, the reason for the current 0.x version is
> precisely
> > >> > > to avoid a situation where we are making decisions on the basis of
> > >> > > maintaining forward / backward compatibility.
> > >> > >
> > >> > > One possible way forward on this is to make a 0.15.0 (0.14.2, so
> > >> > > there
> > >> > > is less trouble for Spark to upgrade) release that supports
> reading
> > >> > > _both_ old and new variants of the protocol.
> > >> > >
> > >> > > On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler <cu...@gmail.com>
> > >> wrote:
> > >> > > >
> > >> > > > Are we going to say that Arrow 1.0 is not compatible with any
> > >> > > > version
> > >> > > > before?  My concern is that Spark 2.4.x might get stuck on Arrow
> > >> > > > Java
> > >> > > > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will
> > >> > > > not
> > >> > > work.
> > >> > > > In Spark 3.0.0, though it will be no problem to update both Java
> > >> > > > and
> > >> > > Python
> > >> > > > to 1.0. Having a compatibility mode so that new readers/writers
> can
> > >> > work
> > >> > > > with old readers using a 4-byte prefix would solve the problem,
> but
> > >> if
> > >> > we
> > >> > > > don't want to do this will pyarrow be able to raise an error
> that
> > >> > clearly
> > >> > > > the new version does not support the old protocol?  For example,
> > >> would
> > >> > a
> > >> > > > pyarrow reader see the 0xFFFFFFFF and raise something like
> "PyArrow
> > >> > > > detected an old protocol and cannot continue, please use a
> version
> > >> > > > <
> > >> > > 1.0.0"?
> > >> > > >
> > >> > > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney <
> wesmckinn@gmail.com>
> > >> > > wrote:
> > >> > > >
> > >> > > > > Hi Francois -- copying the metadata into memory isn't the end
> of
> > >> the
> > >> > > world
> > >> > > > > but it's a pretty ugly wart. This affects every IPC protocol
> > >> message
> > >> > > > > everywhere.
> > >> > > > >
> > >> > > > > We have an opportunity to address the wart now but such a fix
> > >> > > post-1.0.0
> > >> > > > > will be much more difficult.
> > >> > > > >
> > >> > > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> > >> > > > > fsaintjacques@gmail.com> wrote:
> > >> > > > >
> > >> > > > > > If the data buffers are still aligned, then I don't think we
> > >> should
> > >> > > > > > add a breaking change just for avoiding the copy on the
> > >> > > > > > metadata?
> > >> > I'd
> > >> > > > > > expect said metadata to be small enough that zero-copy
> doesn't
> > >> > really
> > >> > > > > > affect performance.
> > >> > > > > >
> > >> > > > > > François
> > >> > > > > >
> > >> > > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <
> > >> > > emkornfield@gmail.com>
> > >> > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > While working on trying to fix undefined behavior for
> > >> > > > > > > unaligned
> > >> > > memory
> > >> > > > > > > accesses [1], I ran into an issue with the IPC
> specification
> > >> [2]
> > >> > > which
> > >> > > > > > > prevents us from ever achieving zero-copy memory mapping
> and
> > >> > having
> > >> > > > > > aligned
> > >> > > > > > > accesses (i.e. clean UBSan runs).
> > >> > > > > > >
> > >> > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee
> > >> > > > > > > aligned
> > >> > > > > accesses.
> > >> > > > > > >
> > >> > > > > > > In the IPC format we align each message to 8-byte
> boundaries.
> > >> We
> > >> > > then
> > >> > > > > > > write a int32_t integer to to denote the size of flat
> buffer
> > >> > > metadata,
> > >> > > > > > > followed immediately  by the flatbuffer metadata.  This
> means
> > >> the
> > >> > > > > > > flatbuffer metadata will never be 8 byte aligned.
> > >> > > > > > >
> > >> > > > > > > Do people care?  A simple fix  would be to use int64_t
> > >> > > > > > > instead
> > >> of
> > >> > > > > int32_t
> > >> > > > > > > for length.  However, any fix essentially breaks all
> previous
> > >> > > client
> > >> > > > > > > library versions or incurs a memory copy.
> > >> > > > > > >
> > >> > > > > > > [1] https://github.com/apache/arrow/pull/4757
> > >> > > > > > > [2] https://arrow.apache.org/docs/ipc.html
> > >> > > > > >
> > >> > > > >
> > >> > >
> > >> >
> > >>
> > >
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

We probably need to have a vote here. Either Micah or I can propose changes to

https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format

to indicate the new binary protocol and the backwards/forward
compatibility strategy

Any other thoughts about this before a vote is proposed (along with a
corresponding PR to change the format document)?

AFAIK there are a few implementations of the protocol that will need
development work around this:

- C++
- C#
- Java
- JavaScript
- Go

(has Rust implemented this yet?)

The scope of changes in each case should be relatively isolated. In
the case of C++, if we want to have the ability to emit backwards
compatible messages (for old readers, e.g. <= 0.14.0, this is a good
idea anyway for unit testing) then a bit more refactoring will be
required to introduce the appropriate option. We could also expand the
integration tests around this issue as needed

On Fri, Jul 19, 2019 at 8:53 AM David Li <li...@gmail.com> wrote:
>
> I agree with Bryan and Micah - a gradual transition as part of 1.0 (or
> 0.15.0) would be much less painful for us than staying on pre-1.0
> until we can upgrade everything using Arrow at once. It is kind of a
> 'have your cake and eat it too' situation, and it would be a
> maintenance burden, but something like what Micah proposes would be
> ideal.
>
> Thanks,
> David
>
> On 7/19/19, Micah Kornfield <em...@gmail.com> wrote:
> > I'm trying to work out the exact steps in my mind for a migration. It seems
> > like one approach is:
> >
> > 1.  Add a code change which throws a clear exception it encounters -1 for
> > size.  In java the reasonable place seems to be at [1] (there might be
> > more?).   The exception should state that the current stream reader isn't
> > compatible with version 1.0.0 streams (we should have similar ones in each
> > language).  We can add a note about the environment variable in 2 if we
> > decide to do it.  Release this change as 0.15.0 or 0.14.2 and ensure at
> > least Spark upgrades to this version.
> >
> > 2.  Change the reader implementation to support reading both 1.0.0 streams
> > and be backwards compatible with pre-1.0.0 streams.  Change the writer
> > implementation to default to writing 1.0.0 streams but have an environment
> > variable that make it write backwards compatible streams (writer
> > compatibility seems like it should be optional).  Release this as 1.0.0
> >
> > 3. If provided, remove the environment variable switch in a later release.
> >
> > Thanks,
> > Micah
> >
> > [1]
> > https://github.com/apache/arrow/blob/9fe728c86caaf9ceb1827159eb172ff81fb98550/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageChannelReader.java#L67
> >
> > On Thu, Jul 18, 2019 at 8:58 PM Wes McKinney <we...@gmail.com> wrote:
> >
> >> To be clear, we could make a patch 0.14.x release that includes the
> >> necessary compatibility changes. I presume Spark will be able to upgrade
> >> to
> >> a new patch release (I'd be surprised if not, otherwise how can you get
> >> security fixes)?
> >>
> >> On Thu, Jul 18, 2019, 10:52 PM Bryan Cutler <cu...@gmail.com> wrote:
> >>
> >> > Hey Wes,
> >> > I understand we don't want to burden 1.0 by maintaining compatibility
> >> > and
> >> > that is fine with me. I'm just try to figure out how to best handle
> >> > this
> >> > situation so Spark users won't get a cryptic error message. It sounds
> >> like
> >> > it will need to be handled on the Spark side to not allow mixing 1.0
> >> > and
> >> > pre-1.0 versions. I'm not too sure how much a 0.15.0 release with
> >> > compatibility would help, it might depend on when things get released
> >> > but
> >> > we can discuss that in another thread.
> >> >
> >> > On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney <we...@gmail.com>
> >> wrote:
> >> >
> >> > > hi Bryan -- well, the reason for the current 0.x version is precisely
> >> > > to avoid a situation where we are making decisions on the basis of
> >> > > maintaining forward / backward compatibility.
> >> > >
> >> > > One possible way forward on this is to make a 0.15.0 (0.14.2, so
> >> > > there
> >> > > is less trouble for Spark to upgrade) release that supports reading
> >> > > _both_ old and new variants of the protocol.
> >> > >
> >> > > On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler <cu...@gmail.com>
> >> wrote:
> >> > > >
> >> > > > Are we going to say that Arrow 1.0 is not compatible with any
> >> > > > version
> >> > > > before?  My concern is that Spark 2.4.x might get stuck on Arrow
> >> > > > Java
> >> > > > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will
> >> > > > not
> >> > > work.
> >> > > > In Spark 3.0.0, though it will be no problem to update both Java
> >> > > > and
> >> > > Python
> >> > > > to 1.0. Having a compatibility mode so that new readers/writers can
> >> > work
> >> > > > with old readers using a 4-byte prefix would solve the problem, but
> >> if
> >> > we
> >> > > > don't want to do this will pyarrow be able to raise an error that
> >> > clearly
> >> > > > the new version does not support the old protocol?  For example,
> >> would
> >> > a
> >> > > > pyarrow reader see the 0xFFFFFFFF and raise something like "PyArrow
> >> > > > detected an old protocol and cannot continue, please use a version
> >> > > > <
> >> > > 1.0.0"?
> >> > > >
> >> > > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney <we...@gmail.com>
> >> > > wrote:
> >> > > >
> >> > > > > Hi Francois -- copying the metadata into memory isn't the end of
> >> the
> >> > > world
> >> > > > > but it's a pretty ugly wart. This affects every IPC protocol
> >> message
> >> > > > > everywhere.
> >> > > > >
> >> > > > > We have an opportunity to address the wart now but such a fix
> >> > > post-1.0.0
> >> > > > > will be much more difficult.
> >> > > > >
> >> > > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> >> > > > > fsaintjacques@gmail.com> wrote:
> >> > > > >
> >> > > > > > If the data buffers are still aligned, then I don't think we
> >> should
> >> > > > > > add a breaking change just for avoiding the copy on the
> >> > > > > > metadata?
> >> > I'd
> >> > > > > > expect said metadata to be small enough that zero-copy doesn't
> >> > really
> >> > > > > > affect performance.
> >> > > > > >
> >> > > > > > François
> >> > > > > >
> >> > > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <
> >> > > emkornfield@gmail.com>
> >> > > > > > wrote:
> >> > > > > > >
> >> > > > > > > While working on trying to fix undefined behavior for
> >> > > > > > > unaligned
> >> > > memory
> >> > > > > > > accesses [1], I ran into an issue with the IPC specification
> >> [2]
> >> > > which
> >> > > > > > > prevents us from ever achieving zero-copy memory mapping and
> >> > having
> >> > > > > > aligned
> >> > > > > > > accesses (i.e. clean UBSan runs).
> >> > > > > > >
> >> > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee
> >> > > > > > > aligned
> >> > > > > accesses.
> >> > > > > > >
> >> > > > > > > In the IPC format we align each message to 8-byte boundaries.
> >> We
> >> > > then
> >> > > > > > > write a int32_t integer to to denote the size of flat buffer
> >> > > metadata,
> >> > > > > > > followed immediately  by the flatbuffer metadata.  This means
> >> the
> >> > > > > > > flatbuffer metadata will never be 8 byte aligned.
> >> > > > > > >
> >> > > > > > > Do people care?  A simple fix  would be to use int64_t
> >> > > > > > > instead
> >> of
> >> > > > > int32_t
> >> > > > > > > for length.  However, any fix essentially breaks all previous
> >> > > client
> >> > > > > > > library versions or incurs a memory copy.
> >> > > > > > >
> >> > > > > > > [1] https://github.com/apache/arrow/pull/4757
> >> > > > > > > [2] https://arrow.apache.org/docs/ipc.html
> >> > > > > >
> >> > > > >
> >> > >
> >> >
> >>
> >

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by David Li <li...@gmail.com>.

I agree with Bryan and Micah - a gradual transition as part of 1.0 (or
0.15.0) would be much less painful for us than staying on pre-1.0
until we can upgrade everything using Arrow at once. It is kind of a
'have your cake and eat it too' situation, and it would be a
maintenance burden, but something like what Micah proposes would be
ideal.

Thanks,
David

On 7/19/19, Micah Kornfield <em...@gmail.com> wrote:
> I'm trying to work out the exact steps in my mind for a migration. It seems
> like one approach is:
>
> 1.  Add a code change which throws a clear exception it encounters -1 for
> size.  In java the reasonable place seems to be at [1] (there might be
> more?).   The exception should state that the current stream reader isn't
> compatible with version 1.0.0 streams (we should have similar ones in each
> language).  We can add a note about the environment variable in 2 if we
> decide to do it.  Release this change as 0.15.0 or 0.14.2 and ensure at
> least Spark upgrades to this version.
>
> 2.  Change the reader implementation to support reading both 1.0.0 streams
> and be backwards compatible with pre-1.0.0 streams.  Change the writer
> implementation to default to writing 1.0.0 streams but have an environment
> variable that make it write backwards compatible streams (writer
> compatibility seems like it should be optional).  Release this as 1.0.0
>
> 3. If provided, remove the environment variable switch in a later release.
>
> Thanks,
> Micah
>
> [1]
> https://github.com/apache/arrow/blob/9fe728c86caaf9ceb1827159eb172ff81fb98550/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageChannelReader.java#L67
>
> On Thu, Jul 18, 2019 at 8:58 PM Wes McKinney <we...@gmail.com> wrote:
>
>> To be clear, we could make a patch 0.14.x release that includes the
>> necessary compatibility changes. I presume Spark will be able to upgrade
>> to
>> a new patch release (I'd be surprised if not, otherwise how can you get
>> security fixes)?
>>
>> On Thu, Jul 18, 2019, 10:52 PM Bryan Cutler <cu...@gmail.com> wrote:
>>
>> > Hey Wes,
>> > I understand we don't want to burden 1.0 by maintaining compatibility
>> > and
>> > that is fine with me. I'm just try to figure out how to best handle
>> > this
>> > situation so Spark users won't get a cryptic error message. It sounds
>> like
>> > it will need to be handled on the Spark side to not allow mixing 1.0
>> > and
>> > pre-1.0 versions. I'm not too sure how much a 0.15.0 release with
>> > compatibility would help, it might depend on when things get released
>> > but
>> > we can discuss that in another thread.
>> >
>> > On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney <we...@gmail.com>
>> wrote:
>> >
>> > > hi Bryan -- well, the reason for the current 0.x version is precisely
>> > > to avoid a situation where we are making decisions on the basis of
>> > > maintaining forward / backward compatibility.
>> > >
>> > > One possible way forward on this is to make a 0.15.0 (0.14.2, so
>> > > there
>> > > is less trouble for Spark to upgrade) release that supports reading
>> > > _both_ old and new variants of the protocol.
>> > >
>> > > On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler <cu...@gmail.com>
>> wrote:
>> > > >
>> > > > Are we going to say that Arrow 1.0 is not compatible with any
>> > > > version
>> > > > before?  My concern is that Spark 2.4.x might get stuck on Arrow
>> > > > Java
>> > > > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will
>> > > > not
>> > > work.
>> > > > In Spark 3.0.0, though it will be no problem to update both Java
>> > > > and
>> > > Python
>> > > > to 1.0. Having a compatibility mode so that new readers/writers can
>> > work
>> > > > with old readers using a 4-byte prefix would solve the problem, but
>> if
>> > we
>> > > > don't want to do this will pyarrow be able to raise an error that
>> > clearly
>> > > > the new version does not support the old protocol?  For example,
>> would
>> > a
>> > > > pyarrow reader see the 0xFFFFFFFF and raise something like "PyArrow
>> > > > detected an old protocol and cannot continue, please use a version
>> > > > <
>> > > 1.0.0"?
>> > > >
>> > > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney <we...@gmail.com>
>> > > wrote:
>> > > >
>> > > > > Hi Francois -- copying the metadata into memory isn't the end of
>> the
>> > > world
>> > > > > but it's a pretty ugly wart. This affects every IPC protocol
>> message
>> > > > > everywhere.
>> > > > >
>> > > > > We have an opportunity to address the wart now but such a fix
>> > > post-1.0.0
>> > > > > will be much more difficult.
>> > > > >
>> > > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
>> > > > > fsaintjacques@gmail.com> wrote:
>> > > > >
>> > > > > > If the data buffers are still aligned, then I don't think we
>> should
>> > > > > > add a breaking change just for avoiding the copy on the
>> > > > > > metadata?
>> > I'd
>> > > > > > expect said metadata to be small enough that zero-copy doesn't
>> > really
>> > > > > > affect performance.
>> > > > > >
>> > > > > > François
>> > > > > >
>> > > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <
>> > > emkornfield@gmail.com>
>> > > > > > wrote:
>> > > > > > >
>> > > > > > > While working on trying to fix undefined behavior for
>> > > > > > > unaligned
>> > > memory
>> > > > > > > accesses [1], I ran into an issue with the IPC specification
>> [2]
>> > > which
>> > > > > > > prevents us from ever achieving zero-copy memory mapping and
>> > having
>> > > > > > aligned
>> > > > > > > accesses (i.e. clean UBSan runs).
>> > > > > > >
>> > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee
>> > > > > > > aligned
>> > > > > accesses.
>> > > > > > >
>> > > > > > > In the IPC format we align each message to 8-byte boundaries.
>> We
>> > > then
>> > > > > > > write a int32_t integer to to denote the size of flat buffer
>> > > metadata,
>> > > > > > > followed immediately  by the flatbuffer metadata.  This means
>> the
>> > > > > > > flatbuffer metadata will never be 8 byte aligned.
>> > > > > > >
>> > > > > > > Do people care?  A simple fix  would be to use int64_t
>> > > > > > > instead
>> of
>> > > > > int32_t
>> > > > > > > for length.  However, any fix essentially breaks all previous
>> > > client
>> > > > > > > library versions or incurs a memory copy.
>> > > > > > >
>> > > > > > > [1] https://github.com/apache/arrow/pull/4757
>> > > > > > > [2] https://arrow.apache.org/docs/ipc.html
>> > > > > >
>> > > > >
>> > >
>> >
>>
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Micah Kornfield <em...@gmail.com>.

I'm trying to work out the exact steps in my mind for a migration. It seems
like one approach is:

1.  Add a code change which throws a clear exception it encounters -1 for
size.  In java the reasonable place seems to be at [1] (there might be
more?).   The exception should state that the current stream reader isn't
compatible with version 1.0.0 streams (we should have similar ones in each
language).  We can add a note about the environment variable in 2 if we
decide to do it.  Release this change as 0.15.0 or 0.14.2 and ensure at
least Spark upgrades to this version.

2.  Change the reader implementation to support reading both 1.0.0 streams
and be backwards compatible with pre-1.0.0 streams.  Change the writer
implementation to default to writing 1.0.0 streams but have an environment
variable that make it write backwards compatible streams (writer
compatibility seems like it should be optional).  Release this as 1.0.0

3. If provided, remove the environment variable switch in a later release.

Thanks,
Micah

[1]
https://github.com/apache/arrow/blob/9fe728c86caaf9ceb1827159eb172ff81fb98550/java/vector/src/main/java/org/apache/arrow/vector/ipc/message/MessageChannelReader.java#L67

On Thu, Jul 18, 2019 at 8:58 PM Wes McKinney <we...@gmail.com> wrote:

> To be clear, we could make a patch 0.14.x release that includes the
> necessary compatibility changes. I presume Spark will be able to upgrade to
> a new patch release (I'd be surprised if not, otherwise how can you get
> security fixes)?
>
> On Thu, Jul 18, 2019, 10:52 PM Bryan Cutler <cu...@gmail.com> wrote:
>
> > Hey Wes,
> > I understand we don't want to burden 1.0 by maintaining compatibility and
> > that is fine with me. I'm just try to figure out how to best handle this
> > situation so Spark users won't get a cryptic error message. It sounds
> like
> > it will need to be handled on the Spark side to not allow mixing 1.0 and
> > pre-1.0 versions. I'm not too sure how much a 0.15.0 release with
> > compatibility would help, it might depend on when things get released but
> > we can discuss that in another thread.
> >
> > On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > hi Bryan -- well, the reason for the current 0.x version is precisely
> > > to avoid a situation where we are making decisions on the basis of
> > > maintaining forward / backward compatibility.
> > >
> > > One possible way forward on this is to make a 0.15.0 (0.14.2, so there
> > > is less trouble for Spark to upgrade) release that supports reading
> > > _both_ old and new variants of the protocol.
> > >
> > > On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler <cu...@gmail.com>
> wrote:
> > > >
> > > > Are we going to say that Arrow 1.0 is not compatible with any version
> > > > before?  My concern is that Spark 2.4.x might get stuck on Arrow Java
> > > > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not
> > > work.
> > > > In Spark 3.0.0, though it will be no problem to update both Java and
> > > Python
> > > > to 1.0. Having a compatibility mode so that new readers/writers can
> > work
> > > > with old readers using a 4-byte prefix would solve the problem, but
> if
> > we
> > > > don't want to do this will pyarrow be able to raise an error that
> > clearly
> > > > the new version does not support the old protocol?  For example,
> would
> > a
> > > > pyarrow reader see the 0xFFFFFFFF and raise something like "PyArrow
> > > > detected an old protocol and cannot continue, please use a version <
> > > 1.0.0"?
> > > >
> > > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Francois -- copying the metadata into memory isn't the end of
> the
> > > world
> > > > > but it's a pretty ugly wart. This affects every IPC protocol
> message
> > > > > everywhere.
> > > > >
> > > > > We have an opportunity to address the wart now but such a fix
> > > post-1.0.0
> > > > > will be much more difficult.
> > > > >
> > > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> > > > > fsaintjacques@gmail.com> wrote:
> > > > >
> > > > > > If the data buffers are still aligned, then I don't think we
> should
> > > > > > add a breaking change just for avoiding the copy on the metadata?
> > I'd
> > > > > > expect said metadata to be small enough that zero-copy doesn't
> > really
> > > > > > affect performance.
> > > > > >
> > > > > > François
> > > > > >
> > > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <
> > > emkornfield@gmail.com>
> > > > > > wrote:
> > > > > > >
> > > > > > > While working on trying to fix undefined behavior for unaligned
> > > memory
> > > > > > > accesses [1], I ran into an issue with the IPC specification
> [2]
> > > which
> > > > > > > prevents us from ever achieving zero-copy memory mapping and
> > having
> > > > > > aligned
> > > > > > > accesses (i.e. clean UBSan runs).
> > > > > > >
> > > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > > > > accesses.
> > > > > > >
> > > > > > > In the IPC format we align each message to 8-byte boundaries.
> We
> > > then
> > > > > > > write a int32_t integer to to denote the size of flat buffer
> > > metadata,
> > > > > > > followed immediately  by the flatbuffer metadata.  This means
> the
> > > > > > > flatbuffer metadata will never be 8 byte aligned.
> > > > > > >
> > > > > > > Do people care?  A simple fix  would be to use int64_t instead
> of
> > > > > int32_t
> > > > > > > for length.  However, any fix essentially breaks all previous
> > > client
> > > > > > > library versions or incurs a memory copy.
> > > > > > >
> > > > > > > [1] https://github.com/apache/arrow/pull/4757
> > > > > > > [2] https://arrow.apache.org/docs/ipc.html
> > > > > >
> > > > >
> > >
> >
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

To be clear, we could make a patch 0.14.x release that includes the
necessary compatibility changes. I presume Spark will be able to upgrade to
a new patch release (I'd be surprised if not, otherwise how can you get
security fixes)?

On Thu, Jul 18, 2019, 10:52 PM Bryan Cutler <cu...@gmail.com> wrote:

> Hey Wes,
> I understand we don't want to burden 1.0 by maintaining compatibility and
> that is fine with me. I'm just try to figure out how to best handle this
> situation so Spark users won't get a cryptic error message. It sounds like
> it will need to be handled on the Spark side to not allow mixing 1.0 and
> pre-1.0 versions. I'm not too sure how much a 0.15.0 release with
> compatibility would help, it might depend on when things get released but
> we can discuss that in another thread.
>
> On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Bryan -- well, the reason for the current 0.x version is precisely
> > to avoid a situation where we are making decisions on the basis of
> > maintaining forward / backward compatibility.
> >
> > One possible way forward on this is to make a 0.15.0 (0.14.2, so there
> > is less trouble for Spark to upgrade) release that supports reading
> > _both_ old and new variants of the protocol.
> >
> > On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler <cu...@gmail.com> wrote:
> > >
> > > Are we going to say that Arrow 1.0 is not compatible with any version
> > > before?  My concern is that Spark 2.4.x might get stuck on Arrow Java
> > > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not
> > work.
> > > In Spark 3.0.0, though it will be no problem to update both Java and
> > Python
> > > to 1.0. Having a compatibility mode so that new readers/writers can
> work
> > > with old readers using a 4-byte prefix would solve the problem, but if
> we
> > > don't want to do this will pyarrow be able to raise an error that
> clearly
> > > the new version does not support the old protocol?  For example, would
> a
> > > pyarrow reader see the 0xFFFFFFFF and raise something like "PyArrow
> > > detected an old protocol and cannot continue, please use a version <
> > 1.0.0"?
> > >
> > > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > >
> > > > Hi Francois -- copying the metadata into memory isn't the end of the
> > world
> > > > but it's a pretty ugly wart. This affects every IPC protocol message
> > > > everywhere.
> > > >
> > > > We have an opportunity to address the wart now but such a fix
> > post-1.0.0
> > > > will be much more difficult.
> > > >
> > > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> > > > fsaintjacques@gmail.com> wrote:
> > > >
> > > > > If the data buffers are still aligned, then I don't think we should
> > > > > add a breaking change just for avoiding the copy on the metadata?
> I'd
> > > > > expect said metadata to be small enough that zero-copy doesn't
> really
> > > > > affect performance.
> > > > >
> > > > > François
> > > > >
> > > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <
> > emkornfield@gmail.com>
> > > > > wrote:
> > > > > >
> > > > > > While working on trying to fix undefined behavior for unaligned
> > memory
> > > > > > accesses [1], I ran into an issue with the IPC specification [2]
> > which
> > > > > > prevents us from ever achieving zero-copy memory mapping and
> having
> > > > > aligned
> > > > > > accesses (i.e. clean UBSan runs).
> > > > > >
> > > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > > > accesses.
> > > > > >
> > > > > > In the IPC format we align each message to 8-byte boundaries.  We
> > then
> > > > > > write a int32_t integer to to denote the size of flat buffer
> > metadata,
> > > > > > followed immediately  by the flatbuffer metadata.  This means the
> > > > > > flatbuffer metadata will never be 8 byte aligned.
> > > > > >
> > > > > > Do people care?  A simple fix  would be to use int64_t instead of
> > > > int32_t
> > > > > > for length.  However, any fix essentially breaks all previous
> > client
> > > > > > library versions or incurs a memory copy.
> > > > > >
> > > > > > [1] https://github.com/apache/arrow/pull/4757
> > > > > > [2] https://arrow.apache.org/docs/ipc.html
> > > > >
> > > >
> >
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Bryan Cutler <cu...@gmail.com>.

Hey Wes,
I understand we don't want to burden 1.0 by maintaining compatibility and
that is fine with me. I'm just try to figure out how to best handle this
situation so Spark users won't get a cryptic error message. It sounds like
it will need to be handled on the Spark side to not allow mixing 1.0 and
pre-1.0 versions. I'm not too sure how much a 0.15.0 release with
compatibility would help, it might depend on when things get released but
we can discuss that in another thread.

On Thu, Jul 18, 2019 at 12:03 PM Wes McKinney <we...@gmail.com> wrote:

> hi Bryan -- well, the reason for the current 0.x version is precisely
> to avoid a situation where we are making decisions on the basis of
> maintaining forward / backward compatibility.
>
> One possible way forward on this is to make a 0.15.0 (0.14.2, so there
> is less trouble for Spark to upgrade) release that supports reading
> _both_ old and new variants of the protocol.
>
> On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler <cu...@gmail.com> wrote:
> >
> > Are we going to say that Arrow 1.0 is not compatible with any version
> > before?  My concern is that Spark 2.4.x might get stuck on Arrow Java
> > 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not
> work.
> > In Spark 3.0.0, though it will be no problem to update both Java and
> Python
> > to 1.0. Having a compatibility mode so that new readers/writers can work
> > with old readers using a 4-byte prefix would solve the problem, but if we
> > don't want to do this will pyarrow be able to raise an error that clearly
> > the new version does not support the old protocol?  For example, would a
> > pyarrow reader see the 0xFFFFFFFF and raise something like "PyArrow
> > detected an old protocol and cannot continue, please use a version <
> 1.0.0"?
> >
> > On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney <we...@gmail.com>
> wrote:
> >
> > > Hi Francois -- copying the metadata into memory isn't the end of the
> world
> > > but it's a pretty ugly wart. This affects every IPC protocol message
> > > everywhere.
> > >
> > > We have an opportunity to address the wart now but such a fix
> post-1.0.0
> > > will be much more difficult.
> > >
> > > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> > > fsaintjacques@gmail.com> wrote:
> > >
> > > > If the data buffers are still aligned, then I don't think we should
> > > > add a breaking change just for avoiding the copy on the metadata? I'd
> > > > expect said metadata to be small enough that zero-copy doesn't really
> > > > affect performance.
> > > >
> > > > François
> > > >
> > > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <
> emkornfield@gmail.com>
> > > > wrote:
> > > > >
> > > > > While working on trying to fix undefined behavior for unaligned
> memory
> > > > > accesses [1], I ran into an issue with the IPC specification [2]
> which
> > > > > prevents us from ever achieving zero-copy memory mapping and having
> > > > aligned
> > > > > accesses (i.e. clean UBSan runs).
> > > > >
> > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > > accesses.
> > > > >
> > > > > In the IPC format we align each message to 8-byte boundaries.  We
> then
> > > > > write a int32_t integer to to denote the size of flat buffer
> metadata,
> > > > > followed immediately  by the flatbuffer metadata.  This means the
> > > > > flatbuffer metadata will never be 8 byte aligned.
> > > > >
> > > > > Do people care?  A simple fix  would be to use int64_t instead of
> > > int32_t
> > > > > for length.  However, any fix essentially breaks all previous
> client
> > > > > library versions or incurs a memory copy.
> > > > >
> > > > > [1] https://github.com/apache/arrow/pull/4757
> > > > > [2] https://arrow.apache.org/docs/ipc.html
> > > >
> > >
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

hi Bryan -- well, the reason for the current 0.x version is precisely
to avoid a situation where we are making decisions on the basis of
maintaining forward / backward compatibility.

One possible way forward on this is to make a 0.15.0 (0.14.2, so there
is less trouble for Spark to upgrade) release that supports reading
_both_ old and new variants of the protocol.

On Thu, Jul 18, 2019 at 1:20 PM Bryan Cutler <cu...@gmail.com> wrote:
>
> Are we going to say that Arrow 1.0 is not compatible with any version
> before?  My concern is that Spark 2.4.x might get stuck on Arrow Java
> 0.14.1 and a lot of users will install PyArrow 1.0.0, which will not work.
> In Spark 3.0.0, though it will be no problem to update both Java and Python
> to 1.0. Having a compatibility mode so that new readers/writers can work
> with old readers using a 4-byte prefix would solve the problem, but if we
> don't want to do this will pyarrow be able to raise an error that clearly
> the new version does not support the old protocol?  For example, would a
> pyarrow reader see the 0xFFFFFFFF and raise something like "PyArrow
> detected an old protocol and cannot continue, please use a version < 1.0.0"?
>
> On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney <we...@gmail.com> wrote:
>
> > Hi Francois -- copying the metadata into memory isn't the end of the world
> > but it's a pretty ugly wart. This affects every IPC protocol message
> > everywhere.
> >
> > We have an opportunity to address the wart now but such a fix post-1.0.0
> > will be much more difficult.
> >
> > On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> > fsaintjacques@gmail.com> wrote:
> >
> > > If the data buffers are still aligned, then I don't think we should
> > > add a breaking change just for avoiding the copy on the metadata? I'd
> > > expect said metadata to be small enough that zero-copy doesn't really
> > > affect performance.
> > >
> > > François
> > >
> > > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <em...@gmail.com>
> > > wrote:
> > > >
> > > > While working on trying to fix undefined behavior for unaligned memory
> > > > accesses [1], I ran into an issue with the IPC specification [2] which
> > > > prevents us from ever achieving zero-copy memory mapping and having
> > > aligned
> > > > accesses (i.e. clean UBSan runs).
> > > >
> > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > accesses.
> > > >
> > > > In the IPC format we align each message to 8-byte boundaries.  We then
> > > > write a int32_t integer to to denote the size of flat buffer metadata,
> > > > followed immediately  by the flatbuffer metadata.  This means the
> > > > flatbuffer metadata will never be 8 byte aligned.
> > > >
> > > > Do people care?  A simple fix  would be to use int64_t instead of
> > int32_t
> > > > for length.  However, any fix essentially breaks all previous client
> > > > library versions or incurs a memory copy.
> > > >
> > > > [1] https://github.com/apache/arrow/pull/4757
> > > > [2] https://arrow.apache.org/docs/ipc.html
> > >
> >

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Bryan Cutler <cu...@gmail.com>.

Are we going to say that Arrow 1.0 is not compatible with any version
before?  My concern is that Spark 2.4.x might get stuck on Arrow Java
0.14.1 and a lot of users will install PyArrow 1.0.0, which will not work.
In Spark 3.0.0, though it will be no problem to update both Java and Python
to 1.0. Having a compatibility mode so that new readers/writers can work
with old readers using a 4-byte prefix would solve the problem, but if we
don't want to do this will pyarrow be able to raise an error that clearly
the new version does not support the old protocol?  For example, would a
pyarrow reader see the 0xFFFFFFFF and raise something like "PyArrow
detected an old protocol and cannot continue, please use a version < 1.0.0"?

On Thu, Jul 11, 2019 at 12:39 PM Wes McKinney <we...@gmail.com> wrote:

> Hi Francois -- copying the metadata into memory isn't the end of the world
> but it's a pretty ugly wart. This affects every IPC protocol message
> everywhere.
>
> We have an opportunity to address the wart now but such a fix post-1.0.0
> will be much more difficult.
>
> On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
> fsaintjacques@gmail.com> wrote:
>
> > If the data buffers are still aligned, then I don't think we should
> > add a breaking change just for avoiding the copy on the metadata? I'd
> > expect said metadata to be small enough that zero-copy doesn't really
> > affect performance.
> >
> > François
> >
> > On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> > >
> > > While working on trying to fix undefined behavior for unaligned memory
> > > accesses [1], I ran into an issue with the IPC specification [2] which
> > > prevents us from ever achieving zero-copy memory mapping and having
> > aligned
> > > accesses (i.e. clean UBSan runs).
> > >
> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> accesses.
> > >
> > > In the IPC format we align each message to 8-byte boundaries.  We then
> > > write a int32_t integer to to denote the size of flat buffer metadata,
> > > followed immediately  by the flatbuffer metadata.  This means the
> > > flatbuffer metadata will never be 8 byte aligned.
> > >
> > > Do people care?  A simple fix  would be to use int64_t instead of
> int32_t
> > > for length.  However, any fix essentially breaks all previous client
> > > library versions or incurs a memory copy.
> > >
> > > [1] https://github.com/apache/arrow/pull/4757
> > > [2] https://arrow.apache.org/docs/ipc.html
> >
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

Hi Francois -- copying the metadata into memory isn't the end of the world
but it's a pretty ugly wart. This affects every IPC protocol message
everywhere.

We have an opportunity to address the wart now but such a fix post-1.0.0
will be much more difficult.

On Thu, Jul 11, 2019, 2:05 PM Francois Saint-Jacques <
fsaintjacques@gmail.com> wrote:

> If the data buffers are still aligned, then I don't think we should
> add a breaking change just for avoiding the copy on the metadata? I'd
> expect said metadata to be small enough that zero-copy doesn't really
> affect performance.
>
> François
>
> On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > While working on trying to fix undefined behavior for unaligned memory
> > accesses [1], I ran into an issue with the IPC specification [2] which
> > prevents us from ever achieving zero-copy memory mapping and having
> aligned
> > accesses (i.e. clean UBSan runs).
> >
> > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
> >
> > In the IPC format we align each message to 8-byte boundaries.  We then
> > write a int32_t integer to to denote the size of flat buffer metadata,
> > followed immediately  by the flatbuffer metadata.  This means the
> > flatbuffer metadata will never be 8 byte aligned.
> >
> > Do people care?  A simple fix  would be to use int64_t instead of int32_t
> > for length.  However, any fix essentially breaks all previous client
> > library versions or incurs a memory copy.
> >
> > [1] https://github.com/apache/arrow/pull/4757
> > [2] https://arrow.apache.org/docs/ipc.html
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Francois Saint-Jacques <fs...@gmail.com>.

If the data buffers are still aligned, then I don't think we should
add a breaking change just for avoiding the copy on the metadata? I'd
expect said metadata to be small enough that zero-copy doesn't really
affect performance.

François

On Sun, Jun 30, 2019 at 4:01 AM Micah Kornfield <em...@gmail.com> wrote:
>
> While working on trying to fix undefined behavior for unaligned memory
> accesses [1], I ran into an issue with the IPC specification [2] which
> prevents us from ever achieving zero-copy memory mapping and having aligned
> accesses (i.e. clean UBSan runs).
>
> Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
>
> In the IPC format we align each message to 8-byte boundaries.  We then
> write a int32_t integer to to denote the size of flat buffer metadata,
> followed immediately  by the flatbuffer metadata.  This means the
> flatbuffer metadata will never be 8 byte aligned.
>
> Do people care?  A simple fix  would be to use int64_t instead of int32_t
> for length.  However, any fix essentially breaks all previous client
> library versions or incurs a memory copy.
>
> [1] https://github.com/apache/arrow/pull/4757
> [2] https://arrow.apache.org/docs/ipc.html

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

Hi Bryan -- it wouldn't be forward compatible when using the 8 byte prefix,
but using the scheme we are proposing old clients would see the new prefix
as malformed (metadata length 0xFFFFFFFF = -1) rather than crashing.

We could possibly expose a forward compatibility option to write the 4 byte
prefix for the benefit of old clients, though that makes the implementation
more complicated

On Thu, Jul 11, 2019, 12:47 PM Bryan Cutler <cu...@gmail.com> wrote:

> So the proposal here will still be backwards compatible with a 4 byte
> prefix? Can you explain a little more how this might work if I have an
> older version of Java using 4 byte prefix and a new version of C++/Python
> with an 8 byte one for a roundtrip Java -> Python -> Java?
>
> On Wed, Jul 10, 2019 at 6:11 AM Wes McKinney <we...@gmail.com> wrote:
>
> > The issue is fairly esoteric, so it will probably take more time to
> > collect feedback. I could create a C++ implementation of this if it
> > helps with the process.
> >
> > On Wed, Jul 10, 2019 at 2:25 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> > >
> > > Does anybody else have thoughts on this?   Other language contributors?
> > >
> > > It seems like we still might not have enough of a consensus for a vote?
> > >
> > > Thanks,
> > > Micah
> > >
> > >
> > >
> > >
> > > On Tue, Jul 2, 2019 at 7:32 AM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > > Correct. The encapsulated IPC message will just be 4 bytes bigger.
> > > >
> > > > On Tue, Jul 2, 2019, 9:31 AM Antoine Pitrou <an...@python.org>
> > wrote:
> > > >
> > > > >
> > > > > I guess I still dont understand how the IPC stream format works :-/
> > > > >
> > > > > To put it clearly: what happens in Flight?  Will a Flight message
> > > > > automatically get the "stream continuation message" in front of it?
> > > > >
> > > > >
> > > > > Le 02/07/2019 à 16:15, Wes McKinney a écrit :
> > > > > > On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou <
> antoine@python.org>
> > > > > wrote:
> > > > > >>
> > > > > >>
> > > > > >> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> > > > > >>> Thanks for the references.
> > > > > >>>
> > > > > >>> If we decided to make a change around this, we could call the
> > first 4
> > > > > >>> bytes a stream continuation marker to make it slightly less
> ugly
> > > > > >>>
> > > > > >>> * 0xFFFFFFFF: continue
> > > > > >>> * 0x00000000: stop
> > > > > >>
> > > > > >> Do you mean it would be a separate IPC message?
> > > > > >
> > > > > > No, I think this is only about how we could change the message
> > prefix
> > > > > > from 4 bytes to 8 bytes
> > > > > >
> > > > > >
> > > > >
> > > >
> >
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
> > > > > >
> > > > > > Currently a 0x00000000 (0 metadata size) is used as an
> > end-of-stream
> > > > > > marker. So what I was saying is that the first 8 bytes could be
> > > > > >
> > > > > > <4 bytes: stream continuation><int32_t metadata size>
> > > > > >
> > > > > >>
> > > > > >>
> > > > > >>>
> > > > > >>> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <
> > > > emkornfield@gmail.com>
> > > > > wrote:
> > > > > >>>>
> > > > > >>>> Hi Wes,
> > > > > >>>> I'm not an expert on this either, my inclination mostly comes
> > from
> > > > > some research I've done.  I think it is important to distinguish
> two
> > > > cases:
> > > > > >>>> 1.  unaligned access at the processor instruction level
> > > > > >>>> 2.  undefined behavior
> > > > > >>>>
> > > > > >>>> From my reading unaligned access is fine on most modern
> > > > architectures
> > > > > and it seems the performance penalty has mostly been eliminated.
> > > > > >>>>
> > > > > >>>> Undefined behavior is a compiler/language concept.  The
> problem
> > is
> > > > > the compiler can choose to do anything in UB scenarios, not just
> the
> > > > > "obvious" translation.  Specifically, the compiler is under no
> > obligation
> > > > > to generate the unaligned access instructions, and if it doesn't
> > SEGVs
> > > > > ensue.  Two examples, both of which relate to SIMD optimizations
> are
> > > > linked
> > > > > below.
> > > > > >>>>
> > > > > >>>> I tend to be on the conservative side with this type of thing
> > but if
> > > > > we have experts on the the ML that can offer a more informed
> > opinion, I
> > > > > would love to hear it.
> > > > > >>>>
> > > > > >>>> [1]
> > > > >
> http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> > > > > >>>> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> > > > > >>>>
> > > > > >>>> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <
> > wesmckinn@gmail.com>
> > > > > wrote:
> > > > > >>>>>
> > > > > >>>>> The <0xffffffff><int32_t size> solution is downright ugly
> but I
> > > > think
> > > > > >>>>> it's one of the only ways that achieves
> > > > > >>>>>
> > > > > >>>>> * backward compatibility (new clients can read old data)
> > > > > >>>>> * opt-in forward compatibility (if we want to go to the labor
> > of
> > > > > doing
> > > > > >>>>> so, sort of dangerous)
> > > > > >>>>> * old clients receiving new data do not blow up (they will
> see
> > a
> > > > > >>>>> metadata length of -1)
> > > > > >>>>>
> > > > > >>>>> NB 0xFFFFFFFF <length> would look like:
> > > > > >>>>>
> > > > > >>>>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> > > > > >>>>> Out[13]: array([4294967295,        128], dtype=uint32)
> > > > > >>>>>
> > > > > >>>>> In [14]: np.array([(2 << 32) - 1, 128],
> > > > > >>>>> dtype=np.uint32).view(np.int32)
> > > > > >>>>> Out[14]: array([ -1, 128], dtype=int32)
> > > > > >>>>>
> > > > > >>>>> In [15]: np.array([(2 << 32) - 1, 128],
> > > > > dtype=np.uint32).view(np.uint8)
> > > > > >>>>> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0],
> > > > dtype=uint8)
> > > > > >>>>>
> > > > > >>>>> Flatbuffers are 32-bit limited so we don't need all 64 bits.
> > > > > >>>>>
> > > > > >>>>> Do you know in what circumstances unaligned reads from
> > Flatbuffers
> > > > > >>>>> might cause an issue? I do not know enough about UB but my
> > > > > >>>>> understanding is that it causes issues on some specialized
> > > > platforms
> > > > > >>>>> where for most modern x86-64 processors and compilers it is
> not
> > > > > really
> > > > > >>>>> an issue (though perhaps a performance issue)
> > > > > >>>>>
> > > > > >>>>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <
> > > > > emkornfield@gmail.com> wrote:
> > > > > >>>>>>
> > > > > >>>>>> At least on the read-side we can make this detectable by
> using
> > > > > something like <0xffffffff><int32_t size> instead of int64_t.  On
> the
> > > > write
> > > > > side we would need some sort of default mode that we could flip
> > on/off if
> > > > > we wanted to maintain compatibility.
> > > > > >>>>>>
> > > > > >>>>>> I should say I think we should fix it.  Undefined behavior
> is
> > > > > unpaid debt that might never be collected or might cause things to
> > fail
> > > > in
> > > > > difficult to diagnose ways. And pre-1.0.0 is definitely the time.
> > > > > >>>>>>
> > > > > >>>>>> -Micah
> > > > > >>>>>>
> > > > > >>>>>> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <
> > wesmckinn@gmail.com
> > > > >
> > > > > wrote:
> > > > > >>>>>>>
> > > > > >>>>>>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <
> > > > wesmckinn@gmail.com>
> > > > > wrote:
> > > > > >>>>>>>>
> > > > > >>>>>>>> hi Micah,
> > > > > >>>>>>>>
> > > > > >>>>>>>> This is definitely unfortunate, I wish we had realized the
> > > > > potential
> > > > > >>>>>>>> implications of having the Flatbuffer message start on a
> > 4-byte
> > > > > >>>>>>>> (rather than 8-byte) boundary. The cost of making such a
> > change
> > > > > now
> > > > > >>>>>>>> would be pretty high since all readers and writers in all
> > > > > languages
> > > > > >>>>>>>> would have to be changed. That being said, the 0.14.0 ->
> > 1.0.0
> > > > > version
> > > > > >>>>>>>> bump is the last opportunity we have to make a change like
> > this,
> > > > > so we
> > > > > >>>>>>>> might as well discuss it now. Note that particular
> > > > implementations
> > > > > >>>>>>>> could implement compatibility functions to handle the 4
> to 8
> > > > byte
> > > > > >>>>>>>> change so that old clients can still be understood. We'd
> > > > probably
> > > > > want
> > > > > >>>>>>>> to do this in C++, for example, since users would pretty
> > quickly
> > > > > >>>>>>>> acquire a new pyarrow version in Spark applications while
> > they
> > > > are
> > > > > >>>>>>>> stuck on an old version of the Java libraries.
> > > > > >>>>>>>
> > > > > >>>>>>> NB such a backwards compatibility fix would not be
> > > > > forward-compatible,
> > > > > >>>>>>> so the PySpark users would need to use a pinned version of
> > > > pyarrow
> > > > > >>>>>>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> > > > > >>>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> - Wes
> > > > > >>>>>>>>
> > > > > >>>>>>>> On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <
> > > > > emkornfield@gmail.com> wrote:
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> While working on trying to fix undefined behavior for
> > unaligned
> > > > > memory
> > > > > >>>>>>>>> accesses [1], I ran into an issue with the IPC
> > specification
> > > > [2]
> > > > > which
> > > > > >>>>>>>>> prevents us from ever achieving zero-copy memory mapping
> > and
> > > > > having aligned
> > > > > >>>>>>>>> accesses (i.e. clean UBSan runs).
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Flatbuffer metadata needs 8-byte alignment to guarantee
> > aligned
> > > > > accesses.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> In the IPC format we align each message to 8-byte
> > boundaries.
> > > > > We then
> > > > > >>>>>>>>> write a int32_t integer to to denote the size of flat
> > buffer
> > > > > metadata,
> > > > > >>>>>>>>> followed immediately  by the flatbuffer metadata.  This
> > means
> > > > the
> > > > > >>>>>>>>> flatbuffer metadata will never be 8 byte aligned.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> Do people care?  A simple fix  would be to use int64_t
> > instead
> > > > > of int32_t
> > > > > >>>>>>>>> for length.  However, any fix essentially breaks all
> > previous
> > > > > client
> > > > > >>>>>>>>> library versions or incurs a memory copy.
> > > > > >>>>>>>>>
> > > > > >>>>>>>>> [1] https://github.com/apache/arrow/pull/4757
> > > > > >>>>>>>>> [2] https://arrow.apache.org/docs/ipc.html
> > > > >
> > > >
> >
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Bryan Cutler <cu...@gmail.com>.

So the proposal here will still be backwards compatible with a 4 byte
prefix? Can you explain a little more how this might work if I have an
older version of Java using 4 byte prefix and a new version of C++/Python
with an 8 byte one for a roundtrip Java -> Python -> Java?

On Wed, Jul 10, 2019 at 6:11 AM Wes McKinney <we...@gmail.com> wrote:

> The issue is fairly esoteric, so it will probably take more time to
> collect feedback. I could create a C++ implementation of this if it
> helps with the process.
>
> On Wed, Jul 10, 2019 at 2:25 AM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > Does anybody else have thoughts on this?   Other language contributors?
> >
> > It seems like we still might not have enough of a consensus for a vote?
> >
> > Thanks,
> > Micah
> >
> >
> >
> >
> > On Tue, Jul 2, 2019 at 7:32 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > Correct. The encapsulated IPC message will just be 4 bytes bigger.
> > >
> > > On Tue, Jul 2, 2019, 9:31 AM Antoine Pitrou <an...@python.org>
> wrote:
> > >
> > > >
> > > > I guess I still dont understand how the IPC stream format works :-/
> > > >
> > > > To put it clearly: what happens in Flight?  Will a Flight message
> > > > automatically get the "stream continuation message" in front of it?
> > > >
> > > >
> > > > Le 02/07/2019 à 16:15, Wes McKinney a écrit :
> > > > > On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou <an...@python.org>
> > > > wrote:
> > > > >>
> > > > >>
> > > > >> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> > > > >>> Thanks for the references.
> > > > >>>
> > > > >>> If we decided to make a change around this, we could call the
> first 4
> > > > >>> bytes a stream continuation marker to make it slightly less ugly
> > > > >>>
> > > > >>> * 0xFFFFFFFF: continue
> > > > >>> * 0x00000000: stop
> > > > >>
> > > > >> Do you mean it would be a separate IPC message?
> > > > >
> > > > > No, I think this is only about how we could change the message
> prefix
> > > > > from 4 bytes to 8 bytes
> > > > >
> > > > >
> > > >
> > >
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
> > > > >
> > > > > Currently a 0x00000000 (0 metadata size) is used as an
> end-of-stream
> > > > > marker. So what I was saying is that the first 8 bytes could be
> > > > >
> > > > > <4 bytes: stream continuation><int32_t metadata size>
> > > > >
> > > > >>
> > > > >>
> > > > >>>
> > > > >>> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <
> > > emkornfield@gmail.com>
> > > > wrote:
> > > > >>>>
> > > > >>>> Hi Wes,
> > > > >>>> I'm not an expert on this either, my inclination mostly comes
> from
> > > > some research I've done.  I think it is important to distinguish two
> > > cases:
> > > > >>>> 1.  unaligned access at the processor instruction level
> > > > >>>> 2.  undefined behavior
> > > > >>>>
> > > > >>>> From my reading unaligned access is fine on most modern
> > > architectures
> > > > and it seems the performance penalty has mostly been eliminated.
> > > > >>>>
> > > > >>>> Undefined behavior is a compiler/language concept.  The problem
> is
> > > > the compiler can choose to do anything in UB scenarios, not just the
> > > > "obvious" translation.  Specifically, the compiler is under no
> obligation
> > > > to generate the unaligned access instructions, and if it doesn't
> SEGVs
> > > > ensue.  Two examples, both of which relate to SIMD optimizations are
> > > linked
> > > > below.
> > > > >>>>
> > > > >>>> I tend to be on the conservative side with this type of thing
> but if
> > > > we have experts on the the ML that can offer a more informed
> opinion, I
> > > > would love to hear it.
> > > > >>>>
> > > > >>>> [1]
> > > > http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> > > > >>>> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> > > > >>>>
> > > > >>>> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <
> wesmckinn@gmail.com>
> > > > wrote:
> > > > >>>>>
> > > > >>>>> The <0xffffffff><int32_t size> solution is downright ugly but I
> > > think
> > > > >>>>> it's one of the only ways that achieves
> > > > >>>>>
> > > > >>>>> * backward compatibility (new clients can read old data)
> > > > >>>>> * opt-in forward compatibility (if we want to go to the labor
> of
> > > > doing
> > > > >>>>> so, sort of dangerous)
> > > > >>>>> * old clients receiving new data do not blow up (they will see
> a
> > > > >>>>> metadata length of -1)
> > > > >>>>>
> > > > >>>>> NB 0xFFFFFFFF <length> would look like:
> > > > >>>>>
> > > > >>>>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> > > > >>>>> Out[13]: array([4294967295,        128], dtype=uint32)
> > > > >>>>>
> > > > >>>>> In [14]: np.array([(2 << 32) - 1, 128],
> > > > >>>>> dtype=np.uint32).view(np.int32)
> > > > >>>>> Out[14]: array([ -1, 128], dtype=int32)
> > > > >>>>>
> > > > >>>>> In [15]: np.array([(2 << 32) - 1, 128],
> > > > dtype=np.uint32).view(np.uint8)
> > > > >>>>> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0],
> > > dtype=uint8)
> > > > >>>>>
> > > > >>>>> Flatbuffers are 32-bit limited so we don't need all 64 bits.
> > > > >>>>>
> > > > >>>>> Do you know in what circumstances unaligned reads from
> Flatbuffers
> > > > >>>>> might cause an issue? I do not know enough about UB but my
> > > > >>>>> understanding is that it causes issues on some specialized
> > > platforms
> > > > >>>>> where for most modern x86-64 processors and compilers it is not
> > > > really
> > > > >>>>> an issue (though perhaps a performance issue)
> > > > >>>>>
> > > > >>>>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <
> > > > emkornfield@gmail.com> wrote:
> > > > >>>>>>
> > > > >>>>>> At least on the read-side we can make this detectable by using
> > > > something like <0xffffffff><int32_t size> instead of int64_t.  On the
> > > write
> > > > side we would need some sort of default mode that we could flip
> on/off if
> > > > we wanted to maintain compatibility.
> > > > >>>>>>
> > > > >>>>>> I should say I think we should fix it.  Undefined behavior is
> > > > unpaid debt that might never be collected or might cause things to
> fail
> > > in
> > > > difficult to diagnose ways. And pre-1.0.0 is definitely the time.
> > > > >>>>>>
> > > > >>>>>> -Micah
> > > > >>>>>>
> > > > >>>>>> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <
> wesmckinn@gmail.com
> > > >
> > > > wrote:
> > > > >>>>>>>
> > > > >>>>>>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <
> > > wesmckinn@gmail.com>
> > > > wrote:
> > > > >>>>>>>>
> > > > >>>>>>>> hi Micah,
> > > > >>>>>>>>
> > > > >>>>>>>> This is definitely unfortunate, I wish we had realized the
> > > > potential
> > > > >>>>>>>> implications of having the Flatbuffer message start on a
> 4-byte
> > > > >>>>>>>> (rather than 8-byte) boundary. The cost of making such a
> change
> > > > now
> > > > >>>>>>>> would be pretty high since all readers and writers in all
> > > > languages
> > > > >>>>>>>> would have to be changed. That being said, the 0.14.0 ->
> 1.0.0
> > > > version
> > > > >>>>>>>> bump is the last opportunity we have to make a change like
> this,
> > > > so we
> > > > >>>>>>>> might as well discuss it now. Note that particular
> > > implementations
> > > > >>>>>>>> could implement compatibility functions to handle the 4 to 8
> > > byte
> > > > >>>>>>>> change so that old clients can still be understood. We'd
> > > probably
> > > > want
> > > > >>>>>>>> to do this in C++, for example, since users would pretty
> quickly
> > > > >>>>>>>> acquire a new pyarrow version in Spark applications while
> they
> > > are
> > > > >>>>>>>> stuck on an old version of the Java libraries.
> > > > >>>>>>>
> > > > >>>>>>> NB such a backwards compatibility fix would not be
> > > > forward-compatible,
> > > > >>>>>>> so the PySpark users would need to use a pinned version of
> > > pyarrow
> > > > >>>>>>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> > > > >>>>>>>
> > > > >>>>>>>>
> > > > >>>>>>>> - Wes
> > > > >>>>>>>>
> > > > >>>>>>>> On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <
> > > > emkornfield@gmail.com> wrote:
> > > > >>>>>>>>>
> > > > >>>>>>>>> While working on trying to fix undefined behavior for
> unaligned
> > > > memory
> > > > >>>>>>>>> accesses [1], I ran into an issue with the IPC
> specification
> > > [2]
> > > > which
> > > > >>>>>>>>> prevents us from ever achieving zero-copy memory mapping
> and
> > > > having aligned
> > > > >>>>>>>>> accesses (i.e. clean UBSan runs).
> > > > >>>>>>>>>
> > > > >>>>>>>>> Flatbuffer metadata needs 8-byte alignment to guarantee
> aligned
> > > > accesses.
> > > > >>>>>>>>>
> > > > >>>>>>>>> In the IPC format we align each message to 8-byte
> boundaries.
> > > > We then
> > > > >>>>>>>>> write a int32_t integer to to denote the size of flat
> buffer
> > > > metadata,
> > > > >>>>>>>>> followed immediately  by the flatbuffer metadata.  This
> means
> > > the
> > > > >>>>>>>>> flatbuffer metadata will never be 8 byte aligned.
> > > > >>>>>>>>>
> > > > >>>>>>>>> Do people care?  A simple fix  would be to use int64_t
> instead
> > > > of int32_t
> > > > >>>>>>>>> for length.  However, any fix essentially breaks all
> previous
> > > > client
> > > > >>>>>>>>> library versions or incurs a memory copy.
> > > > >>>>>>>>>
> > > > >>>>>>>>> [1] https://github.com/apache/arrow/pull/4757
> > > > >>>>>>>>> [2] https://arrow.apache.org/docs/ipc.html
> > > >
> > >
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

The issue is fairly esoteric, so it will probably take more time to
collect feedback. I could create a C++ implementation of this if it
helps with the process.

On Wed, Jul 10, 2019 at 2:25 AM Micah Kornfield <em...@gmail.com> wrote:
>
> Does anybody else have thoughts on this?   Other language contributors?
>
> It seems like we still might not have enough of a consensus for a vote?
>
> Thanks,
> Micah
>
>
>
>
> On Tue, Jul 2, 2019 at 7:32 AM Wes McKinney <we...@gmail.com> wrote:
>
> > Correct. The encapsulated IPC message will just be 4 bytes bigger.
> >
> > On Tue, Jul 2, 2019, 9:31 AM Antoine Pitrou <an...@python.org> wrote:
> >
> > >
> > > I guess I still dont understand how the IPC stream format works :-/
> > >
> > > To put it clearly: what happens in Flight?  Will a Flight message
> > > automatically get the "stream continuation message" in front of it?
> > >
> > >
> > > Le 02/07/2019 à 16:15, Wes McKinney a écrit :
> > > > On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou <an...@python.org>
> > > wrote:
> > > >>
> > > >>
> > > >> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> > > >>> Thanks for the references.
> > > >>>
> > > >>> If we decided to make a change around this, we could call the first 4
> > > >>> bytes a stream continuation marker to make it slightly less ugly
> > > >>>
> > > >>> * 0xFFFFFFFF: continue
> > > >>> * 0x00000000: stop
> > > >>
> > > >> Do you mean it would be a separate IPC message?
> > > >
> > > > No, I think this is only about how we could change the message prefix
> > > > from 4 bytes to 8 bytes
> > > >
> > > >
> > >
> > https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
> > > >
> > > > Currently a 0x00000000 (0 metadata size) is used as an end-of-stream
> > > > marker. So what I was saying is that the first 8 bytes could be
> > > >
> > > > <4 bytes: stream continuation><int32_t metadata size>
> > > >
> > > >>
> > > >>
> > > >>>
> > > >>> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <
> > emkornfield@gmail.com>
> > > wrote:
> > > >>>>
> > > >>>> Hi Wes,
> > > >>>> I'm not an expert on this either, my inclination mostly comes from
> > > some research I've done.  I think it is important to distinguish two
> > cases:
> > > >>>> 1.  unaligned access at the processor instruction level
> > > >>>> 2.  undefined behavior
> > > >>>>
> > > >>>> From my reading unaligned access is fine on most modern
> > architectures
> > > and it seems the performance penalty has mostly been eliminated.
> > > >>>>
> > > >>>> Undefined behavior is a compiler/language concept.  The problem is
> > > the compiler can choose to do anything in UB scenarios, not just the
> > > "obvious" translation.  Specifically, the compiler is under no obligation
> > > to generate the unaligned access instructions, and if it doesn't SEGVs
> > > ensue.  Two examples, both of which relate to SIMD optimizations are
> > linked
> > > below.
> > > >>>>
> > > >>>> I tend to be on the conservative side with this type of thing but if
> > > we have experts on the the ML that can offer a more informed opinion, I
> > > would love to hear it.
> > > >>>>
> > > >>>> [1]
> > > http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> > > >>>> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> > > >>>>
> > > >>>> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > >>>>>
> > > >>>>> The <0xffffffff><int32_t size> solution is downright ugly but I
> > think
> > > >>>>> it's one of the only ways that achieves
> > > >>>>>
> > > >>>>> * backward compatibility (new clients can read old data)
> > > >>>>> * opt-in forward compatibility (if we want to go to the labor of
> > > doing
> > > >>>>> so, sort of dangerous)
> > > >>>>> * old clients receiving new data do not blow up (they will see a
> > > >>>>> metadata length of -1)
> > > >>>>>
> > > >>>>> NB 0xFFFFFFFF <length> would look like:
> > > >>>>>
> > > >>>>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> > > >>>>> Out[13]: array([4294967295,        128], dtype=uint32)
> > > >>>>>
> > > >>>>> In [14]: np.array([(2 << 32) - 1, 128],
> > > >>>>> dtype=np.uint32).view(np.int32)
> > > >>>>> Out[14]: array([ -1, 128], dtype=int32)
> > > >>>>>
> > > >>>>> In [15]: np.array([(2 << 32) - 1, 128],
> > > dtype=np.uint32).view(np.uint8)
> > > >>>>> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0],
> > dtype=uint8)
> > > >>>>>
> > > >>>>> Flatbuffers are 32-bit limited so we don't need all 64 bits.
> > > >>>>>
> > > >>>>> Do you know in what circumstances unaligned reads from Flatbuffers
> > > >>>>> might cause an issue? I do not know enough about UB but my
> > > >>>>> understanding is that it causes issues on some specialized
> > platforms
> > > >>>>> where for most modern x86-64 processors and compilers it is not
> > > really
> > > >>>>> an issue (though perhaps a performance issue)
> > > >>>>>
> > > >>>>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <
> > > emkornfield@gmail.com> wrote:
> > > >>>>>>
> > > >>>>>> At least on the read-side we can make this detectable by using
> > > something like <0xffffffff><int32_t size> instead of int64_t.  On the
> > write
> > > side we would need some sort of default mode that we could flip on/off if
> > > we wanted to maintain compatibility.
> > > >>>>>>
> > > >>>>>> I should say I think we should fix it.  Undefined behavior is
> > > unpaid debt that might never be collected or might cause things to fail
> > in
> > > difficult to diagnose ways. And pre-1.0.0 is definitely the time.
> > > >>>>>>
> > > >>>>>> -Micah
> > > >>>>>>
> > > >>>>>> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
> > >
> > > wrote:
> > > >>>>>>>
> > > >>>>>>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <
> > wesmckinn@gmail.com>
> > > wrote:
> > > >>>>>>>>
> > > >>>>>>>> hi Micah,
> > > >>>>>>>>
> > > >>>>>>>> This is definitely unfortunate, I wish we had realized the
> > > potential
> > > >>>>>>>> implications of having the Flatbuffer message start on a 4-byte
> > > >>>>>>>> (rather than 8-byte) boundary. The cost of making such a change
> > > now
> > > >>>>>>>> would be pretty high since all readers and writers in all
> > > languages
> > > >>>>>>>> would have to be changed. That being said, the 0.14.0 -> 1.0.0
> > > version
> > > >>>>>>>> bump is the last opportunity we have to make a change like this,
> > > so we
> > > >>>>>>>> might as well discuss it now. Note that particular
> > implementations
> > > >>>>>>>> could implement compatibility functions to handle the 4 to 8
> > byte
> > > >>>>>>>> change so that old clients can still be understood. We'd
> > probably
> > > want
> > > >>>>>>>> to do this in C++, for example, since users would pretty quickly
> > > >>>>>>>> acquire a new pyarrow version in Spark applications while they
> > are
> > > >>>>>>>> stuck on an old version of the Java libraries.
> > > >>>>>>>
> > > >>>>>>> NB such a backwards compatibility fix would not be
> > > forward-compatible,
> > > >>>>>>> so the PySpark users would need to use a pinned version of
> > pyarrow
> > > >>>>>>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> > > >>>>>>>
> > > >>>>>>>>
> > > >>>>>>>> - Wes
> > > >>>>>>>>
> > > >>>>>>>> On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <
> > > emkornfield@gmail.com> wrote:
> > > >>>>>>>>>
> > > >>>>>>>>> While working on trying to fix undefined behavior for unaligned
> > > memory
> > > >>>>>>>>> accesses [1], I ran into an issue with the IPC specification
> > [2]
> > > which
> > > >>>>>>>>> prevents us from ever achieving zero-copy memory mapping and
> > > having aligned
> > > >>>>>>>>> accesses (i.e. clean UBSan runs).
> > > >>>>>>>>>
> > > >>>>>>>>> Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > > accesses.
> > > >>>>>>>>>
> > > >>>>>>>>> In the IPC format we align each message to 8-byte boundaries.
> > > We then
> > > >>>>>>>>> write a int32_t integer to to denote the size of flat buffer
> > > metadata,
> > > >>>>>>>>> followed immediately  by the flatbuffer metadata.  This means
> > the
> > > >>>>>>>>> flatbuffer metadata will never be 8 byte aligned.
> > > >>>>>>>>>
> > > >>>>>>>>> Do people care?  A simple fix  would be to use int64_t instead
> > > of int32_t
> > > >>>>>>>>> for length.  However, any fix essentially breaks all previous
> > > client
> > > >>>>>>>>> library versions or incurs a memory copy.
> > > >>>>>>>>>
> > > >>>>>>>>> [1] https://github.com/apache/arrow/pull/4757
> > > >>>>>>>>> [2] https://arrow.apache.org/docs/ipc.html
> > >
> >

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Micah Kornfield <em...@gmail.com>.

Does anybody else have thoughts on this?   Other language contributors?

It seems like we still might not have enough of a consensus for a vote?

Thanks,
Micah




On Tue, Jul 2, 2019 at 7:32 AM Wes McKinney <we...@gmail.com> wrote:

> Correct. The encapsulated IPC message will just be 4 bytes bigger.
>
> On Tue, Jul 2, 2019, 9:31 AM Antoine Pitrou <an...@python.org> wrote:
>
> >
> > I guess I still dont understand how the IPC stream format works :-/
> >
> > To put it clearly: what happens in Flight?  Will a Flight message
> > automatically get the "stream continuation message" in front of it?
> >
> >
> > Le 02/07/2019 à 16:15, Wes McKinney a écrit :
> > > On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou <an...@python.org>
> > wrote:
> > >>
> > >>
> > >> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> > >>> Thanks for the references.
> > >>>
> > >>> If we decided to make a change around this, we could call the first 4
> > >>> bytes a stream continuation marker to make it slightly less ugly
> > >>>
> > >>> * 0xFFFFFFFF: continue
> > >>> * 0x00000000: stop
> > >>
> > >> Do you mean it would be a separate IPC message?
> > >
> > > No, I think this is only about how we could change the message prefix
> > > from 4 bytes to 8 bytes
> > >
> > >
> >
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
> > >
> > > Currently a 0x00000000 (0 metadata size) is used as an end-of-stream
> > > marker. So what I was saying is that the first 8 bytes could be
> > >
> > > <4 bytes: stream continuation><int32_t metadata size>
> > >
> > >>
> > >>
> > >>>
> > >>> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <
> emkornfield@gmail.com>
> > wrote:
> > >>>>
> > >>>> Hi Wes,
> > >>>> I'm not an expert on this either, my inclination mostly comes from
> > some research I've done.  I think it is important to distinguish two
> cases:
> > >>>> 1.  unaligned access at the processor instruction level
> > >>>> 2.  undefined behavior
> > >>>>
> > >>>> From my reading unaligned access is fine on most modern
> architectures
> > and it seems the performance penalty has mostly been eliminated.
> > >>>>
> > >>>> Undefined behavior is a compiler/language concept.  The problem is
> > the compiler can choose to do anything in UB scenarios, not just the
> > "obvious" translation.  Specifically, the compiler is under no obligation
> > to generate the unaligned access instructions, and if it doesn't SEGVs
> > ensue.  Two examples, both of which relate to SIMD optimizations are
> linked
> > below.
> > >>>>
> > >>>> I tend to be on the conservative side with this type of thing but if
> > we have experts on the the ML that can offer a more informed opinion, I
> > would love to hear it.
> > >>>>
> > >>>> [1]
> > http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> > >>>> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> > >>>>
> > >>>> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > >>>>>
> > >>>>> The <0xffffffff><int32_t size> solution is downright ugly but I
> think
> > >>>>> it's one of the only ways that achieves
> > >>>>>
> > >>>>> * backward compatibility (new clients can read old data)
> > >>>>> * opt-in forward compatibility (if we want to go to the labor of
> > doing
> > >>>>> so, sort of dangerous)
> > >>>>> * old clients receiving new data do not blow up (they will see a
> > >>>>> metadata length of -1)
> > >>>>>
> > >>>>> NB 0xFFFFFFFF <length> would look like:
> > >>>>>
> > >>>>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> > >>>>> Out[13]: array([4294967295,        128], dtype=uint32)
> > >>>>>
> > >>>>> In [14]: np.array([(2 << 32) - 1, 128],
> > >>>>> dtype=np.uint32).view(np.int32)
> > >>>>> Out[14]: array([ -1, 128], dtype=int32)
> > >>>>>
> > >>>>> In [15]: np.array([(2 << 32) - 1, 128],
> > dtype=np.uint32).view(np.uint8)
> > >>>>> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0],
> dtype=uint8)
> > >>>>>
> > >>>>> Flatbuffers are 32-bit limited so we don't need all 64 bits.
> > >>>>>
> > >>>>> Do you know in what circumstances unaligned reads from Flatbuffers
> > >>>>> might cause an issue? I do not know enough about UB but my
> > >>>>> understanding is that it causes issues on some specialized
> platforms
> > >>>>> where for most modern x86-64 processors and compilers it is not
> > really
> > >>>>> an issue (though perhaps a performance issue)
> > >>>>>
> > >>>>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <
> > emkornfield@gmail.com> wrote:
> > >>>>>>
> > >>>>>> At least on the read-side we can make this detectable by using
> > something like <0xffffffff><int32_t size> instead of int64_t.  On the
> write
> > side we would need some sort of default mode that we could flip on/off if
> > we wanted to maintain compatibility.
> > >>>>>>
> > >>>>>> I should say I think we should fix it.  Undefined behavior is
> > unpaid debt that might never be collected or might cause things to fail
> in
> > difficult to diagnose ways. And pre-1.0.0 is definitely the time.
> > >>>>>>
> > >>>>>> -Micah
> > >>>>>>
> > >>>>>> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <wesmckinn@gmail.com
> >
> > wrote:
> > >>>>>>>
> > >>>>>>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <
> wesmckinn@gmail.com>
> > wrote:
> > >>>>>>>>
> > >>>>>>>> hi Micah,
> > >>>>>>>>
> > >>>>>>>> This is definitely unfortunate, I wish we had realized the
> > potential
> > >>>>>>>> implications of having the Flatbuffer message start on a 4-byte
> > >>>>>>>> (rather than 8-byte) boundary. The cost of making such a change
> > now
> > >>>>>>>> would be pretty high since all readers and writers in all
> > languages
> > >>>>>>>> would have to be changed. That being said, the 0.14.0 -> 1.0.0
> > version
> > >>>>>>>> bump is the last opportunity we have to make a change like this,
> > so we
> > >>>>>>>> might as well discuss it now. Note that particular
> implementations
> > >>>>>>>> could implement compatibility functions to handle the 4 to 8
> byte
> > >>>>>>>> change so that old clients can still be understood. We'd
> probably
> > want
> > >>>>>>>> to do this in C++, for example, since users would pretty quickly
> > >>>>>>>> acquire a new pyarrow version in Spark applications while they
> are
> > >>>>>>>> stuck on an old version of the Java libraries.
> > >>>>>>>
> > >>>>>>> NB such a backwards compatibility fix would not be
> > forward-compatible,
> > >>>>>>> so the PySpark users would need to use a pinned version of
> pyarrow
> > >>>>>>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> > >>>>>>>
> > >>>>>>>>
> > >>>>>>>> - Wes
> > >>>>>>>>
> > >>>>>>>> On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <
> > emkornfield@gmail.com> wrote:
> > >>>>>>>>>
> > >>>>>>>>> While working on trying to fix undefined behavior for unaligned
> > memory
> > >>>>>>>>> accesses [1], I ran into an issue with the IPC specification
> [2]
> > which
> > >>>>>>>>> prevents us from ever achieving zero-copy memory mapping and
> > having aligned
> > >>>>>>>>> accesses (i.e. clean UBSan runs).
> > >>>>>>>>>
> > >>>>>>>>> Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > accesses.
> > >>>>>>>>>
> > >>>>>>>>> In the IPC format we align each message to 8-byte boundaries.
> > We then
> > >>>>>>>>> write a int32_t integer to to denote the size of flat buffer
> > metadata,
> > >>>>>>>>> followed immediately  by the flatbuffer metadata.  This means
> the
> > >>>>>>>>> flatbuffer metadata will never be 8 byte aligned.
> > >>>>>>>>>
> > >>>>>>>>> Do people care?  A simple fix  would be to use int64_t instead
> > of int32_t
> > >>>>>>>>> for length.  However, any fix essentially breaks all previous
> > client
> > >>>>>>>>> library versions or incurs a memory copy.
> > >>>>>>>>>
> > >>>>>>>>> [1] https://github.com/apache/arrow/pull/4757
> > >>>>>>>>> [2] https://arrow.apache.org/docs/ipc.html
> >
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

Correct. The encapsulated IPC message will just be 4 bytes bigger.

On Tue, Jul 2, 2019, 9:31 AM Antoine Pitrou <an...@python.org> wrote:

>
> I guess I still dont understand how the IPC stream format works :-/
>
> To put it clearly: what happens in Flight?  Will a Flight message
> automatically get the "stream continuation message" in front of it?
>
>
> Le 02/07/2019 à 16:15, Wes McKinney a écrit :
> > On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou <an...@python.org>
> wrote:
> >>
> >>
> >> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> >>> Thanks for the references.
> >>>
> >>> If we decided to make a change around this, we could call the first 4
> >>> bytes a stream continuation marker to make it slightly less ugly
> >>>
> >>> * 0xFFFFFFFF: continue
> >>> * 0x00000000: stop
> >>
> >> Do you mean it would be a separate IPC message?
> >
> > No, I think this is only about how we could change the message prefix
> > from 4 bytes to 8 bytes
> >
> >
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
> >
> > Currently a 0x00000000 (0 metadata size) is used as an end-of-stream
> > marker. So what I was saying is that the first 8 bytes could be
> >
> > <4 bytes: stream continuation><int32_t metadata size>
> >
> >>
> >>
> >>>
> >>> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >>>>
> >>>> Hi Wes,
> >>>> I'm not an expert on this either, my inclination mostly comes from
> some research I've done.  I think it is important to distinguish two cases:
> >>>> 1.  unaligned access at the processor instruction level
> >>>> 2.  undefined behavior
> >>>>
> >>>> From my reading unaligned access is fine on most modern architectures
> and it seems the performance penalty has mostly been eliminated.
> >>>>
> >>>> Undefined behavior is a compiler/language concept.  The problem is
> the compiler can choose to do anything in UB scenarios, not just the
> "obvious" translation.  Specifically, the compiler is under no obligation
> to generate the unaligned access instructions, and if it doesn't SEGVs
> ensue.  Two examples, both of which relate to SIMD optimizations are linked
> below.
> >>>>
> >>>> I tend to be on the conservative side with this type of thing but if
> we have experts on the the ML that can offer a more informed opinion, I
> would love to hear it.
> >>>>
> >>>> [1]
> http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> >>>> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> >>>>
> >>>> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>>>>
> >>>>> The <0xffffffff><int32_t size> solution is downright ugly but I think
> >>>>> it's one of the only ways that achieves
> >>>>>
> >>>>> * backward compatibility (new clients can read old data)
> >>>>> * opt-in forward compatibility (if we want to go to the labor of
> doing
> >>>>> so, sort of dangerous)
> >>>>> * old clients receiving new data do not blow up (they will see a
> >>>>> metadata length of -1)
> >>>>>
> >>>>> NB 0xFFFFFFFF <length> would look like:
> >>>>>
> >>>>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> >>>>> Out[13]: array([4294967295,        128], dtype=uint32)
> >>>>>
> >>>>> In [14]: np.array([(2 << 32) - 1, 128],
> >>>>> dtype=np.uint32).view(np.int32)
> >>>>> Out[14]: array([ -1, 128], dtype=int32)
> >>>>>
> >>>>> In [15]: np.array([(2 << 32) - 1, 128],
> dtype=np.uint32).view(np.uint8)
> >>>>> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0], dtype=uint8)
> >>>>>
> >>>>> Flatbuffers are 32-bit limited so we don't need all 64 bits.
> >>>>>
> >>>>> Do you know in what circumstances unaligned reads from Flatbuffers
> >>>>> might cause an issue? I do not know enough about UB but my
> >>>>> understanding is that it causes issues on some specialized platforms
> >>>>> where for most modern x86-64 processors and compilers it is not
> really
> >>>>> an issue (though perhaps a performance issue)
> >>>>>
> >>>>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <
> emkornfield@gmail.com> wrote:
> >>>>>>
> >>>>>> At least on the read-side we can make this detectable by using
> something like <0xffffffff><int32_t size> instead of int64_t.  On the write
> side we would need some sort of default mode that we could flip on/off if
> we wanted to maintain compatibility.
> >>>>>>
> >>>>>> I should say I think we should fix it.  Undefined behavior is
> unpaid debt that might never be collected or might cause things to fail in
> difficult to diagnose ways. And pre-1.0.0 is definitely the time.
> >>>>>>
> >>>>>> -Micah
> >>>>>>
> >>>>>> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>>>>>>
> >>>>>>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>>>>>>>
> >>>>>>>> hi Micah,
> >>>>>>>>
> >>>>>>>> This is definitely unfortunate, I wish we had realized the
> potential
> >>>>>>>> implications of having the Flatbuffer message start on a 4-byte
> >>>>>>>> (rather than 8-byte) boundary. The cost of making such a change
> now
> >>>>>>>> would be pretty high since all readers and writers in all
> languages
> >>>>>>>> would have to be changed. That being said, the 0.14.0 -> 1.0.0
> version
> >>>>>>>> bump is the last opportunity we have to make a change like this,
> so we
> >>>>>>>> might as well discuss it now. Note that particular implementations
> >>>>>>>> could implement compatibility functions to handle the 4 to 8 byte
> >>>>>>>> change so that old clients can still be understood. We'd probably
> want
> >>>>>>>> to do this in C++, for example, since users would pretty quickly
> >>>>>>>> acquire a new pyarrow version in Spark applications while they are
> >>>>>>>> stuck on an old version of the Java libraries.
> >>>>>>>
> >>>>>>> NB such a backwards compatibility fix would not be
> forward-compatible,
> >>>>>>> so the PySpark users would need to use a pinned version of pyarrow
> >>>>>>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> >>>>>>>
> >>>>>>>>
> >>>>>>>> - Wes
> >>>>>>>>
> >>>>>>>> On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <
> emkornfield@gmail.com> wrote:
> >>>>>>>>>
> >>>>>>>>> While working on trying to fix undefined behavior for unaligned
> memory
> >>>>>>>>> accesses [1], I ran into an issue with the IPC specification [2]
> which
> >>>>>>>>> prevents us from ever achieving zero-copy memory mapping and
> having aligned
> >>>>>>>>> accesses (i.e. clean UBSan runs).
> >>>>>>>>>
> >>>>>>>>> Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> accesses.
> >>>>>>>>>
> >>>>>>>>> In the IPC format we align each message to 8-byte boundaries.
> We then
> >>>>>>>>> write a int32_t integer to to denote the size of flat buffer
> metadata,
> >>>>>>>>> followed immediately  by the flatbuffer metadata.  This means the
> >>>>>>>>> flatbuffer metadata will never be 8 byte aligned.
> >>>>>>>>>
> >>>>>>>>> Do people care?  A simple fix  would be to use int64_t instead
> of int32_t
> >>>>>>>>> for length.  However, any fix essentially breaks all previous
> client
> >>>>>>>>> library versions or incurs a memory copy.
> >>>>>>>>>
> >>>>>>>>> [1] https://github.com/apache/arrow/pull/4757
> >>>>>>>>> [2] https://arrow.apache.org/docs/ipc.html
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Antoine Pitrou <an...@python.org>.

I guess I still dont understand how the IPC stream format works :-/

To put it clearly: what happens in Flight?  Will a Flight message
automatically get the "stream continuation message" in front of it?


Le 02/07/2019 à 16:15, Wes McKinney a écrit :
> On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou <an...@python.org> wrote:
>>
>>
>> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
>>> Thanks for the references.
>>>
>>> If we decided to make a change around this, we could call the first 4
>>> bytes a stream continuation marker to make it slightly less ugly
>>>
>>> * 0xFFFFFFFF: continue
>>> * 0x00000000: stop
>>
>> Do you mean it would be a separate IPC message?
> 
> No, I think this is only about how we could change the message prefix
> from 4 bytes to 8 bytes
> 
> https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format
> 
> Currently a 0x00000000 (0 metadata size) is used as an end-of-stream
> marker. So what I was saying is that the first 8 bytes could be
> 
> <4 bytes: stream continuation><int32_t metadata size>
> 
>>
>>
>>>
>>> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <em...@gmail.com> wrote:
>>>>
>>>> Hi Wes,
>>>> I'm not an expert on this either, my inclination mostly comes from some research I've done.  I think it is important to distinguish two cases:
>>>> 1.  unaligned access at the processor instruction level
>>>> 2.  undefined behavior
>>>>
>>>> From my reading unaligned access is fine on most modern architectures and it seems the performance penalty has mostly been eliminated.
>>>>
>>>> Undefined behavior is a compiler/language concept.  The problem is the compiler can choose to do anything in UB scenarios, not just the "obvious" translation.  Specifically, the compiler is under no obligation to generate the unaligned access instructions, and if it doesn't SEGVs ensue.  Two examples, both of which relate to SIMD optimizations are linked below.
>>>>
>>>> I tend to be on the conservative side with this type of thing but if we have experts on the the ML that can offer a more informed opinion, I would love to hear it.
>>>>
>>>> [1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
>>>> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
>>>>
>>>> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <we...@gmail.com> wrote:
>>>>>
>>>>> The <0xffffffff><int32_t size> solution is downright ugly but I think
>>>>> it's one of the only ways that achieves
>>>>>
>>>>> * backward compatibility (new clients can read old data)
>>>>> * opt-in forward compatibility (if we want to go to the labor of doing
>>>>> so, sort of dangerous)
>>>>> * old clients receiving new data do not blow up (they will see a
>>>>> metadata length of -1)
>>>>>
>>>>> NB 0xFFFFFFFF <length> would look like:
>>>>>
>>>>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
>>>>> Out[13]: array([4294967295,        128], dtype=uint32)
>>>>>
>>>>> In [14]: np.array([(2 << 32) - 1, 128],
>>>>> dtype=np.uint32).view(np.int32)
>>>>> Out[14]: array([ -1, 128], dtype=int32)
>>>>>
>>>>> In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8)
>>>>> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0], dtype=uint8)
>>>>>
>>>>> Flatbuffers are 32-bit limited so we don't need all 64 bits.
>>>>>
>>>>> Do you know in what circumstances unaligned reads from Flatbuffers
>>>>> might cause an issue? I do not know enough about UB but my
>>>>> understanding is that it causes issues on some specialized platforms
>>>>> where for most modern x86-64 processors and compilers it is not really
>>>>> an issue (though perhaps a performance issue)
>>>>>
>>>>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <em...@gmail.com> wrote:
>>>>>>
>>>>>> At least on the read-side we can make this detectable by using something like <0xffffffff><int32_t size> instead of int64_t.  On the write side we would need some sort of default mode that we could flip on/off if we wanted to maintain compatibility.
>>>>>>
>>>>>> I should say I think we should fix it.  Undefined behavior is unpaid debt that might never be collected or might cause things to fail in difficult to diagnose ways. And pre-1.0.0 is definitely the time.
>>>>>>
>>>>>> -Micah
>>>>>>
>>>>>> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
>>>>>>>
>>>>>>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> hi Micah,
>>>>>>>>
>>>>>>>> This is definitely unfortunate, I wish we had realized the potential
>>>>>>>> implications of having the Flatbuffer message start on a 4-byte
>>>>>>>> (rather than 8-byte) boundary. The cost of making such a change now
>>>>>>>> would be pretty high since all readers and writers in all languages
>>>>>>>> would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
>>>>>>>> bump is the last opportunity we have to make a change like this, so we
>>>>>>>> might as well discuss it now. Note that particular implementations
>>>>>>>> could implement compatibility functions to handle the 4 to 8 byte
>>>>>>>> change so that old clients can still be understood. We'd probably want
>>>>>>>> to do this in C++, for example, since users would pretty quickly
>>>>>>>> acquire a new pyarrow version in Spark applications while they are
>>>>>>>> stuck on an old version of the Java libraries.
>>>>>>>
>>>>>>> NB such a backwards compatibility fix would not be forward-compatible,
>>>>>>> so the PySpark users would need to use a pinned version of pyarrow
>>>>>>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
>>>>>>>
>>>>>>>>
>>>>>>>> - Wes
>>>>>>>>
>>>>>>>> On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <em...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> While working on trying to fix undefined behavior for unaligned memory
>>>>>>>>> accesses [1], I ran into an issue with the IPC specification [2] which
>>>>>>>>> prevents us from ever achieving zero-copy memory mapping and having aligned
>>>>>>>>> accesses (i.e. clean UBSan runs).
>>>>>>>>>
>>>>>>>>> Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
>>>>>>>>>
>>>>>>>>> In the IPC format we align each message to 8-byte boundaries.  We then
>>>>>>>>> write a int32_t integer to to denote the size of flat buffer metadata,
>>>>>>>>> followed immediately  by the flatbuffer metadata.  This means the
>>>>>>>>> flatbuffer metadata will never be 8 byte aligned.
>>>>>>>>>
>>>>>>>>> Do people care?  A simple fix  would be to use int64_t instead of int32_t
>>>>>>>>> for length.  However, any fix essentially breaks all previous client
>>>>>>>>> library versions or incurs a memory copy.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/arrow/pull/4757
>>>>>>>>> [2] https://arrow.apache.org/docs/ipc.html

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

On Tue, Jul 2, 2019 at 4:23 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> > Thanks for the references.
> >
> > If we decided to make a change around this, we could call the first 4
> > bytes a stream continuation marker to make it slightly less ugly
> >
> > * 0xFFFFFFFF: continue
> > * 0x00000000: stop
>
> Do you mean it would be a separate IPC message?

No, I think this is only about how we could change the message prefix
from 4 bytes to 8 bytes

https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#encapsulated-message-format

Currently a 0x00000000 (0 metadata size) is used as an end-of-stream
marker. So what I was saying is that the first 8 bytes could be

<4 bytes: stream continuation><int32_t metadata size>

>
>
> >
> > On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <em...@gmail.com> wrote:
> >>
> >> Hi Wes,
> >> I'm not an expert on this either, my inclination mostly comes from some research I've done.  I think it is important to distinguish two cases:
> >> 1.  unaligned access at the processor instruction level
> >> 2.  undefined behavior
> >>
> >> From my reading unaligned access is fine on most modern architectures and it seems the performance penalty has mostly been eliminated.
> >>
> >> Undefined behavior is a compiler/language concept.  The problem is the compiler can choose to do anything in UB scenarios, not just the "obvious" translation.  Specifically, the compiler is under no obligation to generate the unaligned access instructions, and if it doesn't SEGVs ensue.  Two examples, both of which relate to SIMD optimizations are linked below.
> >>
> >> I tend to be on the conservative side with this type of thing but if we have experts on the the ML that can offer a more informed opinion, I would love to hear it.
> >>
> >> [1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> >> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
> >>
> >> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <we...@gmail.com> wrote:
> >>>
> >>> The <0xffffffff><int32_t size> solution is downright ugly but I think
> >>> it's one of the only ways that achieves
> >>>
> >>> * backward compatibility (new clients can read old data)
> >>> * opt-in forward compatibility (if we want to go to the labor of doing
> >>> so, sort of dangerous)
> >>> * old clients receiving new data do not blow up (they will see a
> >>> metadata length of -1)
> >>>
> >>> NB 0xFFFFFFFF <length> would look like:
> >>>
> >>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> >>> Out[13]: array([4294967295,        128], dtype=uint32)
> >>>
> >>> In [14]: np.array([(2 << 32) - 1, 128],
> >>> dtype=np.uint32).view(np.int32)
> >>> Out[14]: array([ -1, 128], dtype=int32)
> >>>
> >>> In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8)
> >>> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0], dtype=uint8)
> >>>
> >>> Flatbuffers are 32-bit limited so we don't need all 64 bits.
> >>>
> >>> Do you know in what circumstances unaligned reads from Flatbuffers
> >>> might cause an issue? I do not know enough about UB but my
> >>> understanding is that it causes issues on some specialized platforms
> >>> where for most modern x86-64 processors and compilers it is not really
> >>> an issue (though perhaps a performance issue)
> >>>
> >>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <em...@gmail.com> wrote:
> >>>>
> >>>> At least on the read-side we can make this detectable by using something like <0xffffffff><int32_t size> instead of int64_t.  On the write side we would need some sort of default mode that we could flip on/off if we wanted to maintain compatibility.
> >>>>
> >>>> I should say I think we should fix it.  Undefined behavior is unpaid debt that might never be collected or might cause things to fail in difficult to diagnose ways. And pre-1.0.0 is definitely the time.
> >>>>
> >>>> -Micah
> >>>>
> >>>> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
> >>>>>
> >>>>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com> wrote:
> >>>>>>
> >>>>>> hi Micah,
> >>>>>>
> >>>>>> This is definitely unfortunate, I wish we had realized the potential
> >>>>>> implications of having the Flatbuffer message start on a 4-byte
> >>>>>> (rather than 8-byte) boundary. The cost of making such a change now
> >>>>>> would be pretty high since all readers and writers in all languages
> >>>>>> would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> >>>>>> bump is the last opportunity we have to make a change like this, so we
> >>>>>> might as well discuss it now. Note that particular implementations
> >>>>>> could implement compatibility functions to handle the 4 to 8 byte
> >>>>>> change so that old clients can still be understood. We'd probably want
> >>>>>> to do this in C++, for example, since users would pretty quickly
> >>>>>> acquire a new pyarrow version in Spark applications while they are
> >>>>>> stuck on an old version of the Java libraries.
> >>>>>
> >>>>> NB such a backwards compatibility fix would not be forward-compatible,
> >>>>> so the PySpark users would need to use a pinned version of pyarrow
> >>>>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> >>>>>
> >>>>>>
> >>>>>> - Wes
> >>>>>>
> >>>>>> On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <em...@gmail.com> wrote:
> >>>>>>>
> >>>>>>> While working on trying to fix undefined behavior for unaligned memory
> >>>>>>> accesses [1], I ran into an issue with the IPC specification [2] which
> >>>>>>> prevents us from ever achieving zero-copy memory mapping and having aligned
> >>>>>>> accesses (i.e. clean UBSan runs).
> >>>>>>>
> >>>>>>> Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
> >>>>>>>
> >>>>>>> In the IPC format we align each message to 8-byte boundaries.  We then
> >>>>>>> write a int32_t integer to to denote the size of flat buffer metadata,
> >>>>>>> followed immediately  by the flatbuffer metadata.  This means the
> >>>>>>> flatbuffer metadata will never be 8 byte aligned.
> >>>>>>>
> >>>>>>> Do people care?  A simple fix  would be to use int64_t instead of int32_t
> >>>>>>> for length.  However, any fix essentially breaks all previous client
> >>>>>>> library versions or incurs a memory copy.
> >>>>>>>
> >>>>>>> [1] https://github.com/apache/arrow/pull/4757
> >>>>>>> [2] https://arrow.apache.org/docs/ipc.html

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Antoine Pitrou <an...@python.org>.

Le 02/07/2019 à 00:20, Wes McKinney a écrit :
> Thanks for the references.
> 
> If we decided to make a change around this, we could call the first 4
> bytes a stream continuation marker to make it slightly less ugly
> 
> * 0xFFFFFFFF: continue
> * 0x00000000: stop

Do you mean it would be a separate IPC message?


> 
> On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <em...@gmail.com> wrote:
>>
>> Hi Wes,
>> I'm not an expert on this either, my inclination mostly comes from some research I've done.  I think it is important to distinguish two cases:
>> 1.  unaligned access at the processor instruction level
>> 2.  undefined behavior
>>
>> From my reading unaligned access is fine on most modern architectures and it seems the performance penalty has mostly been eliminated.
>>
>> Undefined behavior is a compiler/language concept.  The problem is the compiler can choose to do anything in UB scenarios, not just the "obvious" translation.  Specifically, the compiler is under no obligation to generate the unaligned access instructions, and if it doesn't SEGVs ensue.  Two examples, both of which relate to SIMD optimizations are linked below.
>>
>> I tend to be on the conservative side with this type of thing but if we have experts on the the ML that can offer a more informed opinion, I would love to hear it.
>>
>> [1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
>> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
>>
>> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <we...@gmail.com> wrote:
>>>
>>> The <0xffffffff><int32_t size> solution is downright ugly but I think
>>> it's one of the only ways that achieves
>>>
>>> * backward compatibility (new clients can read old data)
>>> * opt-in forward compatibility (if we want to go to the labor of doing
>>> so, sort of dangerous)
>>> * old clients receiving new data do not blow up (they will see a
>>> metadata length of -1)
>>>
>>> NB 0xFFFFFFFF <length> would look like:
>>>
>>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
>>> Out[13]: array([4294967295,        128], dtype=uint32)
>>>
>>> In [14]: np.array([(2 << 32) - 1, 128],
>>> dtype=np.uint32).view(np.int32)
>>> Out[14]: array([ -1, 128], dtype=int32)
>>>
>>> In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8)
>>> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0], dtype=uint8)
>>>
>>> Flatbuffers are 32-bit limited so we don't need all 64 bits.
>>>
>>> Do you know in what circumstances unaligned reads from Flatbuffers
>>> might cause an issue? I do not know enough about UB but my
>>> understanding is that it causes issues on some specialized platforms
>>> where for most modern x86-64 processors and compilers it is not really
>>> an issue (though perhaps a performance issue)
>>>
>>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <em...@gmail.com> wrote:
>>>>
>>>> At least on the read-side we can make this detectable by using something like <0xffffffff><int32_t size> instead of int64_t.  On the write side we would need some sort of default mode that we could flip on/off if we wanted to maintain compatibility.
>>>>
>>>> I should say I think we should fix it.  Undefined behavior is unpaid debt that might never be collected or might cause things to fail in difficult to diagnose ways. And pre-1.0.0 is definitely the time.
>>>>
>>>> -Micah
>>>>
>>>> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
>>>>>
>>>>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com> wrote:
>>>>>>
>>>>>> hi Micah,
>>>>>>
>>>>>> This is definitely unfortunate, I wish we had realized the potential
>>>>>> implications of having the Flatbuffer message start on a 4-byte
>>>>>> (rather than 8-byte) boundary. The cost of making such a change now
>>>>>> would be pretty high since all readers and writers in all languages
>>>>>> would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
>>>>>> bump is the last opportunity we have to make a change like this, so we
>>>>>> might as well discuss it now. Note that particular implementations
>>>>>> could implement compatibility functions to handle the 4 to 8 byte
>>>>>> change so that old clients can still be understood. We'd probably want
>>>>>> to do this in C++, for example, since users would pretty quickly
>>>>>> acquire a new pyarrow version in Spark applications while they are
>>>>>> stuck on an old version of the Java libraries.
>>>>>
>>>>> NB such a backwards compatibility fix would not be forward-compatible,
>>>>> so the PySpark users would need to use a pinned version of pyarrow
>>>>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
>>>>>
>>>>>>
>>>>>> - Wes
>>>>>>
>>>>>> On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <em...@gmail.com> wrote:
>>>>>>>
>>>>>>> While working on trying to fix undefined behavior for unaligned memory
>>>>>>> accesses [1], I ran into an issue with the IPC specification [2] which
>>>>>>> prevents us from ever achieving zero-copy memory mapping and having aligned
>>>>>>> accesses (i.e. clean UBSan runs).
>>>>>>>
>>>>>>> Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
>>>>>>>
>>>>>>> In the IPC format we align each message to 8-byte boundaries.  We then
>>>>>>> write a int32_t integer to to denote the size of flat buffer metadata,
>>>>>>> followed immediately  by the flatbuffer metadata.  This means the
>>>>>>> flatbuffer metadata will never be 8 byte aligned.
>>>>>>>
>>>>>>> Do people care?  A simple fix  would be to use int64_t instead of int32_t
>>>>>>> for length.  However, any fix essentially breaks all previous client
>>>>>>> library versions or incurs a memory copy.
>>>>>>>
>>>>>>> [1] https://github.com/apache/arrow/pull/4757
>>>>>>> [2] https://arrow.apache.org/docs/ipc.html

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

Thanks for the references.

If we decided to make a change around this, we could call the first 4
bytes a stream continuation marker to make it slightly less ugly

* 0xFFFFFFFF: continue
* 0x00000000: stop

On Mon, Jul 1, 2019 at 4:35 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Wes,
> I'm not an expert on this either, my inclination mostly comes from some research I've done.  I think it is important to distinguish two cases:
> 1.  unaligned access at the processor instruction level
> 2.  undefined behavior
>
> From my reading unaligned access is fine on most modern architectures and it seems the performance penalty has mostly been eliminated.
>
> Undefined behavior is a compiler/language concept.  The problem is the compiler can choose to do anything in UB scenarios, not just the "obvious" translation.  Specifically, the compiler is under no obligation to generate the unaligned access instructions, and if it doesn't SEGVs ensue.  Two examples, both of which relate to SIMD optimizations are linked below.
>
> I tend to be on the conservative side with this type of thing but if we have experts on the the ML that can offer a more informed opinion, I would love to hear it.
>
> [1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
> [2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709
>
> On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> The <0xffffffff><int32_t size> solution is downright ugly but I think
>> it's one of the only ways that achieves
>>
>> * backward compatibility (new clients can read old data)
>> * opt-in forward compatibility (if we want to go to the labor of doing
>> so, sort of dangerous)
>> * old clients receiving new data do not blow up (they will see a
>> metadata length of -1)
>>
>> NB 0xFFFFFFFF <length> would look like:
>>
>> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
>> Out[13]: array([4294967295,        128], dtype=uint32)
>>
>> In [14]: np.array([(2 << 32) - 1, 128],
>> dtype=np.uint32).view(np.int32)
>> Out[14]: array([ -1, 128], dtype=int32)
>>
>> In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8)
>> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0], dtype=uint8)
>>
>> Flatbuffers are 32-bit limited so we don't need all 64 bits.
>>
>> Do you know in what circumstances unaligned reads from Flatbuffers
>> might cause an issue? I do not know enough about UB but my
>> understanding is that it causes issues on some specialized platforms
>> where for most modern x86-64 processors and compilers it is not really
>> an issue (though perhaps a performance issue)
>>
>> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <em...@gmail.com> wrote:
>> >
>> > At least on the read-side we can make this detectable by using something like <0xffffffff><int32_t size> instead of int64_t.  On the write side we would need some sort of default mode that we could flip on/off if we wanted to maintain compatibility.
>> >
>> > I should say I think we should fix it.  Undefined behavior is unpaid debt that might never be collected or might cause things to fail in difficult to diagnose ways. And pre-1.0.0 is definitely the time.
>> >
>> > -Micah
>> >
>> > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
>> >>
>> >> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com> wrote:
>> >> >
>> >> > hi Micah,
>> >> >
>> >> > This is definitely unfortunate, I wish we had realized the potential
>> >> > implications of having the Flatbuffer message start on a 4-byte
>> >> > (rather than 8-byte) boundary. The cost of making such a change now
>> >> > would be pretty high since all readers and writers in all languages
>> >> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
>> >> > bump is the last opportunity we have to make a change like this, so we
>> >> > might as well discuss it now. Note that particular implementations
>> >> > could implement compatibility functions to handle the 4 to 8 byte
>> >> > change so that old clients can still be understood. We'd probably want
>> >> > to do this in C++, for example, since users would pretty quickly
>> >> > acquire a new pyarrow version in Spark applications while they are
>> >> > stuck on an old version of the Java libraries.
>> >>
>> >> NB such a backwards compatibility fix would not be forward-compatible,
>> >> so the PySpark users would need to use a pinned version of pyarrow
>> >> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
>> >>
>> >> >
>> >> > - Wes
>> >> >
>> >> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <em...@gmail.com> wrote:
>> >> > >
>> >> > > While working on trying to fix undefined behavior for unaligned memory
>> >> > > accesses [1], I ran into an issue with the IPC specification [2] which
>> >> > > prevents us from ever achieving zero-copy memory mapping and having aligned
>> >> > > accesses (i.e. clean UBSan runs).
>> >> > >
>> >> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
>> >> > >
>> >> > > In the IPC format we align each message to 8-byte boundaries.  We then
>> >> > > write a int32_t integer to to denote the size of flat buffer metadata,
>> >> > > followed immediately  by the flatbuffer metadata.  This means the
>> >> > > flatbuffer metadata will never be 8 byte aligned.
>> >> > >
>> >> > > Do people care?  A simple fix  would be to use int64_t instead of int32_t
>> >> > > for length.  However, any fix essentially breaks all previous client
>> >> > > library versions or incurs a memory copy.
>> >> > >
>> >> > > [1] https://github.com/apache/arrow/pull/4757
>> >> > > [2] https://arrow.apache.org/docs/ipc.html

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Micah Kornfield <em...@gmail.com>.

Hi Wes,
I'm not an expert on this either, my inclination mostly comes from some
research I've done.  I think it is important to distinguish two cases:
1.  unaligned access at the processor instruction level
2.  undefined behavior

From my reading unaligned access is fine on most modern architectures and
it seems the performance penalty has mostly been eliminated.

Undefined behavior is a compiler/language concept.  The problem is the
compiler can choose to do anything in UB scenarios, not just the "obvious"
translation.  Specifically, the compiler is under no obligation to generate
the unaligned access instructions, and if it doesn't SEGVs ensue.  Two
examples, both of which relate to SIMD optimizations are linked below.

I tend to be on the conservative side with this type of thing but if we
have experts on the the ML that can offer a more informed opinion, I would
love to hear it.

[1] http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html
[2] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=65709

On Mon, Jul 1, 2019 at 1:41 PM Wes McKinney <we...@gmail.com> wrote:

> The <0xffffffff><int32_t size> solution is downright ugly but I think
> it's one of the only ways that achieves
>
> * backward compatibility (new clients can read old data)
> * opt-in forward compatibility (if we want to go to the labor of doing
> so, sort of dangerous)
> * old clients receiving new data do not blow up (they will see a
> metadata length of -1)
>
> NB 0xFFFFFFFF <length> would look like:
>
> In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
> Out[13]: array([4294967295,        128], dtype=uint32)
>
> In [14]: np.array([(2 << 32) - 1, 128],
> dtype=np.uint32).view(np.int32)
> Out[14]: array([ -1, 128], dtype=int32)
>
> In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8)
> Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0], dtype=uint8)
>
> Flatbuffers are 32-bit limited so we don't need all 64 bits.
>
> Do you know in what circumstances unaligned reads from Flatbuffers
> might cause an issue? I do not know enough about UB but my
> understanding is that it causes issues on some specialized platforms
> where for most modern x86-64 processors and compilers it is not really
> an issue (though perhaps a performance issue)
>
> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > At least on the read-side we can make this detectable by using something
> like <0xffffffff><int32_t size> instead of int64_t.  On the write side we
> would need some sort of default mode that we could flip on/off if we wanted
> to maintain compatibility.
> >
> > I should say I think we should fix it.  Undefined behavior is unpaid
> debt that might never be collected or might cause things to fail in
> difficult to diagnose ways. And pre-1.0.0 is definitely the time.
> >
> > -Micah
> >
> > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com>
> wrote:
> >> >
> >> > hi Micah,
> >> >
> >> > This is definitely unfortunate, I wish we had realized the potential
> >> > implications of having the Flatbuffer message start on a 4-byte
> >> > (rather than 8-byte) boundary. The cost of making such a change now
> >> > would be pretty high since all readers and writers in all languages
> >> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> >> > bump is the last opportunity we have to make a change like this, so we
> >> > might as well discuss it now. Note that particular implementations
> >> > could implement compatibility functions to handle the 4 to 8 byte
> >> > change so that old clients can still be understood. We'd probably want
> >> > to do this in C++, for example, since users would pretty quickly
> >> > acquire a new pyarrow version in Spark applications while they are
> >> > stuck on an old version of the Java libraries.
> >>
> >> NB such a backwards compatibility fix would not be forward-compatible,
> >> so the PySpark users would need to use a pinned version of pyarrow
> >> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> >>
> >> >
> >> > - Wes
> >> >
> >> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <
> emkornfield@gmail.com> wrote:
> >> > >
> >> > > While working on trying to fix undefined behavior for unaligned
> memory
> >> > > accesses [1], I ran into an issue with the IPC specification [2]
> which
> >> > > prevents us from ever achieving zero-copy memory mapping and having
> aligned
> >> > > accesses (i.e. clean UBSan runs).
> >> > >
> >> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> accesses.
> >> > >
> >> > > In the IPC format we align each message to 8-byte boundaries.  We
> then
> >> > > write a int32_t integer to to denote the size of flat buffer
> metadata,
> >> > > followed immediately  by the flatbuffer metadata.  This means the
> >> > > flatbuffer metadata will never be 8 byte aligned.
> >> > >
> >> > > Do people care?  A simple fix  would be to use int64_t instead of
> int32_t
> >> > > for length.  However, any fix essentially breaks all previous client
> >> > > library versions or incurs a memory copy.
> >> > >
> >> > > [1] https://github.com/apache/arrow/pull/4757
> >> > > [2] https://arrow.apache.org/docs/ipc.html
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

The <0xffffffff><int32_t size> solution is downright ugly but I think
it's one of the only ways that achieves

* backward compatibility (new clients can read old data)
* opt-in forward compatibility (if we want to go to the labor of doing
so, sort of dangerous)
* old clients receiving new data do not blow up (they will see a
metadata length of -1)

NB 0xFFFFFFFF <length> would look like:

In [13]: np.array([(2 << 32) - 1, 128], dtype=np.uint32)
Out[13]: array([4294967295,        128], dtype=uint32)

In [14]: np.array([(2 << 32) - 1, 128],
dtype=np.uint32).view(np.int32)
Out[14]: array([ -1, 128], dtype=int32)

In [15]: np.array([(2 << 32) - 1, 128], dtype=np.uint32).view(np.uint8)
Out[15]: array([255, 255, 255, 255, 128,   0,   0,   0], dtype=uint8)

Flatbuffers are 32-bit limited so we don't need all 64 bits.

Do you know in what circumstances unaligned reads from Flatbuffers
might cause an issue? I do not know enough about UB but my
understanding is that it causes issues on some specialized platforms
where for most modern x86-64 processors and compilers it is not really
an issue (though perhaps a performance issue)

On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <em...@gmail.com> wrote:
>
> At least on the read-side we can make this detectable by using something like <0xffffffff><int32_t size> instead of int64_t.  On the write side we would need some sort of default mode that we could flip on/off if we wanted to maintain compatibility.
>
> I should say I think we should fix it.  Undefined behavior is unpaid debt that might never be collected or might cause things to fail in difficult to diagnose ways. And pre-1.0.0 is definitely the time.
>
> -Micah
>
> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com> wrote:
>> >
>> > hi Micah,
>> >
>> > This is definitely unfortunate, I wish we had realized the potential
>> > implications of having the Flatbuffer message start on a 4-byte
>> > (rather than 8-byte) boundary. The cost of making such a change now
>> > would be pretty high since all readers and writers in all languages
>> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
>> > bump is the last opportunity we have to make a change like this, so we
>> > might as well discuss it now. Note that particular implementations
>> > could implement compatibility functions to handle the 4 to 8 byte
>> > change so that old clients can still be understood. We'd probably want
>> > to do this in C++, for example, since users would pretty quickly
>> > acquire a new pyarrow version in Spark applications while they are
>> > stuck on an old version of the Java libraries.
>>
>> NB such a backwards compatibility fix would not be forward-compatible,
>> so the PySpark users would need to use a pinned version of pyarrow
>> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
>>
>> >
>> > - Wes
>> >
>> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <em...@gmail.com> wrote:
>> > >
>> > > While working on trying to fix undefined behavior for unaligned memory
>> > > accesses [1], I ran into an issue with the IPC specification [2] which
>> > > prevents us from ever achieving zero-copy memory mapping and having aligned
>> > > accesses (i.e. clean UBSan runs).
>> > >
>> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
>> > >
>> > > In the IPC format we align each message to 8-byte boundaries.  We then
>> > > write a int32_t integer to to denote the size of flat buffer metadata,
>> > > followed immediately  by the flatbuffer metadata.  This means the
>> > > flatbuffer metadata will never be 8 byte aligned.
>> > >
>> > > Do people care?  A simple fix  would be to use int64_t instead of int32_t
>> > > for length.  However, any fix essentially breaks all previous client
>> > > library versions or incurs a memory copy.
>> > >
>> > > [1] https://github.com/apache/arrow/pull/4757
>> > > [2] https://arrow.apache.org/docs/ipc.html

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

hi John,

This is off topic, so please feel free to start a discussion thread
and present a detailed proposal. I believe your use case can be
addressed by pre-allocating record batches and maintaining application
level metadata about what portion of the record batches has been
"filled" (so the unfilled records can be dropped by slicing). I don't
think any change to the binary protocol is warranted

Thanks,
Wes

On Sun, Jun 30, 2019 at 9:03 PM John Muehlhausen <jg...@jgm.org> wrote:
>
> If there is going to be a breaking change to the IPC format, I'd appreciate
> some discussion about an idea I had for RecordBatch metadata.  I previously
> promised to create a discussion thread with an initial write-up but have
> not yet done so.  I will try to do this tomorrow.  (The basic idea is to
> have readers read only N records of a RecordBatch rather than all of them,
> where in the usual case these row counts are the same.  This opens the door
> to reading partially populated RecordBatches which will be extremely useful
> in low-latency streaming applications.)
>
> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > At least on the read-side we can make this detectable by using something
> > like <0xffffffff><int32_t size> instead of int64_t.  On the write side we
> > would need some sort of default mode that we could flip on/off if we wanted
> > to maintain compatibility.
> >
> > I should say I think we should fix it.  Undefined behavior is unpaid debt
> > that might never be collected or might cause things to fail in difficult to
> > diagnose ways. And pre-1.0.0 is definitely the time.
> >
> > -Micah
> >
> > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >
> > > > hi Micah,
> > > >
> > > > This is definitely unfortunate, I wish we had realized the potential
> > > > implications of having the Flatbuffer message start on a 4-byte
> > > > (rather than 8-byte) boundary. The cost of making such a change now
> > > > would be pretty high since all readers and writers in all languages
> > > > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> > > > bump is the last opportunity we have to make a change like this, so we
> > > > might as well discuss it now. Note that particular implementations
> > > > could implement compatibility functions to handle the 4 to 8 byte
> > > > change so that old clients can still be understood. We'd probably want
> > > > to do this in C++, for example, since users would pretty quickly
> > > > acquire a new pyarrow version in Spark applications while they are
> > > > stuck on an old version of the Java libraries.
> > >
> > > NB such a backwards compatibility fix would not be forward-compatible,
> > > so the PySpark users would need to use a pinned version of pyarrow
> > > until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> > >
> > > >
> > > > - Wes
> > > >
> > > > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <emkornfield@gmail.com
> > >
> > > wrote:
> > > > >
> > > > > While working on trying to fix undefined behavior for unaligned
> > memory
> > > > > accesses [1], I ran into an issue with the IPC specification [2]
> > which
> > > > > prevents us from ever achieving zero-copy memory mapping and having
> > > aligned
> > > > > accesses (i.e. clean UBSan runs).
> > > > >
> > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > > accesses.
> > > > >
> > > > > In the IPC format we align each message to 8-byte boundaries.  We
> > then
> > > > > write a int32_t integer to to denote the size of flat buffer
> > metadata,
> > > > > followed immediately  by the flatbuffer metadata.  This means the
> > > > > flatbuffer metadata will never be 8 byte aligned.
> > > > >
> > > > > Do people care?  A simple fix  would be to use int64_t instead of
> > > int32_t
> > > > > for length.  However, any fix essentially breaks all previous client
> > > > > library versions or incurs a memory copy.
> > > > >
> > > > > [1] https://github.com/apache/arrow/pull/4757
> > > > > [2] https://arrow.apache.org/docs/ipc.html
> > >
> >

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by John Muehlhausen <jg...@jgm.org>.

If there is going to be a breaking change to the IPC format, I'd appreciate
some discussion about an idea I had for RecordBatch metadata.  I previously
promised to create a discussion thread with an initial write-up but have
not yet done so.  I will try to do this tomorrow.  (The basic idea is to
have readers read only N records of a RecordBatch rather than all of them,
where in the usual case these row counts are the same.  This opens the door
to reading partially populated RecordBatches which will be extremely useful
in low-latency streaming applications.)

On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <em...@gmail.com>
wrote:

> At least on the read-side we can make this detectable by using something
> like <0xffffffff><int32_t size> instead of int64_t.  On the write side we
> would need some sort of default mode that we could flip on/off if we wanted
> to maintain compatibility.
>
> I should say I think we should fix it.  Undefined behavior is unpaid debt
> that might never be collected or might cause things to fail in difficult to
> diagnose ways. And pre-1.0.0 is definitely the time.
>
> -Micah
>
> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
>
> > On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > hi Micah,
> > >
> > > This is definitely unfortunate, I wish we had realized the potential
> > > implications of having the Flatbuffer message start on a 4-byte
> > > (rather than 8-byte) boundary. The cost of making such a change now
> > > would be pretty high since all readers and writers in all languages
> > > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> > > bump is the last opportunity we have to make a change like this, so we
> > > might as well discuss it now. Note that particular implementations
> > > could implement compatibility functions to handle the 4 to 8 byte
> > > change so that old clients can still be understood. We'd probably want
> > > to do this in C++, for example, since users would pretty quickly
> > > acquire a new pyarrow version in Spark applications while they are
> > > stuck on an old version of the Java libraries.
> >
> > NB such a backwards compatibility fix would not be forward-compatible,
> > so the PySpark users would need to use a pinned version of pyarrow
> > until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> >
> > >
> > > - Wes
> > >
> > > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <emkornfield@gmail.com
> >
> > wrote:
> > > >
> > > > While working on trying to fix undefined behavior for unaligned
> memory
> > > > accesses [1], I ran into an issue with the IPC specification [2]
> which
> > > > prevents us from ever achieving zero-copy memory mapping and having
> > aligned
> > > > accesses (i.e. clean UBSan runs).
> > > >
> > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > accesses.
> > > >
> > > > In the IPC format we align each message to 8-byte boundaries.  We
> then
> > > > write a int32_t integer to to denote the size of flat buffer
> metadata,
> > > > followed immediately  by the flatbuffer metadata.  This means the
> > > > flatbuffer metadata will never be 8 byte aligned.
> > > >
> > > > Do people care?  A simple fix  would be to use int64_t instead of
> > int32_t
> > > > for length.  However, any fix essentially breaks all previous client
> > > > library versions or incurs a memory copy.
> > > >
> > > > [1] https://github.com/apache/arrow/pull/4757
> > > > [2] https://arrow.apache.org/docs/ipc.html
> >
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Micah Kornfield <em...@gmail.com>.

At least on the read-side we can make this detectable by using something
like <0xffffffff><int32_t size> instead of int64_t.  On the write side we
would need some sort of default mode that we could flip on/off if we wanted
to maintain compatibility.

I should say I think we should fix it.  Undefined behavior is unpaid debt
that might never be collected or might cause things to fail in difficult to
diagnose ways. And pre-1.0.0 is definitely the time.

-Micah

On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:

> On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > hi Micah,
> >
> > This is definitely unfortunate, I wish we had realized the potential
> > implications of having the Flatbuffer message start on a 4-byte
> > (rather than 8-byte) boundary. The cost of making such a change now
> > would be pretty high since all readers and writers in all languages
> > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> > bump is the last opportunity we have to make a change like this, so we
> > might as well discuss it now. Note that particular implementations
> > could implement compatibility functions to handle the 4 to 8 byte
> > change so that old clients can still be understood. We'd probably want
> > to do this in C++, for example, since users would pretty quickly
> > acquire a new pyarrow version in Spark applications while they are
> > stuck on an old version of the Java libraries.
>
> NB such a backwards compatibility fix would not be forward-compatible,
> so the PySpark users would need to use a pinned version of pyarrow
> until Spark upgraded to Arrow 1.0.0. Maybe that's OK
>
> >
> > - Wes
> >
> > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <em...@gmail.com>
> wrote:
> > >
> > > While working on trying to fix undefined behavior for unaligned memory
> > > accesses [1], I ran into an issue with the IPC specification [2] which
> > > prevents us from ever achieving zero-copy memory mapping and having
> aligned
> > > accesses (i.e. clean UBSan runs).
> > >
> > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> accesses.
> > >
> > > In the IPC format we align each message to 8-byte boundaries.  We then
> > > write a int32_t integer to to denote the size of flat buffer metadata,
> > > followed immediately  by the flatbuffer metadata.  This means the
> > > flatbuffer metadata will never be 8 byte aligned.
> > >
> > > Do people care?  A simple fix  would be to use int64_t instead of
> int32_t
> > > for length.  However, any fix essentially breaks all previous client
> > > library versions or incurs a memory copy.
> > >
> > > [1] https://github.com/apache/arrow/pull/4757
> > > [2] https://arrow.apache.org/docs/ipc.html
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com> wrote:
>
> hi Micah,
>
> This is definitely unfortunate, I wish we had realized the potential
> implications of having the Flatbuffer message start on a 4-byte
> (rather than 8-byte) boundary. The cost of making such a change now
> would be pretty high since all readers and writers in all languages
> would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> bump is the last opportunity we have to make a change like this, so we
> might as well discuss it now. Note that particular implementations
> could implement compatibility functions to handle the 4 to 8 byte
> change so that old clients can still be understood. We'd probably want
> to do this in C++, for example, since users would pretty quickly
> acquire a new pyarrow version in Spark applications while they are
> stuck on an old version of the Java libraries.

NB such a backwards compatibility fix would not be forward-compatible,
so the PySpark users would need to use a pinned version of pyarrow
until Spark upgraded to Arrow 1.0.0. Maybe that's OK

>
> - Wes
>
> On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <em...@gmail.com> wrote:
> >
> > While working on trying to fix undefined behavior for unaligned memory
> > accesses [1], I ran into an issue with the IPC specification [2] which
> > prevents us from ever achieving zero-copy memory mapping and having aligned
> > accesses (i.e. clean UBSan runs).
> >
> > Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
> >
> > In the IPC format we align each message to 8-byte boundaries.  We then
> > write a int32_t integer to to denote the size of flat buffer metadata,
> > followed immediately  by the flatbuffer metadata.  This means the
> > flatbuffer metadata will never be 8 byte aligned.
> >
> > Do people care?  A simple fix  would be to use int64_t instead of int32_t
> > for length.  However, any fix essentially breaks all previous client
> > library versions or incurs a memory copy.
> >
> > [1] https://github.com/apache/arrow/pull/4757
> > [2] https://arrow.apache.org/docs/ipc.html

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.

hi Micah,

This is definitely unfortunate, I wish we had realized the potential
implications of having the Flatbuffer message start on a 4-byte
(rather than 8-byte) boundary. The cost of making such a change now
would be pretty high since all readers and writers in all languages
would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
bump is the last opportunity we have to make a change like this, so we
might as well discuss it now. Note that particular implementations
could implement compatibility functions to handle the 4 to 8 byte
change so that old clients can still be understood. We'd probably want
to do this in C++, for example, since users would pretty quickly
acquire a new pyarrow version in Spark applications while they are
stuck on an old version of the Java libraries.

- Wes

On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <em...@gmail.com> wrote:
>
> While working on trying to fix undefined behavior for unaligned memory
> accesses [1], I ran into an issue with the IPC specification [2] which
> prevents us from ever achieving zero-copy memory mapping and having aligned
> accesses (i.e. clean UBSan runs).
>
> Flatbuffer metadata needs 8-byte alignment to guarantee aligned accesses.
>
> In the IPC format we align each message to 8-byte boundaries.  We then
> write a int32_t integer to to denote the size of flat buffer metadata,
> followed immediately  by the flatbuffer metadata.  This means the
> flatbuffer metadata will never be 8 byte aligned.
>
> Do people care?  A simple fix  would be to use int64_t instead of int32_t
> for length.  However, any fix essentially breaks all previous client
> library versions or incurs a memory copy.
>
> [1] https://github.com/apache/arrow/pull/4757
> [2] https://arrow.apache.org/docs/ipc.html