You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2019/08/06 20:15:56 UTC

[VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements

hi all,

As we've been discussing for the last 5 weeks or so [1], there is a
need to introduce 4 bytes of padding into the preamble of the
"encapsulated IPC message" format to ensure that the Flatbuffers
metadata payload begins on an 8-byte aligned memory offset. The
alternative to this would be for Arrow implementations where alignment
is important (e.g. C or C++) to copy the metadata (which is not always
small) into memory when it is unaligned.

Micah has proposed to address this by adding a 4-byte "continuation"
value at the beginning of the payload having the value 0xFFFFFFFF. The
reason to do it this way is that old clients will see an invalid
length (what is currently the first 4 bytes of the message -- a 32-bit
little endian signed integer indicating the metadata length) rather
than potentially crashing on a valid length.

This would be a backwards incompatible protocol change, so older Arrow
libraries would not be able to read these new messages. Maintaining
forward compatibility (reading data produced by older libraries) would
be possible as we can reason that a value other than the continuation
value was produced by an older library (and then validate the
Flatbuffer message of course). Arrow implementations could offer a
backward compatibility mode for the sake of old readers if they desire
(this may also assist with testing).

The PR making these changes to the IPC documentation is here

https://github.com/apache/arrow/pull/4951

Please vote to accept this change. This vote will be open for at least 72 hours

[ ] +1 Adopt the Arrow protocol change
[ ] +0
[ ] -1 I disagree because...

Here is my vote: +1

Thanks,
Wes

[1]: https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E

Re: [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements

Posted by Wes McKinney <we...@gmail.com>.
I'm canceling the vote until we can refine the proposal a bit more.

I think that having an 8-byte EOS as 0x0000000000000000 is OK -- I
think my concern about backwards compatibility is unwarranted.

We also previously added the EOS to the File Format

https://github.com/apache/arrow/blob/master/docs/source/format/IPC.rst#file-format

I think we should formalize this decision with this vote. Note that
making the EOS 8 bytes has the effect of aligning the file footer,
which otherwise would cause possible undefined behavior in C++.

I'll wait a day or two for more opinions to percolate and then call a new vote

On Wed, Aug 7, 2019 at 9:35 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Hmm... not sure about that.  IMHO, if the old format is detected, then a
> 4-byte EOS marker should be used.  If the new format is detected, then a
> 8-byte EOS marker should be used.
>
> Regards
>
> Antoine.
>
>
> Le 07/08/2019 à 16:16, Wes McKinney a écrit :
> > You make a good point. For backward compatibility reasons, bytes 5
> > through 8 would need to be unspecified padding bytes, I think. Does
> > that sound right?
> >
> > On Wed, Aug 7, 2019 at 9:02 AM Antoine Pitrou <an...@python.org> wrote:
> >>
> >>
> >> This may be coming a bit late, but I realize we could take the
> >> opportunity to *also* make the end-of-stream marker a 8-bytes marker
> >> (rather than 4-bytes).  What do you think?
> >>
> >> Regards
> >>
> >> Antoine.
> >>
> >>
> >> Le 06/08/2019 à 22:15, Wes McKinney a écrit :
> >>> hi all,
> >>>
> >>> As we've been discussing for the last 5 weeks or so [1], there is a
> >>> need to introduce 4 bytes of padding into the preamble of the
> >>> "encapsulated IPC message" format to ensure that the Flatbuffers
> >>> metadata payload begins on an 8-byte aligned memory offset. The
> >>> alternative to this would be for Arrow implementations where alignment
> >>> is important (e.g. C or C++) to copy the metadata (which is not always
> >>> small) into memory when it is unaligned.
> >>>
> >>> Micah has proposed to address this by adding a 4-byte "continuation"
> >>> value at the beginning of the payload having the value 0xFFFFFFFF. The
> >>> reason to do it this way is that old clients will see an invalid
> >>> length (what is currently the first 4 bytes of the message -- a 32-bit
> >>> little endian signed integer indicating the metadata length) rather
> >>> than potentially crashing on a valid length.
> >>>
> >>> This would be a backwards incompatible protocol change, so older Arrow
> >>> libraries would not be able to read these new messages. Maintaining
> >>> forward compatibility (reading data produced by older libraries) would
> >>> be possible as we can reason that a value other than the continuation
> >>> value was produced by an older library (and then validate the
> >>> Flatbuffer message of course). Arrow implementations could offer a
> >>> backward compatibility mode for the sake of old readers if they desire
> >>> (this may also assist with testing).
> >>>
> >>> The PR making these changes to the IPC documentation is here
> >>>
> >>> https://github.com/apache/arrow/pull/4951
> >>>
> >>> Please vote to accept this change. This vote will be open for at least 72 hours
> >>>
> >>> [ ] +1 Adopt the Arrow protocol change
> >>> [ ] +0
> >>> [ ] -1 I disagree because...
> >>>
> >>> Here is my vote: +1
> >>>
> >>> Thanks,
> >>> Wes
> >>>
> >>> [1]: https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
> >>>

Re: [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements

Posted by Antoine Pitrou <an...@python.org>.
Hmm... not sure about that.  IMHO, if the old format is detected, then a
4-byte EOS marker should be used.  If the new format is detected, then a
8-byte EOS marker should be used.

Regards

Antoine.


Le 07/08/2019 à 16:16, Wes McKinney a écrit :
> You make a good point. For backward compatibility reasons, bytes 5
> through 8 would need to be unspecified padding bytes, I think. Does
> that sound right?
> 
> On Wed, Aug 7, 2019 at 9:02 AM Antoine Pitrou <an...@python.org> wrote:
>>
>>
>> This may be coming a bit late, but I realize we could take the
>> opportunity to *also* make the end-of-stream marker a 8-bytes marker
>> (rather than 4-bytes).  What do you think?
>>
>> Regards
>>
>> Antoine.
>>
>>
>> Le 06/08/2019 à 22:15, Wes McKinney a écrit :
>>> hi all,
>>>
>>> As we've been discussing for the last 5 weeks or so [1], there is a
>>> need to introduce 4 bytes of padding into the preamble of the
>>> "encapsulated IPC message" format to ensure that the Flatbuffers
>>> metadata payload begins on an 8-byte aligned memory offset. The
>>> alternative to this would be for Arrow implementations where alignment
>>> is important (e.g. C or C++) to copy the metadata (which is not always
>>> small) into memory when it is unaligned.
>>>
>>> Micah has proposed to address this by adding a 4-byte "continuation"
>>> value at the beginning of the payload having the value 0xFFFFFFFF. The
>>> reason to do it this way is that old clients will see an invalid
>>> length (what is currently the first 4 bytes of the message -- a 32-bit
>>> little endian signed integer indicating the metadata length) rather
>>> than potentially crashing on a valid length.
>>>
>>> This would be a backwards incompatible protocol change, so older Arrow
>>> libraries would not be able to read these new messages. Maintaining
>>> forward compatibility (reading data produced by older libraries) would
>>> be possible as we can reason that a value other than the continuation
>>> value was produced by an older library (and then validate the
>>> Flatbuffer message of course). Arrow implementations could offer a
>>> backward compatibility mode for the sake of old readers if they desire
>>> (this may also assist with testing).
>>>
>>> The PR making these changes to the IPC documentation is here
>>>
>>> https://github.com/apache/arrow/pull/4951
>>>
>>> Please vote to accept this change. This vote will be open for at least 72 hours
>>>
>>> [ ] +1 Adopt the Arrow protocol change
>>> [ ] +0
>>> [ ] -1 I disagree because...
>>>
>>> Here is my vote: +1
>>>
>>> Thanks,
>>> Wes
>>>
>>> [1]: https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
>>>

Re: [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements

Posted by Wes McKinney <we...@gmail.com>.
You make a good point. For backward compatibility reasons, bytes 5
through 8 would need to be unspecified padding bytes, I think. Does
that sound right?

On Wed, Aug 7, 2019 at 9:02 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> This may be coming a bit late, but I realize we could take the
> opportunity to *also* make the end-of-stream marker a 8-bytes marker
> (rather than 4-bytes).  What do you think?
>
> Regards
>
> Antoine.
>
>
> Le 06/08/2019 à 22:15, Wes McKinney a écrit :
> > hi all,
> >
> > As we've been discussing for the last 5 weeks or so [1], there is a
> > need to introduce 4 bytes of padding into the preamble of the
> > "encapsulated IPC message" format to ensure that the Flatbuffers
> > metadata payload begins on an 8-byte aligned memory offset. The
> > alternative to this would be for Arrow implementations where alignment
> > is important (e.g. C or C++) to copy the metadata (which is not always
> > small) into memory when it is unaligned.
> >
> > Micah has proposed to address this by adding a 4-byte "continuation"
> > value at the beginning of the payload having the value 0xFFFFFFFF. The
> > reason to do it this way is that old clients will see an invalid
> > length (what is currently the first 4 bytes of the message -- a 32-bit
> > little endian signed integer indicating the metadata length) rather
> > than potentially crashing on a valid length.
> >
> > This would be a backwards incompatible protocol change, so older Arrow
> > libraries would not be able to read these new messages. Maintaining
> > forward compatibility (reading data produced by older libraries) would
> > be possible as we can reason that a value other than the continuation
> > value was produced by an older library (and then validate the
> > Flatbuffer message of course). Arrow implementations could offer a
> > backward compatibility mode for the sake of old readers if they desire
> > (this may also assist with testing).
> >
> > The PR making these changes to the IPC documentation is here
> >
> > https://github.com/apache/arrow/pull/4951
> >
> > Please vote to accept this change. This vote will be open for at least 72 hours
> >
> > [ ] +1 Adopt the Arrow protocol change
> > [ ] +0
> > [ ] -1 I disagree because...
> >
> > Here is my vote: +1
> >
> > Thanks,
> > Wes
> >
> > [1]: https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
> >

Re: [VOTE] Alter Arrow binary protocol to address 8-byte Flatbuffer alignment requirements

Posted by Antoine Pitrou <an...@python.org>.
This may be coming a bit late, but I realize we could take the
opportunity to *also* make the end-of-stream marker a 8-bytes marker
(rather than 4-bytes).  What do you think?

Regards

Antoine.


Le 06/08/2019 à 22:15, Wes McKinney a écrit :
> hi all,
> 
> As we've been discussing for the last 5 weeks or so [1], there is a
> need to introduce 4 bytes of padding into the preamble of the
> "encapsulated IPC message" format to ensure that the Flatbuffers
> metadata payload begins on an 8-byte aligned memory offset. The
> alternative to this would be for Arrow implementations where alignment
> is important (e.g. C or C++) to copy the metadata (which is not always
> small) into memory when it is unaligned.
> 
> Micah has proposed to address this by adding a 4-byte "continuation"
> value at the beginning of the payload having the value 0xFFFFFFFF. The
> reason to do it this way is that old clients will see an invalid
> length (what is currently the first 4 bytes of the message -- a 32-bit
> little endian signed integer indicating the metadata length) rather
> than potentially crashing on a valid length.
> 
> This would be a backwards incompatible protocol change, so older Arrow
> libraries would not be able to read these new messages. Maintaining
> forward compatibility (reading data produced by older libraries) would
> be possible as we can reason that a value other than the continuation
> value was produced by an older library (and then validate the
> Flatbuffer message of course). Arrow implementations could offer a
> backward compatibility mode for the sake of old readers if they desire
> (this may also assist with testing).
> 
> The PR making these changes to the IPC documentation is here
> 
> https://github.com/apache/arrow/pull/4951
> 
> Please vote to accept this change. This vote will be open for at least 72 hours
> 
> [ ] +1 Adopt the Arrow protocol change
> [ ] +0
> [ ] -1 I disagree because...
> 
> Here is my vote: +1
> 
> Thanks,
> Wes
> 
> [1]: https://lists.apache.org/thread.html/8440be572c49b7b2ffb76b63e6d935ada9efd9c1c2021369b6d27786@%3Cdev.arrow.apache.org%3E
>