You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by John Muehlhausen <jg...@jgm.org> on 2019/07/01 02:03:33 UTC

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

If there is going to be a breaking change to the IPC format, I'd appreciate
some discussion about an idea I had for RecordBatch metadata.  I previously
promised to create a discussion thread with an initial write-up but have
not yet done so.  I will try to do this tomorrow.  (The basic idea is to
have readers read only N records of a RecordBatch rather than all of them,
where in the usual case these row counts are the same.  This opens the door
to reading partially populated RecordBatches which will be extremely useful
in low-latency streaming applications.)

On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <em...@gmail.com>
wrote:

> At least on the read-side we can make this detectable by using something
> like <0xffffffff><int32_t size> instead of int64_t.  On the write side we
> would need some sort of default mode that we could flip on/off if we wanted
> to maintain compatibility.
>
> I should say I think we should fix it.  Undefined behavior is unpaid debt
> that might never be collected or might cause things to fail in difficult to
> diagnose ways. And pre-1.0.0 is definitely the time.
>
> -Micah
>
> On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
>
> > On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com>
> wrote:
> > >
> > > hi Micah,
> > >
> > > This is definitely unfortunate, I wish we had realized the potential
> > > implications of having the Flatbuffer message start on a 4-byte
> > > (rather than 8-byte) boundary. The cost of making such a change now
> > > would be pretty high since all readers and writers in all languages
> > > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> > > bump is the last opportunity we have to make a change like this, so we
> > > might as well discuss it now. Note that particular implementations
> > > could implement compatibility functions to handle the 4 to 8 byte
> > > change so that old clients can still be understood. We'd probably want
> > > to do this in C++, for example, since users would pretty quickly
> > > acquire a new pyarrow version in Spark applications while they are
> > > stuck on an old version of the Java libraries.
> >
> > NB such a backwards compatibility fix would not be forward-compatible,
> > so the PySpark users would need to use a pinned version of pyarrow
> > until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> >
> > >
> > > - Wes
> > >
> > > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <emkornfield@gmail.com
> >
> > wrote:
> > > >
> > > > While working on trying to fix undefined behavior for unaligned
> memory
> > > > accesses [1], I ran into an issue with the IPC specification [2]
> which
> > > > prevents us from ever achieving zero-copy memory mapping and having
> > aligned
> > > > accesses (i.e. clean UBSan runs).
> > > >
> > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > accesses.
> > > >
> > > > In the IPC format we align each message to 8-byte boundaries.  We
> then
> > > > write a int32_t integer to to denote the size of flat buffer
> metadata,
> > > > followed immediately  by the flatbuffer metadata.  This means the
> > > > flatbuffer metadata will never be 8 byte aligned.
> > > >
> > > > Do people care?  A simple fix  would be to use int64_t instead of
> > int32_t
> > > > for length.  However, any fix essentially breaks all previous client
> > > > library versions or incurs a memory copy.
> > > >
> > > > [1] https://github.com/apache/arrow/pull/4757
> > > > [2] https://arrow.apache.org/docs/ipc.html
> >
>

Re: [Discuss] IPC Specification, flatbuffers and unaligned memory accesses

Posted by Wes McKinney <we...@gmail.com>.
hi John,

This is off topic, so please feel free to start a discussion thread
and present a detailed proposal. I believe your use case can be
addressed by pre-allocating record batches and maintaining application
level metadata about what portion of the record batches has been
"filled" (so the unfilled records can be dropped by slicing). I don't
think any change to the binary protocol is warranted

Thanks,
Wes

On Sun, Jun 30, 2019 at 9:03 PM John Muehlhausen <jg...@jgm.org> wrote:
>
> If there is going to be a breaking change to the IPC format, I'd appreciate
> some discussion about an idea I had for RecordBatch metadata.  I previously
> promised to create a discussion thread with an initial write-up but have
> not yet done so.  I will try to do this tomorrow.  (The basic idea is to
> have readers read only N records of a RecordBatch rather than all of them,
> where in the usual case these row counts are the same.  This opens the door
> to reading partially populated RecordBatches which will be extremely useful
> in low-latency streaming applications.)
>
> On Sun, Jun 30, 2019 at 6:36 PM Micah Kornfield <em...@gmail.com>
> wrote:
>
> > At least on the read-side we can make this detectable by using something
> > like <0xffffffff><int32_t size> instead of int64_t.  On the write side we
> > would need some sort of default mode that we could flip on/off if we wanted
> > to maintain compatibility.
> >
> > I should say I think we should fix it.  Undefined behavior is unpaid debt
> > that might never be collected or might cause things to fail in difficult to
> > diagnose ways. And pre-1.0.0 is definitely the time.
> >
> > -Micah
> >
> > On Sun, Jun 30, 2019 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > On Sun, Jun 30, 2019 at 5:14 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >
> > > > hi Micah,
> > > >
> > > > This is definitely unfortunate, I wish we had realized the potential
> > > > implications of having the Flatbuffer message start on a 4-byte
> > > > (rather than 8-byte) boundary. The cost of making such a change now
> > > > would be pretty high since all readers and writers in all languages
> > > > would have to be changed. That being said, the 0.14.0 -> 1.0.0 version
> > > > bump is the last opportunity we have to make a change like this, so we
> > > > might as well discuss it now. Note that particular implementations
> > > > could implement compatibility functions to handle the 4 to 8 byte
> > > > change so that old clients can still be understood. We'd probably want
> > > > to do this in C++, for example, since users would pretty quickly
> > > > acquire a new pyarrow version in Spark applications while they are
> > > > stuck on an old version of the Java libraries.
> > >
> > > NB such a backwards compatibility fix would not be forward-compatible,
> > > so the PySpark users would need to use a pinned version of pyarrow
> > > until Spark upgraded to Arrow 1.0.0. Maybe that's OK
> > >
> > > >
> > > > - Wes
> > > >
> > > > On Sun, Jun 30, 2019 at 3:01 AM Micah Kornfield <emkornfield@gmail.com
> > >
> > > wrote:
> > > > >
> > > > > While working on trying to fix undefined behavior for unaligned
> > memory
> > > > > accesses [1], I ran into an issue with the IPC specification [2]
> > which
> > > > > prevents us from ever achieving zero-copy memory mapping and having
> > > aligned
> > > > > accesses (i.e. clean UBSan runs).
> > > > >
> > > > > Flatbuffer metadata needs 8-byte alignment to guarantee aligned
> > > accesses.
> > > > >
> > > > > In the IPC format we align each message to 8-byte boundaries.  We
> > then
> > > > > write a int32_t integer to to denote the size of flat buffer
> > metadata,
> > > > > followed immediately  by the flatbuffer metadata.  This means the
> > > > > flatbuffer metadata will never be 8 byte aligned.
> > > > >
> > > > > Do people care?  A simple fix  would be to use int64_t instead of
> > > int32_t
> > > > > for length.  However, any fix essentially breaks all previous client
> > > > > library versions or incurs a memory copy.
> > > > >
> > > > > [1] https://github.com/apache/arrow/pull/4757
> > > > > [2] https://arrow.apache.org/docs/ipc.html
> > >
> >