You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2021/07/07 15:50:06 UTC

[DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Retitling and forking the discussion to talk about key value pairs.

What is the byte cost of an empty list?  Another option would be to
introduce a new BinaryKeyValue table and add binary metadata.

On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
natebauernfeind@deephaven.io> wrote:

> Deephaven and I are very supportive of "upgrading" the value half of the kv
> pair to a byte vector. What is the best way to find out if there is
> sufficient interest?
>
>
> I've been stewing on the ideas here around schema evolution, and I realize
> the specific feature I am missing is the ability to encode that a field
> (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
> empty/has-no-data in O(0) cost (yes; for free).
>
> Might there be interest in adding a "field_id" to the FieldNode (which is
> encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible
> upgrade (by either keying off of 0, or explicitly set the field default to
> -1) which would allow the sender to "skip" fields that have 1) FieldNode
> length of zero, and 2) all Buffer's associated at that level (and further
> nested) are also equally empty (i.e. Buffer length is zero).
>
> I understand this concept slightly interferes with RecordBatch's `length`
> field, and that many implementations use that length to resize the
> root-level FieldNodes. The use-case I have in mind has different logical
> lengths per field node; current implementations require sending a
> RecordBatch length of the max length across all root level field nodes. I
> believe this requires a copy of data whenever a field node is too short; I
> don't know if there is a decent solution to this slight inefficiency. I am
> bringing it up because if "skipping a field node when it is empty" is a
> feature, then we may not want to allocate space for those nodes given that
> the record batch length will likely be greater than zero.
>
> On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <we...@gmail.com> wrote:
>
> > On Wed, Jul 7, 2021 at 2:53 PM David Li <ap...@lidavidm.me> wrote:
> > >
> > > From the Flatbuffers internals doc[1] it appears they are the same:
> > "Strings are simply a vector of bytes, and are always null-terminated."
> >
> > I see. I took a look at flatbuffers.h, and it appears that changing
> > this field from string to [byte] would be backward-compatible and
> > forward-compatible except with code that expects a null terminator.
> > This is something we could discuss separately if there were enough
> > interest.
> >
> > > [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html
> > >
> > > -David
> > >
> > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> emkornfield@gmail.com>
> > wrote:
> > > > >
> > > > > >
> > > > > > Right, I had wanted to focus the discussion on Flight as I think
> > schema
> > > > > > evolution or multiplexing streams (more so the latter) is a
> > property of the
> > > > > > transport and not the stream format itself. If we are leaning
> > towards just
> > > > > > schema evolution then maybe it makes sense to discuss it for the
> > IPC stream
> > > > > > format and leverage that in Flight. I'd be interested in what
> > others think.
> > > > >
> > > > > I tend to agree, I think stream multiplexing is likely a transport
> > level
> > > > > issue.  IMO I think schema evolution should be consistent with the
> > IPC
> > > > > stream format  and flight.
> > > > >
> > > > >
> > > > > > Nate: it may be worth starting a separate discussion about more
> > general
> > > > > > metadata in the IPC message. I'm not aware of why key-value
> > metadata was
> > > > > > chosen/if opaque bytes were considered in the past.
> > > > >
> > > > >
> > > > > I think  this was an unfortunate design of the key value metadata
> in
> > > > > Schema.fbs, but I don't think I was around when this decision was
> > made.
> > > >
> > > > I agree that it's unfortunate that we did not use [ byte ] instead of
> > > > string for the value in the KeyValue metadata — I think this was more
> > > > of an oversight than a deliberate choice (e.g. it was not our intent
> > > > to require binary data to be base64-encoded — this is something that
> > > > we have to do when encoding binary data in Thrift KeyValue metadata
> > > > for Parquet, for example). Is the binary representation of [byte]
> > > > different from string?
> > > >
> > > >
> > > >
> > > > > Side Question: Why isn't the IPC stream format a series of the
> flight
> > > > > > protobufs?
> > > > >
> > > > > In addition to what David said, protobufs can't be read directly
> > from a
> > > > > memory-mapped file (they need decoding).  This was one of the
> design
> > > > > considerations of using flatbuffers and IPC Stream/File format.
> > > > >
> > > > > I was thinking Micah's comment is more that whatever we do, it
> > should be
> > > > > > clearly specified and edge cases should be considered, especially
> > if we
> > > > > > might want to 'backport' this into the stream format later.
> > > > >
> > > > >
> > > > > Yes, for dictionaries we just need to be careful to define
> semantics
> > and
> > > > > ensure implementations are validating them with regards to
> > dictionaries.
> > > > > There likely isn't any need to change current implementations
> though.
> > > > >
> > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <li...@apache.org>
> > wrote:
> > > > >
> > > > > > Right, I had wanted to focus the discussion on Flight as I think
> > schema
> > > > > > evolution or multiplexing streams (more so the latter) is a
> > property of the
> > > > > > transport and not the stream format itself. If we are leaning
> > towards just
> > > > > > schema evolution then maybe it makes sense to discuss it for the
> > IPC stream
> > > > > > format and leverage that in Flight. I'd be interested in what
> > others think.
> > > > > >
> > > > > > Especially if we are looking at multiplexing streams - I would
> > wonder if
> > > > > > that's actually better served by making it easier to implement
> > using the
> > > > > > Flight implementation as it stands (by managing concurrent RPC
> > calls and/or
> > > > > > performing the union-of-structs encoding trick for you), instead
> > of having
> > > > > > to change the protocol.
> > > > > >
> > > > > > Nate: it may be worth starting a separate discussion about more
> > general
> > > > > > metadata in the IPC message. I'm not aware of why key-value
> > metadata was
> > > > > > chosen/if opaque bytes were considered in the past. Off the top
> of
> > my head
> > > > > > if it's for on-disk storage and fully application-defined it may
> > make sense
> > > > > > to store as a separate file alongside the Arrow file (indexed by
> > record
> > > > > > batch index) where you can take advantage of whatever format is
> > most
> > > > > > suitable.
> > > > > >
> > > > > > -David
> > > > > >
> > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
> > > > > > > Hi guys,
> > > > > > >
> > > > > > > 1. Regarding IPC vs Flight: in fact my initial suggestion was
> to
> > add this
> > > > > > > feature starting from the IPC(I moved initial write up steps to
> > the
> > > > > > bottom
> > > > > > > of the doc). Afterwards David suggested focusing on Flight and
> > that's how
> > > > > > > we ended up with the protobufs change in the proposal. This
> > being said I
> > > > > > do
> > > > > > > think that the place where this should be impemented is a good
> > question
> > > > > > on
> > > > > > > its own. Maybe it makes sense to have this kind of a feature in
> > IPC and
> > > > > > > somehow use it in Flight, maybe not.
> > > > > > > 2. The point about dictionaries deserves a dedicated section in
> > the
> > > > > > > proposal. Nate and David brought it up and shared some
> insights.
> > I'll try
> > > > > > > to aggregate them and we can continue the discussion form
> there.
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Gosh
> > > > > > >
> > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> > > > > > natebauernfeind@deephaven.io>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > >
> > > > > > > > > > > makes it more difficult to bring schema evolution back
> > into the
> > > > > > > > > > > IPC Stream format (i.e. it would live only in flight)
> > > > > > > > > >
> > > > > > > > > > Gosh's proposal extends the flatbuffer structures not the
> > > > > > protobufs.
> > > > > > > > Can
> > > > > > > > > > you help me understand how difficult it would be to bring
> > the
> > > > > > > > `schema_id`
> > > > > > > > > > approach to the IPC stream format?
> > > > > > > > >
> > > > > > > > > I thought we were talking solely about the Flight Protobuf
> > > > > > definitions -
> > > > > > > > > not the Flatbuffers (and the Google doc at least only talks
> > about the
> > > > > > > > > Protobufs).
> > > > > > > > >
> > > > > > > >
> > > > > > > > I somehow missed that schema_id is being added to protobuf in
> > the
> > > > > > document.
> > > > > > > > It feels to me that the schema_id is a property that would
> > ideally only
> > > > > > > > apply to the RecordBatch. I better understand Micah's
> > dictionary
> > > > > > concerns,
> > > > > > > > now, too.
> > > > > > > >
> > > > > > > > > Side Question: Why isn't the IPC stream format a series of
> > the flight
> > > > > > > > > > protobufs? It's a real shame that there is no standard
> way
> > to
> > > > > > > > > > capture/replay a stream with app_metadata. (Obviously
> > ignoring the
> > > > > > > > > > annoyances around protobuf wrapping flatbuffers.)
> > > > > > > > >
> > > > > > > > > The IPC format was defined long before Flight, and Flight's
> > > > > > app_metadata
> > > > > > > > > was added after Flight's initial definition. Note an IPC
> > message does
> > > > > > > > have
> > > > > > > > > a provision for key-value metadata, though I think APIs for
> > that are
> > > > > > not
> > > > > > > > > fully exposed. (See ARROW-6940:
> > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and
> > despite my
> > > > > > comments
> > > > > > > > > there perhaps we need to unify or at least consider how
> > Flight's
> > > > > > > > > app_metadata relates to the IPC message custom_metadata.
> Also
> > > > > > perhaps see
> > > > > > > > > ARROW-1059.)
> > > > > > > > >
> > > > > > > >
> > > > > > > > KeyValue unfortunately is string to string. In flatbuffer
> > strings are
> > > > > > only
> > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other hand is
> > opaque
> > > > > > bytes.
> > > > > > > > The latter is a bit more useful.
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > >
> > > > > >
> > > >
> >
>
>
> --
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Phillip Cloud <cp...@gmail.com>.
I don't think memcpy is feasible. The bytes may be different for different
languages.

Many languages' compilers reorder struct fields and pad structs for
efficiency reasons so the bytes in metadata coming from language X may be
meaningless to language Y.


On Wed, Feb 9, 2022, 08:41 Dewey Dunnington <de...@voltrondata.com> wrote:

> I'm new to this discussion and Arrow generally and appreciate Joris
> bringing this up. While the forward- and backward-compatability of this is
> over my head, I think that it is important to provide a path for extension
> types to have serialized metadata as binary because the alternatives are
> (in my opinion) not great and might lead to extension developers inventing
> their own key/value encoding.
>
> With binary on the table there are options like protobuf, flatbuffers, or
> memcpy of a struct that extension type implementations can use depending on
> the complexity of the extension metadata that needs to be communicated and
> whether or not the metadata is intended to be used in compiled code (i.e.,
> are those values needed to do anything useful with the bytes in the
> buffers?).
>
> Without binary on the table, an extension type implementation has to vendor
> in a JSON reader/writer, vendor in a base64 encoder/decoder, or invent some
> custom serialization that's easier to parse than either of those.
>
> Cheers,
>
> -dewey
>
> On Fri, Feb 4, 2022 at 7:06 AM Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
>
> > Reviving this thread, I don't think anything happened in the meantime on
> > this topic?
> >
> > From rereading the thread, it seems David mentioned two possible ideas:
> > - A new [byte] binary_value field in the existing KeyValue type, next to
> > the existing string value field. And if you have valid string metadata,
> the
> > same value can be put in both fields by reusing the offset
> > - Relax the existing string value field to [byte], but also require
> > implementations to null-terminate binary values regardless to ensure
> > compatibility for old readers
> >  - And in addition to those options, there is also the safest option of
> > adding a new BinaryKeyValue type and add that next to KeyValue everywhere
> > this is used.
> >
> > And we need to further assess which option is viable given the
> > forward/backward compatibility constraints?
> >
> > I want to point out though that in pyarrow we already _do_ use binary
> > metadata for extension types for several years. The built-in
> > PyExtensionType class uses a pickle dump as the serialized metadata (so
> > which is put in the ARROW:extension:metadata key in the key-value
> metadata
> > of the Field), and the bytes from a pickle dump are certainly not valid
> > UTF-8 I think.
> > I don't know how much this is used in the wild. In pandas, our arrow
> > extension types don't use this pickle-based PyExtensionType but define
> > their own serialization which currently returns a valid UTF-8 string. But
> > for example the Ray project seems to use PyExtensionType (
> >
> >
> https://docs.ray.io/en/latest/_modules/ray/data/extensions/tensor_extension.html#ArrowTensorType
> > ).
> >
> > Joris
> >
> > On Thu, 2 Sept 2021 at 07:01, Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > It still sounds like adding a new type might be the safest approach
> (and
> > > marking the old type as discouraged).
> > >
> > > On Mon, Aug 23, 2021 at 11:18 AM David Li <li...@apache.org> wrote:
> > >
> > > > I believe so.
> > > >
> > > > The encoding of a string in Flatbuffers is [byte] with a null
> > terminator
> > > > not included in the length, so old files should still be readable
> (they
> > > > would simply not see the terminator anymore). And conversely,
> > continuing
> > > to
> > > > write the null terminator means new files should still be readable by
> > > > existing implementations, so long as they don't have binary metadata
> > (but
> > > > we could increment the metadata version before allowing that). There
> > > still
> > > > remains the issue of non-UTF8 data inside the "string" value; on that
> > > > point, this is a non-answer (or I guess, it canonicalizes the current
> > > > unintentional behavior).
> > > >
> > > > It would complicate the Java API as that currently works with
> Strings,
> > > but
> > > > that will be an issue for any attempt to add binary metadata.
> > > >
> > > > -David
> > > >
> > > > On Mon, Aug 23, 2021, at 13:40, Antoine Pitrou wrote:
> > > > >
> > > > > Le 23/08/2021 à 17:52, David Li a écrit :
> > > > > > Another way forward might be to relax the value type to [byte],
> but
> > > > also require implementations to null-terminate binary values
> > regardless.
> > > > The C++ Flatbuffers implementation does this already [1] (though not
> > the
> > > > Java one [2]). Old implementations validating UTF8-ness would still
> be
> > > > unable to read anything with binary metadata (which is the case
> they're
> > > > already in), but implementations just validating for null-termination
> > > would
> > > > continue to work. And again, I believe this is in-spec for
> Flatbuffers
> > > > since string lengths don't include the null terminator.
> > > > > >
> > > > > > [1]:
> > > >
> > >
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
> > > > > > [2]:
> > > >
> > >
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
> > > > > > Though it seems we could emulate C++ here by calling addByte(0)
> > > > ourselves before createByteVector.
> > > > >
> > > > > Would that allow for both forwards and backwards compatibility when
> > the
> > > > > metadata is valid UTF8?
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Dewey Dunnington <de...@voltrondata.com>.
I'm new to this discussion and Arrow generally and appreciate Joris
bringing this up. While the forward- and backward-compatability of this is
over my head, I think that it is important to provide a path for extension
types to have serialized metadata as binary because the alternatives are
(in my opinion) not great and might lead to extension developers inventing
their own key/value encoding.

With binary on the table there are options like protobuf, flatbuffers, or
memcpy of a struct that extension type implementations can use depending on
the complexity of the extension metadata that needs to be communicated and
whether or not the metadata is intended to be used in compiled code (i.e.,
are those values needed to do anything useful with the bytes in the
buffers?).

Without binary on the table, an extension type implementation has to vendor
in a JSON reader/writer, vendor in a base64 encoder/decoder, or invent some
custom serialization that's easier to parse than either of those.

Cheers,

-dewey

On Fri, Feb 4, 2022 at 7:06 AM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Reviving this thread, I don't think anything happened in the meantime on
> this topic?
>
> From rereading the thread, it seems David mentioned two possible ideas:
> - A new [byte] binary_value field in the existing KeyValue type, next to
> the existing string value field. And if you have valid string metadata, the
> same value can be put in both fields by reusing the offset
> - Relax the existing string value field to [byte], but also require
> implementations to null-terminate binary values regardless to ensure
> compatibility for old readers
>  - And in addition to those options, there is also the safest option of
> adding a new BinaryKeyValue type and add that next to KeyValue everywhere
> this is used.
>
> And we need to further assess which option is viable given the
> forward/backward compatibility constraints?
>
> I want to point out though that in pyarrow we already _do_ use binary
> metadata for extension types for several years. The built-in
> PyExtensionType class uses a pickle dump as the serialized metadata (so
> which is put in the ARROW:extension:metadata key in the key-value metadata
> of the Field), and the bytes from a pickle dump are certainly not valid
> UTF-8 I think.
> I don't know how much this is used in the wild. In pandas, our arrow
> extension types don't use this pickle-based PyExtensionType but define
> their own serialization which currently returns a valid UTF-8 string. But
> for example the Ray project seems to use PyExtensionType (
>
> https://docs.ray.io/en/latest/_modules/ray/data/extensions/tensor_extension.html#ArrowTensorType
> ).
>
> Joris
>
> On Thu, 2 Sept 2021 at 07:01, Micah Kornfield <em...@gmail.com>
> wrote:
>
> > It still sounds like adding a new type might be the safest approach (and
> > marking the old type as discouraged).
> >
> > On Mon, Aug 23, 2021 at 11:18 AM David Li <li...@apache.org> wrote:
> >
> > > I believe so.
> > >
> > > The encoding of a string in Flatbuffers is [byte] with a null
> terminator
> > > not included in the length, so old files should still be readable (they
> > > would simply not see the terminator anymore). And conversely,
> continuing
> > to
> > > write the null terminator means new files should still be readable by
> > > existing implementations, so long as they don't have binary metadata
> (but
> > > we could increment the metadata version before allowing that). There
> > still
> > > remains the issue of non-UTF8 data inside the "string" value; on that
> > > point, this is a non-answer (or I guess, it canonicalizes the current
> > > unintentional behavior).
> > >
> > > It would complicate the Java API as that currently works with Strings,
> > but
> > > that will be an issue for any attempt to add binary metadata.
> > >
> > > -David
> > >
> > > On Mon, Aug 23, 2021, at 13:40, Antoine Pitrou wrote:
> > > >
> > > > Le 23/08/2021 à 17:52, David Li a écrit :
> > > > > Another way forward might be to relax the value type to [byte], but
> > > also require implementations to null-terminate binary values
> regardless.
> > > The C++ Flatbuffers implementation does this already [1] (though not
> the
> > > Java one [2]). Old implementations validating UTF8-ness would still be
> > > unable to read anything with binary metadata (which is the case they're
> > > already in), but implementations just validating for null-termination
> > would
> > > continue to work. And again, I believe this is in-spec for Flatbuffers
> > > since string lengths don't include the null terminator.
> > > > >
> > > > > [1]:
> > >
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
> > > > > [2]:
> > >
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
> > > > > Though it seems we could emulate C++ here by calling addByte(0)
> > > ourselves before createByteVector.
> > > >
> > > > Would that allow for both forwards and backwards compatibility when
> the
> > > > metadata is valid UTF8?
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Joris Van den Bossche <jo...@gmail.com>.
Reviving this thread, I don't think anything happened in the meantime on
this topic?

From rereading the thread, it seems David mentioned two possible ideas:
- A new [byte] binary_value field in the existing KeyValue type, next to
the existing string value field. And if you have valid string metadata, the
same value can be put in both fields by reusing the offset
- Relax the existing string value field to [byte], but also require
implementations to null-terminate binary values regardless to ensure
compatibility for old readers
 - And in addition to those options, there is also the safest option of
adding a new BinaryKeyValue type and add that next to KeyValue everywhere
this is used.

And we need to further assess which option is viable given the
forward/backward compatibility constraints?

I want to point out though that in pyarrow we already _do_ use binary
metadata for extension types for several years. The built-in
PyExtensionType class uses a pickle dump as the serialized metadata (so
which is put in the ARROW:extension:metadata key in the key-value metadata
of the Field), and the bytes from a pickle dump are certainly not valid
UTF-8 I think.
I don't know how much this is used in the wild. In pandas, our arrow
extension types don't use this pickle-based PyExtensionType but define
their own serialization which currently returns a valid UTF-8 string. But
for example the Ray project seems to use PyExtensionType (
https://docs.ray.io/en/latest/_modules/ray/data/extensions/tensor_extension.html#ArrowTensorType
).

Joris

On Thu, 2 Sept 2021 at 07:01, Micah Kornfield <em...@gmail.com> wrote:

> It still sounds like adding a new type might be the safest approach (and
> marking the old type as discouraged).
>
> On Mon, Aug 23, 2021 at 11:18 AM David Li <li...@apache.org> wrote:
>
> > I believe so.
> >
> > The encoding of a string in Flatbuffers is [byte] with a null terminator
> > not included in the length, so old files should still be readable (they
> > would simply not see the terminator anymore). And conversely, continuing
> to
> > write the null terminator means new files should still be readable by
> > existing implementations, so long as they don't have binary metadata (but
> > we could increment the metadata version before allowing that). There
> still
> > remains the issue of non-UTF8 data inside the "string" value; on that
> > point, this is a non-answer (or I guess, it canonicalizes the current
> > unintentional behavior).
> >
> > It would complicate the Java API as that currently works with Strings,
> but
> > that will be an issue for any attempt to add binary metadata.
> >
> > -David
> >
> > On Mon, Aug 23, 2021, at 13:40, Antoine Pitrou wrote:
> > >
> > > Le 23/08/2021 à 17:52, David Li a écrit :
> > > > Another way forward might be to relax the value type to [byte], but
> > also require implementations to null-terminate binary values regardless.
> > The C++ Flatbuffers implementation does this already [1] (though not the
> > Java one [2]). Old implementations validating UTF8-ness would still be
> > unable to read anything with binary metadata (which is the case they're
> > already in), but implementations just validating for null-termination
> would
> > continue to work. And again, I believe this is in-spec for Flatbuffers
> > since string lengths don't include the null terminator.
> > > >
> > > > [1]:
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
> > > > [2]:
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
> > > > Though it seems we could emulate C++ here by calling addByte(0)
> > ourselves before createByteVector.
> > >
> > > Would that allow for both forwards and backwards compatibility when the
> > > metadata is valid UTF8?
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> >
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Micah Kornfield <em...@gmail.com>.
It still sounds like adding a new type might be the safest approach (and
marking the old type as discouraged).

On Mon, Aug 23, 2021 at 11:18 AM David Li <li...@apache.org> wrote:

> I believe so.
>
> The encoding of a string in Flatbuffers is [byte] with a null terminator
> not included in the length, so old files should still be readable (they
> would simply not see the terminator anymore). And conversely, continuing to
> write the null terminator means new files should still be readable by
> existing implementations, so long as they don't have binary metadata (but
> we could increment the metadata version before allowing that). There still
> remains the issue of non-UTF8 data inside the "string" value; on that
> point, this is a non-answer (or I guess, it canonicalizes the current
> unintentional behavior).
>
> It would complicate the Java API as that currently works with Strings, but
> that will be an issue for any attempt to add binary metadata.
>
> -David
>
> On Mon, Aug 23, 2021, at 13:40, Antoine Pitrou wrote:
> >
> > Le 23/08/2021 à 17:52, David Li a écrit :
> > > Another way forward might be to relax the value type to [byte], but
> also require implementations to null-terminate binary values regardless.
> The C++ Flatbuffers implementation does this already [1] (though not the
> Java one [2]). Old implementations validating UTF8-ness would still be
> unable to read anything with binary metadata (which is the case they're
> already in), but implementations just validating for null-termination would
> continue to work. And again, I believe this is in-spec for Flatbuffers
> since string lengths don't include the null terminator.
> > >
> > > [1]:
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
> > > [2]:
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
> > > Though it seems we could emulate C++ here by calling addByte(0)
> ourselves before createByteVector.
> >
> > Would that allow for both forwards and backwards compatibility when the
> > metadata is valid UTF8?
> >
> > Regards
> >
> > Antoine.
> >
> >
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by David Li <li...@apache.org>.
I believe so.

The encoding of a string in Flatbuffers is [byte] with a null terminator not included in the length, so old files should still be readable (they would simply not see the terminator anymore). And conversely, continuing to write the null terminator means new files should still be readable by existing implementations, so long as they don't have binary metadata (but we could increment the metadata version before allowing that). There still remains the issue of non-UTF8 data inside the "string" value; on that point, this is a non-answer (or I guess, it canonicalizes the current unintentional behavior).

It would complicate the Java API as that currently works with Strings, but that will be an issue for any attempt to add binary metadata.

-David

On Mon, Aug 23, 2021, at 13:40, Antoine Pitrou wrote:
> 
> Le 23/08/2021 à 17:52, David Li a écrit :
> > Another way forward might be to relax the value type to [byte], but also require implementations to null-terminate binary values regardless. The C++ Flatbuffers implementation does this already [1] (though not the Java one [2]). Old implementations validating UTF8-ness would still be unable to read anything with binary metadata (which is the case they're already in), but implementations just validating for null-termination would continue to work. And again, I believe this is in-spec for Flatbuffers since string lengths don't include the null terminator.
> > 
> > [1]: https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
> > [2]: https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
> > Though it seems we could emulate C++ here by calling addByte(0) ourselves before createByteVector.
> 
> Would that allow for both forwards and backwards compatibility when the 
> metadata is valid UTF8?
> 
> Regards
> 
> Antoine.
> 
> 

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Antoine Pitrou <an...@python.org>.
Le 23/08/2021 à 17:52, David Li a écrit :
> Another way forward might be to relax the value type to [byte], but also require implementations to null-terminate binary values regardless. The C++ Flatbuffers implementation does this already [1] (though not the Java one [2]). Old implementations validating UTF8-ness would still be unable to read anything with binary metadata (which is the case they're already in), but implementations just validating for null-termination would continue to work. And again, I believe this is in-spec for Flatbuffers since string lengths don't include the null terminator.
> 
> [1]: https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
> [2]: https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
> Though it seems we could emulate C++ here by calling addByte(0) ourselves before createByteVector.

Would that allow for both forwards and backwards compatibility when the 
metadata is valid UTF8?

Regards

Antoine.


Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by David Li <li...@apache.org>.
Another way forward might be to relax the value type to [byte], but also require implementations to null-terminate binary values regardless. The C++ Flatbuffers implementation does this already [1] (though not the Java one [2]). Old implementations validating UTF8-ness would still be unable to read anything with binary metadata (which is the case they're already in), but implementations just validating for null-termination would continue to work. And again, I believe this is in-spec for Flatbuffers since string lengths don't include the null terminator.

[1]: https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
[2]: https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
Though it seems we could emulate C++ here by calling addByte(0) ourselves before createByteVector.

-David

On Wed, Aug 18, 2021, at 13:20, David Li wrote:
> This isn't too thought out yet but:
> 1.  Any file which stuffs binary data into the value is already unreadable for anyone directly using Flatbuffers. So we can specify that the field must be valid UTF-8, but implementations can permit relaxed validation/reading as binary data instead in order to read old files for the current metadata version.
> 2. For the new version, we can add a [byte] field to KeyValueMetadata alongside the string field. Since strings in Flatbuffers don't include the null terminator in their length, implementations can reuse the offset to allow values which are valid UTF-8 to be placed in both fields at a cost of only 4 bytes, so old consumers can still see UTF-8 metadata values. Actual binary values would only exist in the new field, and old consumers would see a null value (since we didn't declare it required); if both fields have a value, the binary value takes precedence. I think this at least leaves our read APIs alone since they already use bytestrings.
> A current extension type that uses binary metadata would still be a problem, though…
> 
> -David
> 
> On Tue, Aug 17, 2021, at 00:08, Micah Kornfield wrote:
> > I agree with you any thoughts on a way forward for at least hardening the
> > spec (or should this be done at the same time as adding the new field)?
> > 
> > On Mon, Aug 16, 2021 at 1:45 AM Wes McKinney <we...@gmail.com> wrote:
> > 
> > > I've been poking around the project, and I'm growing concerned that
> > > our use of the KeyValue field has already been non-compliant in many
> > > cases since we do not validate UTF8-ness. Since we also use KeyValue
> > > to handle opaque data serialization for extension types [1], the fact
> > > that the specification does not clarify that binary data (such as the
> > > output of ExtensionType::Serialize) must be base64-encoded or similar
> > > makes this a bit of a minefield at the moment.
> > >
> > > It seems that there are no particularly excellent solutions, and
> > > maintaining the status quo (having now identified these
> > > inconsistencies / vagueries in at least the C++ implementation) is
> > > probably not a good idea either. In Parquet, where we store a
> > > serialized binary Arrow schema, we have to base64-encode [2]
> > >
> > > [1]:
> > > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> > > [2]:
> > > https://github.com/apache/arrow/blob/e990d177b1f1dec962315487682f613d46be573c/cpp/src/parquet/arrow/writer.cc#L423
> > >
> > > On Tue, Aug 10, 2021 at 3:27 PM Wes McKinney <we...@gmail.com> wrote:
> > > >
> > > > Ah, that's definitely a no-go then (I believe we verify messages
> > > > unconditionally in C++). That's unfortunate (and I feel responsible
> > > > for missing this, but I suppose we had a lot of opportunities to fix
> > > > it prior to the 1.0.0 format version) — so to have actual binary
> > > > values (which was the intention in the first place for the metadata)
> > > > we would need to add a new metadata field.
> > > >
> > > > On Tue, Aug 10, 2021 at 6:53 AM Micah Kornfield <em...@gmail.com>
> > > wrote:
> > > > >
> > > > > One issue with changing it to byte is it would effectively break any
> > > reader that is validating flatbuffer data, because flatbuffers verifies
> > > null termination [1].  While this might comply with forward compatibility
> > > guarantees it seems like a pretty large blast radius.
> > > > >
> > > > > [1]
> > > https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L2457
> > > > >
> > > > > On Mon, Jul 12, 2021 at 12:38 PM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > > >>
> > > > >> pyarrow at least treats the KeyValue values as binary and not UTF-8.
> > > > >>
> > > > >> On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield <
> > > emkornfield@gmail.com> wrote:
> > > > >> >
> > > > >> > I think other languages (e.g. java, python) might make more of
> > > distinction between utf-8 compatible strings and raw bytes.  For python it
> > > might be less of a concern if the c++ wrapper already makes the value field
> > > look like a bytes field
> > > > >> >
> > > > >> > On Sunday, July 11, 2021, Wes McKinney <we...@gmail.com> wrote:
> > > > >> >>
> > > > >> >> We could certainly "upgrade" KeyValue to have a binary value field
> > > > >> >> everywhere KeyValue is used, but there is some risk of code in the
> > > > >> >> wild expecting there to be a null terminator after the string data.
> > > > >> >> The Flatbuffers-generated accessor APIs do not depend on the
> > > existence
> > > > >> >> of the null terminator, though. Not ideal, but I would not be
> > > thrilled
> > > > >> >> about adding an extra [ BinaryKeyValue ] everyplace we currently
> > > have
> > > > >> >> [ KeyValue ].
> > > > >> >>
> > > > >> >> That said, I doubt that we have any endogenous forward
> > > compatibility
> > > > >> >> problems related to this in Apache Arrow-maintained libraries, the
> > > > >> >> risk would come from users who are interacting with the Flatbuffers
> > > > >> >> data manually / without using one of our libraries. We could
> > > implement
> > > > >> >> the changes and run a set of forward compatibility integration
> > > tests
> > > > >> >> to see if anyone of our released libraries have an issue.
> > > > >> >>
> > > > >> >> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > > >> >> >
> > > > >> >> > The cost of an empty vector in Flatbuffers appears to be 4 bytes.
> > > > >> >> >
> > > > >> >> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <
> > > emkornfield@gmail.com> wrote:
> > > > >> >> > >
> > > > >> >> > > Retitling and forking the discussion to talk about key value
> > > pairs.
> > > > >> >> > >
> > > > >> >> > > What is the byte cost of an empty list?  Another option would
> > > be to
> > > > >> >> > > introduce a new BinaryKeyValue table and add binary metadata.
> > > > >> >> > >
> > > > >> >> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
> > > > >> >> > > natebauernfeind@deephaven.io> wrote:
> > > > >> >> > >
> > > > >> >> > > > Deephaven and I are very supportive of "upgrading" the value
> > > half of the kv
> > > > >> >> > > > pair to a byte vector. What is the best way to find out if
> > > there is
> > > > >> >> > > > sufficient interest?
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > > I've been stewing on the ideas here around schema evolution,
> > > and I realize
> > > > >> >> > > > the specific feature I am missing is the ability to encode
> > > that a field
> > > > >> >> > > > (i.e. its FieldNode and accompanying Buffers in the
> > > RecordBatch) is
> > > > >> >> > > > empty/has-no-data in O(0) cost (yes; for free).
> > > > >> >> > > >
> > > > >> >> > > > Might there be interest in adding a "field_id" to the
> > > FieldNode (which is
> > > > >> >> > > > encoded on the RecordBatch flatbuffer)? I see a simple
> > > forward-compatible
> > > > >> >> > > > upgrade (by either keying off of 0, or explicitly set the
> > > field default to
> > > > >> >> > > > -1) which would allow the sender to "skip" fields that have
> > > 1) FieldNode
> > > > >> >> > > > length of zero, and 2) all Buffer's associated at that level
> > > (and further
> > > > >> >> > > > nested) are also equally empty (i.e. Buffer length is zero).
> > > > >> >> > > >
> > > > >> >> > > > I understand this concept slightly interferes with
> > > RecordBatch's `length`
> > > > >> >> > > > field, and that many implementations use that length to
> > > resize the
> > > > >> >> > > > root-level FieldNodes. The use-case I have in mind has
> > > different logical
> > > > >> >> > > > lengths per field node; current implementations require
> > > sending a
> > > > >> >> > > > RecordBatch length of the max length across all root level
> > > field nodes. I
> > > > >> >> > > > believe this requires a copy of data whenever a field node
> > > is too short; I
> > > > >> >> > > > don't know if there is a decent solution to this slight
> > > inefficiency. I am
> > > > >> >> > > > bringing it up because if "skipping a field node when it is
> > > empty" is a
> > > > >> >> > > > feature, then we may not want to allocate space for those
> > > nodes given that
> > > > >> >> > > > the record batch length will likely be greater than zero.
> > > > >> >> > > >
> > > > >> >> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <
> > > wesmckinn@gmail.com> wrote:
> > > > >> >> > > >
> > > > >> >> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <
> > > apache@lidavidm.me> wrote:
> > > > >> >> > > > > >
> > > > >> >> > > > > > From the Flatbuffers internals doc[1] it appears they
> > > are the same:
> > > > >> >> > > > > "Strings are simply a vector of bytes, and are always
> > > null-terminated."
> > > > >> >> > > > >
> > > > >> >> > > > > I see. I took a look at flatbuffers.h, and it appears that
> > > changing
> > > > >> >> > > > > this field from string to [byte] would be
> > > backward-compatible and
> > > > >> >> > > > > forward-compatible except with code that expects a null
> > > terminator.
> > > > >> >> > > > > This is something we could discuss separately if there
> > > were enough
> > > > >> >> > > > > interest.
> > > > >> >> > > > >
> > > > >> >> > > > > > [1]:
> > > https://google.github.io/flatbuffers/flatbuffers_internals.html
> > > > >> >> > > > > >
> > > > >> >> > > > > > -David
> > > > >> >> > > > > >
> > > > >> >> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > > > >> >> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> > > > >> >> > > > emkornfield@gmail.com>
> > > > >> >> > > > > wrote:
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > Right, I had wanted to focus the discussion on
> > > Flight as I think
> > > > >> >> > > > > schema
> > > > >> >> > > > > > > > > evolution or multiplexing streams (more so the
> > > latter) is a
> > > > >> >> > > > > property of the
> > > > >> >> > > > > > > > > transport and not the stream format itself. If we
> > > are leaning
> > > > >> >> > > > > towards just
> > > > >> >> > > > > > > > > schema evolution then maybe it makes sense to
> > > discuss it for the
> > > > >> >> > > > > IPC stream
> > > > >> >> > > > > > > > > format and leverage that in Flight. I'd be
> > > interested in what
> > > > >> >> > > > > others think.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > I tend to agree, I think stream multiplexing is
> > > likely a transport
> > > > >> >> > > > > level
> > > > >> >> > > > > > > > issue.  IMO I think schema evolution should be
> > > consistent with the
> > > > >> >> > > > > IPC
> > > > >> >> > > > > > > > stream format  and flight.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > > Nate: it may be worth starting a separate
> > > discussion about more
> > > > >> >> > > > > general
> > > > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why
> > > key-value
> > > > >> >> > > > > metadata was
> > > > >> >> > > > > > > > > chosen/if opaque bytes were considered in the past.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > I think  this was an unfortunate design of the key
> > > value metadata
> > > > >> >> > > > in
> > > > >> >> > > > > > > > Schema.fbs, but I don't think I was around when this
> > > decision was
> > > > >> >> > > > > made.
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > I agree that it's unfortunate that we did not use [
> > > byte ] instead of
> > > > >> >> > > > > > > string for the value in the KeyValue metadata — I
> > > think this was more
> > > > >> >> > > > > > > of an oversight than a deliberate choice (e.g. it was
> > > not our intent
> > > > >> >> > > > > > > to require binary data to be base64-encoded — this is
> > > something that
> > > > >> >> > > > > > > we have to do when encoding binary data in Thrift
> > > KeyValue metadata
> > > > >> >> > > > > > > for Parquet, for example). Is the binary
> > > representation of [byte]
> > > > >> >> > > > > > > different from string?
> > > > >> >> > > > > > >
> > > > >> >> > > > > > >
> > > > >> >> > > > > > >
> > > > >> >> > > > > > > > Side Question: Why isn't the IPC stream format a
> > > series of the
> > > > >> >> > > > flight
> > > > >> >> > > > > > > > > protobufs?
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > In addition to what David said, protobufs can't be
> > > read directly
> > > > >> >> > > > > from a
> > > > >> >> > > > > > > > memory-mapped file (they need decoding).  This was
> > > one of the
> > > > >> >> > > > design
> > > > >> >> > > > > > > > considerations of using flatbuffers and IPC
> > > Stream/File format.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > I was thinking Micah's comment is more that whatever
> > > we do, it
> > > > >> >> > > > > should be
> > > > >> >> > > > > > > > > clearly specified and edge cases should be
> > > considered, especially
> > > > >> >> > > > > if we
> > > > >> >> > > > > > > > > might want to 'backport' this into the stream
> > > format later.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > Yes, for dictionaries we just need to be careful to
> > > define
> > > > >> >> > > > semantics
> > > > >> >> > > > > and
> > > > >> >> > > > > > > > ensure implementations are validating them with
> > > regards to
> > > > >> >> > > > > dictionaries.
> > > > >> >> > > > > > > > There likely isn't any need to change current
> > > implementations
> > > > >> >> > > > though.
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <
> > > lidavidm@apache.org>
> > > > >> >> > > > > wrote:
> > > > >> >> > > > > > > >
> > > > >> >> > > > > > > > > Right, I had wanted to focus the discussion on
> > > Flight as I think
> > > > >> >> > > > > schema
> > > > >> >> > > > > > > > > evolution or multiplexing streams (more so the
> > > latter) is a
> > > > >> >> > > > > property of the
> > > > >> >> > > > > > > > > transport and not the stream format itself. If we
> > > are leaning
> > > > >> >> > > > > towards just
> > > > >> >> > > > > > > > > schema evolution then maybe it makes sense to
> > > discuss it for the
> > > > >> >> > > > > IPC stream
> > > > >> >> > > > > > > > > format and leverage that in Flight. I'd be
> > > interested in what
> > > > >> >> > > > > others think.
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > Especially if we are looking at multiplexing
> > > streams - I would
> > > > >> >> > > > > wonder if
> > > > >> >> > > > > > > > > that's actually better served by making it easier
> > > to implement
> > > > >> >> > > > > using the
> > > > >> >> > > > > > > > > Flight implementation as it stands (by managing
> > > concurrent RPC
> > > > >> >> > > > > calls and/or
> > > > >> >> > > > > > > > > performing the union-of-structs encoding trick for
> > > you), instead
> > > > >> >> > > > > of having
> > > > >> >> > > > > > > > > to change the protocol.
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > Nate: it may be worth starting a separate
> > > discussion about more
> > > > >> >> > > > > general
> > > > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why
> > > key-value
> > > > >> >> > > > > metadata was
> > > > >> >> > > > > > > > > chosen/if opaque bytes were considered in the
> > > past. Off the top
> > > > >> >> > > > of
> > > > >> >> > > > > my head
> > > > >> >> > > > > > > > > if it's for on-disk storage and fully
> > > application-defined it may
> > > > >> >> > > > > make sense
> > > > >> >> > > > > > > > > to store as a separate file alongside the Arrow
> > > file (indexed by
> > > > >> >> > > > > record
> > > > >> >> > > > > > > > > batch index) where you can take advantage of
> > > whatever format is
> > > > >> >> > > > > most
> > > > >> >> > > > > > > > > suitable.
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > -David
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan
> > > wrote:
> > > > >> >> > > > > > > > > > Hi guys,
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial
> > > suggestion was
> > > > >> >> > > > to
> > > > >> >> > > > > add this
> > > > >> >> > > > > > > > > > feature starting from the IPC(I moved initial
> > > write up steps to
> > > > >> >> > > > > the
> > > > >> >> > > > > > > > > bottom
> > > > >> >> > > > > > > > > > of the doc). Afterwards David suggested focusing
> > > on Flight and
> > > > >> >> > > > > that's how
> > > > >> >> > > > > > > > > > we ended up with the protobufs change in the
> > > proposal. This
> > > > >> >> > > > > being said I
> > > > >> >> > > > > > > > > do
> > > > >> >> > > > > > > > > > think that the place where this should be
> > > impemented is a good
> > > > >> >> > > > > question
> > > > >> >> > > > > > > > > on
> > > > >> >> > > > > > > > > > its own. Maybe it makes sense to have this kind
> > > of a feature in
> > > > >> >> > > > > IPC and
> > > > >> >> > > > > > > > > > somehow use it in Flight, maybe not.
> > > > >> >> > > > > > > > > > 2. The point about dictionaries deserves a
> > > dedicated section in
> > > > >> >> > > > > the
> > > > >> >> > > > > > > > > > proposal. Nate and David brought it up and
> > > shared some
> > > > >> >> > > > insights.
> > > > >> >> > > > > I'll try
> > > > >> >> > > > > > > > > > to aggregate them and we can continue the
> > > discussion form
> > > > >> >> > > > there.
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > Cheers,
> > > > >> >> > > > > > > > > > Gosh
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> > > > >> >> > > > > > > > > natebauernfeind@deephaven.io>
> > > > >> >> > > > > > > > > > wrote:
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > > > > >
> > > > >> >> > > > > > > > > > > > > > makes it more difficult to bring schema
> > > evolution back
> > > > >> >> > > > > into the
> > > > >> >> > > > > > > > > > > > > > IPC Stream format (i.e. it would live
> > > only in flight)
> > > > >> >> > > > > > > > > > > > >
> > > > >> >> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer
> > > structures not the
> > > > >> >> > > > > > > > > protobufs.
> > > > >> >> > > > > > > > > > > Can
> > > > >> >> > > > > > > > > > > > > you help me understand how difficult it
> > > would be to bring
> > > > >> >> > > > > the
> > > > >> >> > > > > > > > > > > `schema_id`
> > > > >> >> > > > > > > > > > > > > approach to the IPC stream format?
> > > > >> >> > > > > > > > > > > >
> > > > >> >> > > > > > > > > > > > I thought we were talking solely about the
> > > Flight Protobuf
> > > > >> >> > > > > > > > > definitions -
> > > > >> >> > > > > > > > > > > > not the Flatbuffers (and the Google doc at
> > > least only talks
> > > > >> >> > > > > about the
> > > > >> >> > > > > > > > > > > > Protobufs).
> > > > >> >> > > > > > > > > > > >
> > > > >> >> > > > > > > > > > >
> > > > >> >> > > > > > > > > > > I somehow missed that schema_id is being added
> > > to protobuf in
> > > > >> >> > > > > the
> > > > >> >> > > > > > > > > document.
> > > > >> >> > > > > > > > > > > It feels to me that the schema_id is a
> > > property that would
> > > > >> >> > > > > ideally only
> > > > >> >> > > > > > > > > > > apply to the RecordBatch. I better understand
> > > Micah's
> > > > >> >> > > > > dictionary
> > > > >> >> > > > > > > > > concerns,
> > > > >> >> > > > > > > > > > > now, too.
> > > > >> >> > > > > > > > > > >
> > > > >> >> > > > > > > > > > > > Side Question: Why isn't the IPC stream
> > > format a series of
> > > > >> >> > > > > the flight
> > > > >> >> > > > > > > > > > > > > protobufs? It's a real shame that there is
> > > no standard
> > > > >> >> > > > way
> > > > >> >> > > > > to
> > > > >> >> > > > > > > > > > > > > capture/replay a stream with app_metadata.
> > > (Obviously
> > > > >> >> > > > > ignoring the
> > > > >> >> > > > > > > > > > > > > annoyances around protobuf wrapping
> > > flatbuffers.)
> > > > >> >> > > > > > > > > > > >
> > > > >> >> > > > > > > > > > > > The IPC format was defined long before
> > > Flight, and Flight's
> > > > >> >> > > > > > > > > app_metadata
> > > > >> >> > > > > > > > > > > > was added after Flight's initial definition.
> > > Note an IPC
> > > > >> >> > > > > message does
> > > > >> >> > > > > > > > > > > have
> > > > >> >> > > > > > > > > > > > a provision for key-value metadata, though I
> > > think APIs for
> > > > >> >> > > > > that are
> > > > >> >> > > > > > > > > not
> > > > >> >> > > > > > > > > > > > fully exposed. (See ARROW-6940:
> > > > >> >> > > > > > > > > > > >
> > > https://issues.apache.org/jira/browse/ARROW-6940 and
> > > > >> >> > > > > despite my
> > > > >> >> > > > > > > > > comments
> > > > >> >> > > > > > > > > > > > there perhaps we need to unify or at least
> > > consider how
> > > > >> >> > > > > Flight's
> > > > >> >> > > > > > > > > > > > app_metadata relates to the IPC message
> > > custom_metadata.
> > > > >> >> > > > Also
> > > > >> >> > > > > > > > > perhaps see
> > > > >> >> > > > > > > > > > > > ARROW-1059.)
> > > > >> >> > > > > > > > > > > >
> > > > >> >> > > > > > > > > > >
> > > > >> >> > > > > > > > > > > KeyValue unfortunately is string to string. In
> > > flatbuffer
> > > > >> >> > > > > strings are
> > > > >> >> > > > > > > > > only
> > > > >> >> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the
> > > other hand is
> > > > >> >> > > > > opaque
> > > > >> >> > > > > > > > > bytes.
> > > > >> >> > > > > > > > > > > The latter is a bit more useful.
> > > > >> >> > > > > > > > > > >
> > > > >> >> > > > > > > > > > > --
> > > > >> >> > > > > > > > > > >
> > > > >> >> > > > > > > > > >
> > > > >> >> > > > > > > > >
> > > > >> >> > > > > > >
> > > > >> >> > > > >
> > > > >> >> > > >
> > > > >> >> > > >
> > > > >> >> > > > --
> > > > >> >> > > >
> > >
> > 
> 

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by David Li <li...@apache.org>.
This isn't too thought out yet but:
 1.  Any file which stuffs binary data into the value is already unreadable for anyone directly using Flatbuffers. So we can specify that the field must be valid UTF-8, but implementations can permit relaxed validation/reading as binary data instead in order to read old files for the current metadata version.
 2. For the new version, we can add a [byte] field to KeyValueMetadata alongside the string field. Since strings in Flatbuffers don't include the null terminator in their length, implementations can reuse the offset to allow values which are valid UTF-8 to be placed in both fields at a cost of only 4 bytes, so old consumers can still see UTF-8 metadata values. Actual binary values would only exist in the new field, and old consumers would see a null value (since we didn't declare it required); if both fields have a value, the binary value takes precedence. I think this at least leaves our read APIs alone since they already use bytestrings.
A current extension type that uses binary metadata would still be a problem, though…

-David

On Tue, Aug 17, 2021, at 00:08, Micah Kornfield wrote:
> I agree with you any thoughts on a way forward for at least hardening the
> spec (or should this be done at the same time as adding the new field)?
> 
> On Mon, Aug 16, 2021 at 1:45 AM Wes McKinney <we...@gmail.com> wrote:
> 
> > I've been poking around the project, and I'm growing concerned that
> > our use of the KeyValue field has already been non-compliant in many
> > cases since we do not validate UTF8-ness. Since we also use KeyValue
> > to handle opaque data serialization for extension types [1], the fact
> > that the specification does not clarify that binary data (such as the
> > output of ExtensionType::Serialize) must be base64-encoded or similar
> > makes this a bit of a minefield at the moment.
> >
> > It seems that there are no particularly excellent solutions, and
> > maintaining the status quo (having now identified these
> > inconsistencies / vagueries in at least the C++ implementation) is
> > probably not a good idea either. In Parquet, where we store a
> > serialized binary Arrow schema, we have to base64-encode [2]
> >
> > [1]:
> > https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> > [2]:
> > https://github.com/apache/arrow/blob/e990d177b1f1dec962315487682f613d46be573c/cpp/src/parquet/arrow/writer.cc#L423
> >
> > On Tue, Aug 10, 2021 at 3:27 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > Ah, that's definitely a no-go then (I believe we verify messages
> > > unconditionally in C++). That's unfortunate (and I feel responsible
> > > for missing this, but I suppose we had a lot of opportunities to fix
> > > it prior to the 1.0.0 format version) — so to have actual binary
> > > values (which was the intention in the first place for the metadata)
> > > we would need to add a new metadata field.
> > >
> > > On Tue, Aug 10, 2021 at 6:53 AM Micah Kornfield <em...@gmail.com>
> > wrote:
> > > >
> > > > One issue with changing it to byte is it would effectively break any
> > reader that is validating flatbuffer data, because flatbuffers verifies
> > null termination [1].  While this might comply with forward compatibility
> > guarantees it seems like a pretty large blast radius.
> > > >
> > > > [1]
> > https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L2457
> > > >
> > > > On Mon, Jul 12, 2021 at 12:38 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >>
> > > >> pyarrow at least treats the KeyValue values as binary and not UTF-8.
> > > >>
> > > >> On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield <
> > emkornfield@gmail.com> wrote:
> > > >> >
> > > >> > I think other languages (e.g. java, python) might make more of
> > distinction between utf-8 compatible strings and raw bytes.  For python it
> > might be less of a concern if the c++ wrapper already makes the value field
> > look like a bytes field
> > > >> >
> > > >> > On Sunday, July 11, 2021, Wes McKinney <we...@gmail.com> wrote:
> > > >> >>
> > > >> >> We could certainly "upgrade" KeyValue to have a binary value field
> > > >> >> everywhere KeyValue is used, but there is some risk of code in the
> > > >> >> wild expecting there to be a null terminator after the string data.
> > > >> >> The Flatbuffers-generated accessor APIs do not depend on the
> > existence
> > > >> >> of the null terminator, though. Not ideal, but I would not be
> > thrilled
> > > >> >> about adding an extra [ BinaryKeyValue ] everyplace we currently
> > have
> > > >> >> [ KeyValue ].
> > > >> >>
> > > >> >> That said, I doubt that we have any endogenous forward
> > compatibility
> > > >> >> problems related to this in Apache Arrow-maintained libraries, the
> > > >> >> risk would come from users who are interacting with the Flatbuffers
> > > >> >> data manually / without using one of our libraries. We could
> > implement
> > > >> >> the changes and run a set of forward compatibility integration
> > tests
> > > >> >> to see if anyone of our released libraries have an issue.
> > > >> >>
> > > >> >> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >> >> >
> > > >> >> > The cost of an empty vector in Flatbuffers appears to be 4 bytes.
> > > >> >> >
> > > >> >> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <
> > emkornfield@gmail.com> wrote:
> > > >> >> > >
> > > >> >> > > Retitling and forking the discussion to talk about key value
> > pairs.
> > > >> >> > >
> > > >> >> > > What is the byte cost of an empty list?  Another option would
> > be to
> > > >> >> > > introduce a new BinaryKeyValue table and add binary metadata.
> > > >> >> > >
> > > >> >> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
> > > >> >> > > natebauernfeind@deephaven.io> wrote:
> > > >> >> > >
> > > >> >> > > > Deephaven and I are very supportive of "upgrading" the value
> > half of the kv
> > > >> >> > > > pair to a byte vector. What is the best way to find out if
> > there is
> > > >> >> > > > sufficient interest?
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > > I've been stewing on the ideas here around schema evolution,
> > and I realize
> > > >> >> > > > the specific feature I am missing is the ability to encode
> > that a field
> > > >> >> > > > (i.e. its FieldNode and accompanying Buffers in the
> > RecordBatch) is
> > > >> >> > > > empty/has-no-data in O(0) cost (yes; for free).
> > > >> >> > > >
> > > >> >> > > > Might there be interest in adding a "field_id" to the
> > FieldNode (which is
> > > >> >> > > > encoded on the RecordBatch flatbuffer)? I see a simple
> > forward-compatible
> > > >> >> > > > upgrade (by either keying off of 0, or explicitly set the
> > field default to
> > > >> >> > > > -1) which would allow the sender to "skip" fields that have
> > 1) FieldNode
> > > >> >> > > > length of zero, and 2) all Buffer's associated at that level
> > (and further
> > > >> >> > > > nested) are also equally empty (i.e. Buffer length is zero).
> > > >> >> > > >
> > > >> >> > > > I understand this concept slightly interferes with
> > RecordBatch's `length`
> > > >> >> > > > field, and that many implementations use that length to
> > resize the
> > > >> >> > > > root-level FieldNodes. The use-case I have in mind has
> > different logical
> > > >> >> > > > lengths per field node; current implementations require
> > sending a
> > > >> >> > > > RecordBatch length of the max length across all root level
> > field nodes. I
> > > >> >> > > > believe this requires a copy of data whenever a field node
> > is too short; I
> > > >> >> > > > don't know if there is a decent solution to this slight
> > inefficiency. I am
> > > >> >> > > > bringing it up because if "skipping a field node when it is
> > empty" is a
> > > >> >> > > > feature, then we may not want to allocate space for those
> > nodes given that
> > > >> >> > > > the record batch length will likely be greater than zero.
> > > >> >> > > >
> > > >> >> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <
> > wesmckinn@gmail.com> wrote:
> > > >> >> > > >
> > > >> >> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <
> > apache@lidavidm.me> wrote:
> > > >> >> > > > > >
> > > >> >> > > > > > From the Flatbuffers internals doc[1] it appears they
> > are the same:
> > > >> >> > > > > "Strings are simply a vector of bytes, and are always
> > null-terminated."
> > > >> >> > > > >
> > > >> >> > > > > I see. I took a look at flatbuffers.h, and it appears that
> > changing
> > > >> >> > > > > this field from string to [byte] would be
> > backward-compatible and
> > > >> >> > > > > forward-compatible except with code that expects a null
> > terminator.
> > > >> >> > > > > This is something we could discuss separately if there
> > were enough
> > > >> >> > > > > interest.
> > > >> >> > > > >
> > > >> >> > > > > > [1]:
> > https://google.github.io/flatbuffers/flatbuffers_internals.html
> > > >> >> > > > > >
> > > >> >> > > > > > -David
> > > >> >> > > > > >
> > > >> >> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > > >> >> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> > > >> >> > > > emkornfield@gmail.com>
> > > >> >> > > > > wrote:
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > Right, I had wanted to focus the discussion on
> > Flight as I think
> > > >> >> > > > > schema
> > > >> >> > > > > > > > > evolution or multiplexing streams (more so the
> > latter) is a
> > > >> >> > > > > property of the
> > > >> >> > > > > > > > > transport and not the stream format itself. If we
> > are leaning
> > > >> >> > > > > towards just
> > > >> >> > > > > > > > > schema evolution then maybe it makes sense to
> > discuss it for the
> > > >> >> > > > > IPC stream
> > > >> >> > > > > > > > > format and leverage that in Flight. I'd be
> > interested in what
> > > >> >> > > > > others think.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > I tend to agree, I think stream multiplexing is
> > likely a transport
> > > >> >> > > > > level
> > > >> >> > > > > > > > issue.  IMO I think schema evolution should be
> > consistent with the
> > > >> >> > > > > IPC
> > > >> >> > > > > > > > stream format  and flight.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > > Nate: it may be worth starting a separate
> > discussion about more
> > > >> >> > > > > general
> > > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why
> > key-value
> > > >> >> > > > > metadata was
> > > >> >> > > > > > > > > chosen/if opaque bytes were considered in the past.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > I think  this was an unfortunate design of the key
> > value metadata
> > > >> >> > > > in
> > > >> >> > > > > > > > Schema.fbs, but I don't think I was around when this
> > decision was
> > > >> >> > > > > made.
> > > >> >> > > > > > >
> > > >> >> > > > > > > I agree that it's unfortunate that we did not use [
> > byte ] instead of
> > > >> >> > > > > > > string for the value in the KeyValue metadata — I
> > think this was more
> > > >> >> > > > > > > of an oversight than a deliberate choice (e.g. it was
> > not our intent
> > > >> >> > > > > > > to require binary data to be base64-encoded — this is
> > something that
> > > >> >> > > > > > > we have to do when encoding binary data in Thrift
> > KeyValue metadata
> > > >> >> > > > > > > for Parquet, for example). Is the binary
> > representation of [byte]
> > > >> >> > > > > > > different from string?
> > > >> >> > > > > > >
> > > >> >> > > > > > >
> > > >> >> > > > > > >
> > > >> >> > > > > > > > Side Question: Why isn't the IPC stream format a
> > series of the
> > > >> >> > > > flight
> > > >> >> > > > > > > > > protobufs?
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > In addition to what David said, protobufs can't be
> > read directly
> > > >> >> > > > > from a
> > > >> >> > > > > > > > memory-mapped file (they need decoding).  This was
> > one of the
> > > >> >> > > > design
> > > >> >> > > > > > > > considerations of using flatbuffers and IPC
> > Stream/File format.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > I was thinking Micah's comment is more that whatever
> > we do, it
> > > >> >> > > > > should be
> > > >> >> > > > > > > > > clearly specified and edge cases should be
> > considered, especially
> > > >> >> > > > > if we
> > > >> >> > > > > > > > > might want to 'backport' this into the stream
> > format later.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > Yes, for dictionaries we just need to be careful to
> > define
> > > >> >> > > > semantics
> > > >> >> > > > > and
> > > >> >> > > > > > > > ensure implementations are validating them with
> > regards to
> > > >> >> > > > > dictionaries.
> > > >> >> > > > > > > > There likely isn't any need to change current
> > implementations
> > > >> >> > > > though.
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <
> > lidavidm@apache.org>
> > > >> >> > > > > wrote:
> > > >> >> > > > > > > >
> > > >> >> > > > > > > > > Right, I had wanted to focus the discussion on
> > Flight as I think
> > > >> >> > > > > schema
> > > >> >> > > > > > > > > evolution or multiplexing streams (more so the
> > latter) is a
> > > >> >> > > > > property of the
> > > >> >> > > > > > > > > transport and not the stream format itself. If we
> > are leaning
> > > >> >> > > > > towards just
> > > >> >> > > > > > > > > schema evolution then maybe it makes sense to
> > discuss it for the
> > > >> >> > > > > IPC stream
> > > >> >> > > > > > > > > format and leverage that in Flight. I'd be
> > interested in what
> > > >> >> > > > > others think.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > Especially if we are looking at multiplexing
> > streams - I would
> > > >> >> > > > > wonder if
> > > >> >> > > > > > > > > that's actually better served by making it easier
> > to implement
> > > >> >> > > > > using the
> > > >> >> > > > > > > > > Flight implementation as it stands (by managing
> > concurrent RPC
> > > >> >> > > > > calls and/or
> > > >> >> > > > > > > > > performing the union-of-structs encoding trick for
> > you), instead
> > > >> >> > > > > of having
> > > >> >> > > > > > > > > to change the protocol.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > Nate: it may be worth starting a separate
> > discussion about more
> > > >> >> > > > > general
> > > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why
> > key-value
> > > >> >> > > > > metadata was
> > > >> >> > > > > > > > > chosen/if opaque bytes were considered in the
> > past. Off the top
> > > >> >> > > > of
> > > >> >> > > > > my head
> > > >> >> > > > > > > > > if it's for on-disk storage and fully
> > application-defined it may
> > > >> >> > > > > make sense
> > > >> >> > > > > > > > > to store as a separate file alongside the Arrow
> > file (indexed by
> > > >> >> > > > > record
> > > >> >> > > > > > > > > batch index) where you can take advantage of
> > whatever format is
> > > >> >> > > > > most
> > > >> >> > > > > > > > > suitable.
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > -David
> > > >> >> > > > > > > > >
> > > >> >> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan
> > wrote:
> > > >> >> > > > > > > > > > Hi guys,
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial
> > suggestion was
> > > >> >> > > > to
> > > >> >> > > > > add this
> > > >> >> > > > > > > > > > feature starting from the IPC(I moved initial
> > write up steps to
> > > >> >> > > > > the
> > > >> >> > > > > > > > > bottom
> > > >> >> > > > > > > > > > of the doc). Afterwards David suggested focusing
> > on Flight and
> > > >> >> > > > > that's how
> > > >> >> > > > > > > > > > we ended up with the protobufs change in the
> > proposal. This
> > > >> >> > > > > being said I
> > > >> >> > > > > > > > > do
> > > >> >> > > > > > > > > > think that the place where this should be
> > impemented is a good
> > > >> >> > > > > question
> > > >> >> > > > > > > > > on
> > > >> >> > > > > > > > > > its own. Maybe it makes sense to have this kind
> > of a feature in
> > > >> >> > > > > IPC and
> > > >> >> > > > > > > > > > somehow use it in Flight, maybe not.
> > > >> >> > > > > > > > > > 2. The point about dictionaries deserves a
> > dedicated section in
> > > >> >> > > > > the
> > > >> >> > > > > > > > > > proposal. Nate and David brought it up and
> > shared some
> > > >> >> > > > insights.
> > > >> >> > > > > I'll try
> > > >> >> > > > > > > > > > to aggregate them and we can continue the
> > discussion form
> > > >> >> > > > there.
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > Cheers,
> > > >> >> > > > > > > > > > Gosh
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> > > >> >> > > > > > > > > natebauernfeind@deephaven.io>
> > > >> >> > > > > > > > > > wrote:
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > > > > >
> > > >> >> > > > > > > > > > > > > > makes it more difficult to bring schema
> > evolution back
> > > >> >> > > > > into the
> > > >> >> > > > > > > > > > > > > > IPC Stream format (i.e. it would live
> > only in flight)
> > > >> >> > > > > > > > > > > > >
> > > >> >> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer
> > structures not the
> > > >> >> > > > > > > > > protobufs.
> > > >> >> > > > > > > > > > > Can
> > > >> >> > > > > > > > > > > > > you help me understand how difficult it
> > would be to bring
> > > >> >> > > > > the
> > > >> >> > > > > > > > > > > `schema_id`
> > > >> >> > > > > > > > > > > > > approach to the IPC stream format?
> > > >> >> > > > > > > > > > > >
> > > >> >> > > > > > > > > > > > I thought we were talking solely about the
> > Flight Protobuf
> > > >> >> > > > > > > > > definitions -
> > > >> >> > > > > > > > > > > > not the Flatbuffers (and the Google doc at
> > least only talks
> > > >> >> > > > > about the
> > > >> >> > > > > > > > > > > > Protobufs).
> > > >> >> > > > > > > > > > > >
> > > >> >> > > > > > > > > > >
> > > >> >> > > > > > > > > > > I somehow missed that schema_id is being added
> > to protobuf in
> > > >> >> > > > > the
> > > >> >> > > > > > > > > document.
> > > >> >> > > > > > > > > > > It feels to me that the schema_id is a
> > property that would
> > > >> >> > > > > ideally only
> > > >> >> > > > > > > > > > > apply to the RecordBatch. I better understand
> > Micah's
> > > >> >> > > > > dictionary
> > > >> >> > > > > > > > > concerns,
> > > >> >> > > > > > > > > > > now, too.
> > > >> >> > > > > > > > > > >
> > > >> >> > > > > > > > > > > > Side Question: Why isn't the IPC stream
> > format a series of
> > > >> >> > > > > the flight
> > > >> >> > > > > > > > > > > > > protobufs? It's a real shame that there is
> > no standard
> > > >> >> > > > way
> > > >> >> > > > > to
> > > >> >> > > > > > > > > > > > > capture/replay a stream with app_metadata.
> > (Obviously
> > > >> >> > > > > ignoring the
> > > >> >> > > > > > > > > > > > > annoyances around protobuf wrapping
> > flatbuffers.)
> > > >> >> > > > > > > > > > > >
> > > >> >> > > > > > > > > > > > The IPC format was defined long before
> > Flight, and Flight's
> > > >> >> > > > > > > > > app_metadata
> > > >> >> > > > > > > > > > > > was added after Flight's initial definition.
> > Note an IPC
> > > >> >> > > > > message does
> > > >> >> > > > > > > > > > > have
> > > >> >> > > > > > > > > > > > a provision for key-value metadata, though I
> > think APIs for
> > > >> >> > > > > that are
> > > >> >> > > > > > > > > not
> > > >> >> > > > > > > > > > > > fully exposed. (See ARROW-6940:
> > > >> >> > > > > > > > > > > >
> > https://issues.apache.org/jira/browse/ARROW-6940 and
> > > >> >> > > > > despite my
> > > >> >> > > > > > > > > comments
> > > >> >> > > > > > > > > > > > there perhaps we need to unify or at least
> > consider how
> > > >> >> > > > > Flight's
> > > >> >> > > > > > > > > > > > app_metadata relates to the IPC message
> > custom_metadata.
> > > >> >> > > > Also
> > > >> >> > > > > > > > > perhaps see
> > > >> >> > > > > > > > > > > > ARROW-1059.)
> > > >> >> > > > > > > > > > > >
> > > >> >> > > > > > > > > > >
> > > >> >> > > > > > > > > > > KeyValue unfortunately is string to string. In
> > flatbuffer
> > > >> >> > > > > strings are
> > > >> >> > > > > > > > > only
> > > >> >> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the
> > other hand is
> > > >> >> > > > > opaque
> > > >> >> > > > > > > > > bytes.
> > > >> >> > > > > > > > > > > The latter is a bit more useful.
> > > >> >> > > > > > > > > > >
> > > >> >> > > > > > > > > > > --
> > > >> >> > > > > > > > > > >
> > > >> >> > > > > > > > > >
> > > >> >> > > > > > > > >
> > > >> >> > > > > > >
> > > >> >> > > > >
> > > >> >> > > >
> > > >> >> > > >
> > > >> >> > > > --
> > > >> >> > > >
> >
> 

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Micah Kornfield <em...@gmail.com>.
I agree with you any thoughts on a way forward for at least hardening the
spec (or should this be done at the same time as adding the new field)?

On Mon, Aug 16, 2021 at 1:45 AM Wes McKinney <we...@gmail.com> wrote:

> I've been poking around the project, and I'm growing concerned that
> our use of the KeyValue field has already been non-compliant in many
> cases since we do not validate UTF8-ness. Since we also use KeyValue
> to handle opaque data serialization for extension types [1], the fact
> that the specification does not clarify that binary data (such as the
> output of ExtensionType::Serialize) must be base64-encoded or similar
> makes this a bit of a minefield at the moment.
>
> It seems that there are no particularly excellent solutions, and
> maintaining the status quo (having now identified these
> inconsistencies / vagueries in at least the C++ implementation) is
> probably not a good idea either. In Parquet, where we store a
> serialized binary Arrow schema, we have to base64-encode [2]
>
> [1]:
> https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
> [2]:
> https://github.com/apache/arrow/blob/e990d177b1f1dec962315487682f613d46be573c/cpp/src/parquet/arrow/writer.cc#L423
>
> On Tue, Aug 10, 2021 at 3:27 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > Ah, that's definitely a no-go then (I believe we verify messages
> > unconditionally in C++). That's unfortunate (and I feel responsible
> > for missing this, but I suppose we had a lot of opportunities to fix
> > it prior to the 1.0.0 format version) — so to have actual binary
> > values (which was the intention in the first place for the metadata)
> > we would need to add a new metadata field.
> >
> > On Tue, Aug 10, 2021 at 6:53 AM Micah Kornfield <em...@gmail.com>
> wrote:
> > >
> > > One issue with changing it to byte is it would effectively break any
> reader that is validating flatbuffer data, because flatbuffers verifies
> null termination [1].  While this might comply with forward compatibility
> guarantees it seems like a pretty large blast radius.
> > >
> > > [1]
> https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L2457
> > >
> > > On Mon, Jul 12, 2021 at 12:38 PM Wes McKinney <we...@gmail.com>
> wrote:
> > >>
> > >> pyarrow at least treats the KeyValue values as binary and not UTF-8.
> > >>
> > >> On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield <
> emkornfield@gmail.com> wrote:
> > >> >
> > >> > I think other languages (e.g. java, python) might make more of
> distinction between utf-8 compatible strings and raw bytes.  For python it
> might be less of a concern if the c++ wrapper already makes the value field
> look like a bytes field
> > >> >
> > >> > On Sunday, July 11, 2021, Wes McKinney <we...@gmail.com> wrote:
> > >> >>
> > >> >> We could certainly "upgrade" KeyValue to have a binary value field
> > >> >> everywhere KeyValue is used, but there is some risk of code in the
> > >> >> wild expecting there to be a null terminator after the string data.
> > >> >> The Flatbuffers-generated accessor APIs do not depend on the
> existence
> > >> >> of the null terminator, though. Not ideal, but I would not be
> thrilled
> > >> >> about adding an extra [ BinaryKeyValue ] everyplace we currently
> have
> > >> >> [ KeyValue ].
> > >> >>
> > >> >> That said, I doubt that we have any endogenous forward
> compatibility
> > >> >> problems related to this in Apache Arrow-maintained libraries, the
> > >> >> risk would come from users who are interacting with the Flatbuffers
> > >> >> data manually / without using one of our libraries. We could
> implement
> > >> >> the changes and run a set of forward compatibility integration
> tests
> > >> >> to see if anyone of our released libraries have an issue.
> > >> >>
> > >> >> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <we...@gmail.com>
> wrote:
> > >> >> >
> > >> >> > The cost of an empty vector in Flatbuffers appears to be 4 bytes.
> > >> >> >
> > >> >> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <
> emkornfield@gmail.com> wrote:
> > >> >> > >
> > >> >> > > Retitling and forking the discussion to talk about key value
> pairs.
> > >> >> > >
> > >> >> > > What is the byte cost of an empty list?  Another option would
> be to
> > >> >> > > introduce a new BinaryKeyValue table and add binary metadata.
> > >> >> > >
> > >> >> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
> > >> >> > > natebauernfeind@deephaven.io> wrote:
> > >> >> > >
> > >> >> > > > Deephaven and I are very supportive of "upgrading" the value
> half of the kv
> > >> >> > > > pair to a byte vector. What is the best way to find out if
> there is
> > >> >> > > > sufficient interest?
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > I've been stewing on the ideas here around schema evolution,
> and I realize
> > >> >> > > > the specific feature I am missing is the ability to encode
> that a field
> > >> >> > > > (i.e. its FieldNode and accompanying Buffers in the
> RecordBatch) is
> > >> >> > > > empty/has-no-data in O(0) cost (yes; for free).
> > >> >> > > >
> > >> >> > > > Might there be interest in adding a "field_id" to the
> FieldNode (which is
> > >> >> > > > encoded on the RecordBatch flatbuffer)? I see a simple
> forward-compatible
> > >> >> > > > upgrade (by either keying off of 0, or explicitly set the
> field default to
> > >> >> > > > -1) which would allow the sender to "skip" fields that have
> 1) FieldNode
> > >> >> > > > length of zero, and 2) all Buffer's associated at that level
> (and further
> > >> >> > > > nested) are also equally empty (i.e. Buffer length is zero).
> > >> >> > > >
> > >> >> > > > I understand this concept slightly interferes with
> RecordBatch's `length`
> > >> >> > > > field, and that many implementations use that length to
> resize the
> > >> >> > > > root-level FieldNodes. The use-case I have in mind has
> different logical
> > >> >> > > > lengths per field node; current implementations require
> sending a
> > >> >> > > > RecordBatch length of the max length across all root level
> field nodes. I
> > >> >> > > > believe this requires a copy of data whenever a field node
> is too short; I
> > >> >> > > > don't know if there is a decent solution to this slight
> inefficiency. I am
> > >> >> > > > bringing it up because if "skipping a field node when it is
> empty" is a
> > >> >> > > > feature, then we may not want to allocate space for those
> nodes given that
> > >> >> > > > the record batch length will likely be greater than zero.
> > >> >> > > >
> > >> >> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <
> wesmckinn@gmail.com> wrote:
> > >> >> > > >
> > >> >> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <
> apache@lidavidm.me> wrote:
> > >> >> > > > > >
> > >> >> > > > > > From the Flatbuffers internals doc[1] it appears they
> are the same:
> > >> >> > > > > "Strings are simply a vector of bytes, and are always
> null-terminated."
> > >> >> > > > >
> > >> >> > > > > I see. I took a look at flatbuffers.h, and it appears that
> changing
> > >> >> > > > > this field from string to [byte] would be
> backward-compatible and
> > >> >> > > > > forward-compatible except with code that expects a null
> terminator.
> > >> >> > > > > This is something we could discuss separately if there
> were enough
> > >> >> > > > > interest.
> > >> >> > > > >
> > >> >> > > > > > [1]:
> https://google.github.io/flatbuffers/flatbuffers_internals.html
> > >> >> > > > > >
> > >> >> > > > > > -David
> > >> >> > > > > >
> > >> >> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > >> >> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> > >> >> > > > emkornfield@gmail.com>
> > >> >> > > > > wrote:
> > >> >> > > > > > > >
> > >> >> > > > > > > > >
> > >> >> > > > > > > > > Right, I had wanted to focus the discussion on
> Flight as I think
> > >> >> > > > > schema
> > >> >> > > > > > > > > evolution or multiplexing streams (more so the
> latter) is a
> > >> >> > > > > property of the
> > >> >> > > > > > > > > transport and not the stream format itself. If we
> are leaning
> > >> >> > > > > towards just
> > >> >> > > > > > > > > schema evolution then maybe it makes sense to
> discuss it for the
> > >> >> > > > > IPC stream
> > >> >> > > > > > > > > format and leverage that in Flight. I'd be
> interested in what
> > >> >> > > > > others think.
> > >> >> > > > > > > >
> > >> >> > > > > > > > I tend to agree, I think stream multiplexing is
> likely a transport
> > >> >> > > > > level
> > >> >> > > > > > > > issue.  IMO I think schema evolution should be
> consistent with the
> > >> >> > > > > IPC
> > >> >> > > > > > > > stream format  and flight.
> > >> >> > > > > > > >
> > >> >> > > > > > > >
> > >> >> > > > > > > > > Nate: it may be worth starting a separate
> discussion about more
> > >> >> > > > > general
> > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why
> key-value
> > >> >> > > > > metadata was
> > >> >> > > > > > > > > chosen/if opaque bytes were considered in the past.
> > >> >> > > > > > > >
> > >> >> > > > > > > >
> > >> >> > > > > > > > I think  this was an unfortunate design of the key
> value metadata
> > >> >> > > > in
> > >> >> > > > > > > > Schema.fbs, but I don't think I was around when this
> decision was
> > >> >> > > > > made.
> > >> >> > > > > > >
> > >> >> > > > > > > I agree that it's unfortunate that we did not use [
> byte ] instead of
> > >> >> > > > > > > string for the value in the KeyValue metadata — I
> think this was more
> > >> >> > > > > > > of an oversight than a deliberate choice (e.g. it was
> not our intent
> > >> >> > > > > > > to require binary data to be base64-encoded — this is
> something that
> > >> >> > > > > > > we have to do when encoding binary data in Thrift
> KeyValue metadata
> > >> >> > > > > > > for Parquet, for example). Is the binary
> representation of [byte]
> > >> >> > > > > > > different from string?
> > >> >> > > > > > >
> > >> >> > > > > > >
> > >> >> > > > > > >
> > >> >> > > > > > > > Side Question: Why isn't the IPC stream format a
> series of the
> > >> >> > > > flight
> > >> >> > > > > > > > > protobufs?
> > >> >> > > > > > > >
> > >> >> > > > > > > > In addition to what David said, protobufs can't be
> read directly
> > >> >> > > > > from a
> > >> >> > > > > > > > memory-mapped file (they need decoding).  This was
> one of the
> > >> >> > > > design
> > >> >> > > > > > > > considerations of using flatbuffers and IPC
> Stream/File format.
> > >> >> > > > > > > >
> > >> >> > > > > > > > I was thinking Micah's comment is more that whatever
> we do, it
> > >> >> > > > > should be
> > >> >> > > > > > > > > clearly specified and edge cases should be
> considered, especially
> > >> >> > > > > if we
> > >> >> > > > > > > > > might want to 'backport' this into the stream
> format later.
> > >> >> > > > > > > >
> > >> >> > > > > > > >
> > >> >> > > > > > > > Yes, for dictionaries we just need to be careful to
> define
> > >> >> > > > semantics
> > >> >> > > > > and
> > >> >> > > > > > > > ensure implementations are validating them with
> regards to
> > >> >> > > > > dictionaries.
> > >> >> > > > > > > > There likely isn't any need to change current
> implementations
> > >> >> > > > though.
> > >> >> > > > > > > >
> > >> >> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <
> lidavidm@apache.org>
> > >> >> > > > > wrote:
> > >> >> > > > > > > >
> > >> >> > > > > > > > > Right, I had wanted to focus the discussion on
> Flight as I think
> > >> >> > > > > schema
> > >> >> > > > > > > > > evolution or multiplexing streams (more so the
> latter) is a
> > >> >> > > > > property of the
> > >> >> > > > > > > > > transport and not the stream format itself. If we
> are leaning
> > >> >> > > > > towards just
> > >> >> > > > > > > > > schema evolution then maybe it makes sense to
> discuss it for the
> > >> >> > > > > IPC stream
> > >> >> > > > > > > > > format and leverage that in Flight. I'd be
> interested in what
> > >> >> > > > > others think.
> > >> >> > > > > > > > >
> > >> >> > > > > > > > > Especially if we are looking at multiplexing
> streams - I would
> > >> >> > > > > wonder if
> > >> >> > > > > > > > > that's actually better served by making it easier
> to implement
> > >> >> > > > > using the
> > >> >> > > > > > > > > Flight implementation as it stands (by managing
> concurrent RPC
> > >> >> > > > > calls and/or
> > >> >> > > > > > > > > performing the union-of-structs encoding trick for
> you), instead
> > >> >> > > > > of having
> > >> >> > > > > > > > > to change the protocol.
> > >> >> > > > > > > > >
> > >> >> > > > > > > > > Nate: it may be worth starting a separate
> discussion about more
> > >> >> > > > > general
> > >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why
> key-value
> > >> >> > > > > metadata was
> > >> >> > > > > > > > > chosen/if opaque bytes were considered in the
> past. Off the top
> > >> >> > > > of
> > >> >> > > > > my head
> > >> >> > > > > > > > > if it's for on-disk storage and fully
> application-defined it may
> > >> >> > > > > make sense
> > >> >> > > > > > > > > to store as a separate file alongside the Arrow
> file (indexed by
> > >> >> > > > > record
> > >> >> > > > > > > > > batch index) where you can take advantage of
> whatever format is
> > >> >> > > > > most
> > >> >> > > > > > > > > suitable.
> > >> >> > > > > > > > >
> > >> >> > > > > > > > > -David
> > >> >> > > > > > > > >
> > >> >> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan
> wrote:
> > >> >> > > > > > > > > > Hi guys,
> > >> >> > > > > > > > > >
> > >> >> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial
> suggestion was
> > >> >> > > > to
> > >> >> > > > > add this
> > >> >> > > > > > > > > > feature starting from the IPC(I moved initial
> write up steps to
> > >> >> > > > > the
> > >> >> > > > > > > > > bottom
> > >> >> > > > > > > > > > of the doc). Afterwards David suggested focusing
> on Flight and
> > >> >> > > > > that's how
> > >> >> > > > > > > > > > we ended up with the protobufs change in the
> proposal. This
> > >> >> > > > > being said I
> > >> >> > > > > > > > > do
> > >> >> > > > > > > > > > think that the place where this should be
> impemented is a good
> > >> >> > > > > question
> > >> >> > > > > > > > > on
> > >> >> > > > > > > > > > its own. Maybe it makes sense to have this kind
> of a feature in
> > >> >> > > > > IPC and
> > >> >> > > > > > > > > > somehow use it in Flight, maybe not.
> > >> >> > > > > > > > > > 2. The point about dictionaries deserves a
> dedicated section in
> > >> >> > > > > the
> > >> >> > > > > > > > > > proposal. Nate and David brought it up and
> shared some
> > >> >> > > > insights.
> > >> >> > > > > I'll try
> > >> >> > > > > > > > > > to aggregate them and we can continue the
> discussion form
> > >> >> > > > there.
> > >> >> > > > > > > > > >
> > >> >> > > > > > > > > > Cheers,
> > >> >> > > > > > > > > > Gosh
> > >> >> > > > > > > > > >
> > >> >> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> > >> >> > > > > > > > > natebauernfeind@deephaven.io>
> > >> >> > > > > > > > > > wrote:
> > >> >> > > > > > > > > >
> > >> >> > > > > > > > > > > >
> > >> >> > > > > > > > > > > > > > makes it more difficult to bring schema
> evolution back
> > >> >> > > > > into the
> > >> >> > > > > > > > > > > > > > IPC Stream format (i.e. it would live
> only in flight)
> > >> >> > > > > > > > > > > > >
> > >> >> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer
> structures not the
> > >> >> > > > > > > > > protobufs.
> > >> >> > > > > > > > > > > Can
> > >> >> > > > > > > > > > > > > you help me understand how difficult it
> would be to bring
> > >> >> > > > > the
> > >> >> > > > > > > > > > > `schema_id`
> > >> >> > > > > > > > > > > > > approach to the IPC stream format?
> > >> >> > > > > > > > > > > >
> > >> >> > > > > > > > > > > > I thought we were talking solely about the
> Flight Protobuf
> > >> >> > > > > > > > > definitions -
> > >> >> > > > > > > > > > > > not the Flatbuffers (and the Google doc at
> least only talks
> > >> >> > > > > about the
> > >> >> > > > > > > > > > > > Protobufs).
> > >> >> > > > > > > > > > > >
> > >> >> > > > > > > > > > >
> > >> >> > > > > > > > > > > I somehow missed that schema_id is being added
> to protobuf in
> > >> >> > > > > the
> > >> >> > > > > > > > > document.
> > >> >> > > > > > > > > > > It feels to me that the schema_id is a
> property that would
> > >> >> > > > > ideally only
> > >> >> > > > > > > > > > > apply to the RecordBatch. I better understand
> Micah's
> > >> >> > > > > dictionary
> > >> >> > > > > > > > > concerns,
> > >> >> > > > > > > > > > > now, too.
> > >> >> > > > > > > > > > >
> > >> >> > > > > > > > > > > > Side Question: Why isn't the IPC stream
> format a series of
> > >> >> > > > > the flight
> > >> >> > > > > > > > > > > > > protobufs? It's a real shame that there is
> no standard
> > >> >> > > > way
> > >> >> > > > > to
> > >> >> > > > > > > > > > > > > capture/replay a stream with app_metadata.
> (Obviously
> > >> >> > > > > ignoring the
> > >> >> > > > > > > > > > > > > annoyances around protobuf wrapping
> flatbuffers.)
> > >> >> > > > > > > > > > > >
> > >> >> > > > > > > > > > > > The IPC format was defined long before
> Flight, and Flight's
> > >> >> > > > > > > > > app_metadata
> > >> >> > > > > > > > > > > > was added after Flight's initial definition.
> Note an IPC
> > >> >> > > > > message does
> > >> >> > > > > > > > > > > have
> > >> >> > > > > > > > > > > > a provision for key-value metadata, though I
> think APIs for
> > >> >> > > > > that are
> > >> >> > > > > > > > > not
> > >> >> > > > > > > > > > > > fully exposed. (See ARROW-6940:
> > >> >> > > > > > > > > > > >
> https://issues.apache.org/jira/browse/ARROW-6940 and
> > >> >> > > > > despite my
> > >> >> > > > > > > > > comments
> > >> >> > > > > > > > > > > > there perhaps we need to unify or at least
> consider how
> > >> >> > > > > Flight's
> > >> >> > > > > > > > > > > > app_metadata relates to the IPC message
> custom_metadata.
> > >> >> > > > Also
> > >> >> > > > > > > > > perhaps see
> > >> >> > > > > > > > > > > > ARROW-1059.)
> > >> >> > > > > > > > > > > >
> > >> >> > > > > > > > > > >
> > >> >> > > > > > > > > > > KeyValue unfortunately is string to string. In
> flatbuffer
> > >> >> > > > > strings are
> > >> >> > > > > > > > > only
> > >> >> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the
> other hand is
> > >> >> > > > > opaque
> > >> >> > > > > > > > > bytes.
> > >> >> > > > > > > > > > > The latter is a bit more useful.
> > >> >> > > > > > > > > > >
> > >> >> > > > > > > > > > > --
> > >> >> > > > > > > > > > >
> > >> >> > > > > > > > > >
> > >> >> > > > > > > > >
> > >> >> > > > > > >
> > >> >> > > > >
> > >> >> > > >
> > >> >> > > >
> > >> >> > > > --
> > >> >> > > >
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Wes McKinney <we...@gmail.com>.
I've been poking around the project, and I'm growing concerned that
our use of the KeyValue field has already been non-compliant in many
cases since we do not validate UTF8-ness. Since we also use KeyValue
to handle opaque data serialization for extension types [1], the fact
that the specification does not clarify that binary data (such as the
output of ExtensionType::Serialize) must be base64-encoded or similar
makes this a bit of a minefield at the moment.

It seems that there are no particularly excellent solutions, and
maintaining the status quo (having now identified these
inconsistencies / vagueries in at least the C++ implementation) is
probably not a good idea either. In Parquet, where we store a
serialized binary Arrow schema, we have to base64-encode [2]

[1]: https://arrow.apache.org/docs/format/Columnar.html#custom-application-metadata
[2]: https://github.com/apache/arrow/blob/e990d177b1f1dec962315487682f613d46be573c/cpp/src/parquet/arrow/writer.cc#L423

On Tue, Aug 10, 2021 at 3:27 PM Wes McKinney <we...@gmail.com> wrote:
>
> Ah, that's definitely a no-go then (I believe we verify messages
> unconditionally in C++). That's unfortunate (and I feel responsible
> for missing this, but I suppose we had a lot of opportunities to fix
> it prior to the 1.0.0 format version) — so to have actual binary
> values (which was the intention in the first place for the metadata)
> we would need to add a new metadata field.
>
> On Tue, Aug 10, 2021 at 6:53 AM Micah Kornfield <em...@gmail.com> wrote:
> >
> > One issue with changing it to byte is it would effectively break any reader that is validating flatbuffer data, because flatbuffers verifies null termination [1].  While this might comply with forward compatibility guarantees it seems like a pretty large blast radius.
> >
> > [1] https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L2457
> >
> > On Mon, Jul 12, 2021 at 12:38 PM Wes McKinney <we...@gmail.com> wrote:
> >>
> >> pyarrow at least treats the KeyValue values as binary and not UTF-8.
> >>
> >> On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield <em...@gmail.com> wrote:
> >> >
> >> > I think other languages (e.g. java, python) might make more of distinction between utf-8 compatible strings and raw bytes.  For python it might be less of a concern if the c++ wrapper already makes the value field look like a bytes field
> >> >
> >> > On Sunday, July 11, 2021, Wes McKinney <we...@gmail.com> wrote:
> >> >>
> >> >> We could certainly "upgrade" KeyValue to have a binary value field
> >> >> everywhere KeyValue is used, but there is some risk of code in the
> >> >> wild expecting there to be a null terminator after the string data.
> >> >> The Flatbuffers-generated accessor APIs do not depend on the existence
> >> >> of the null terminator, though. Not ideal, but I would not be thrilled
> >> >> about adding an extra [ BinaryKeyValue ] everyplace we currently have
> >> >> [ KeyValue ].
> >> >>
> >> >> That said, I doubt that we have any endogenous forward compatibility
> >> >> problems related to this in Apache Arrow-maintained libraries, the
> >> >> risk would come from users who are interacting with the Flatbuffers
> >> >> data manually / without using one of our libraries. We could implement
> >> >> the changes and run a set of forward compatibility integration tests
> >> >> to see if anyone of our released libraries have an issue.
> >> >>
> >> >> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <we...@gmail.com> wrote:
> >> >> >
> >> >> > The cost of an empty vector in Flatbuffers appears to be 4 bytes.
> >> >> >
> >> >> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <em...@gmail.com> wrote:
> >> >> > >
> >> >> > > Retitling and forking the discussion to talk about key value pairs.
> >> >> > >
> >> >> > > What is the byte cost of an empty list?  Another option would be to
> >> >> > > introduce a new BinaryKeyValue table and add binary metadata.
> >> >> > >
> >> >> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
> >> >> > > natebauernfeind@deephaven.io> wrote:
> >> >> > >
> >> >> > > > Deephaven and I are very supportive of "upgrading" the value half of the kv
> >> >> > > > pair to a byte vector. What is the best way to find out if there is
> >> >> > > > sufficient interest?
> >> >> > > >
> >> >> > > >
> >> >> > > > I've been stewing on the ideas here around schema evolution, and I realize
> >> >> > > > the specific feature I am missing is the ability to encode that a field
> >> >> > > > (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
> >> >> > > > empty/has-no-data in O(0) cost (yes; for free).
> >> >> > > >
> >> >> > > > Might there be interest in adding a "field_id" to the FieldNode (which is
> >> >> > > > encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible
> >> >> > > > upgrade (by either keying off of 0, or explicitly set the field default to
> >> >> > > > -1) which would allow the sender to "skip" fields that have 1) FieldNode
> >> >> > > > length of zero, and 2) all Buffer's associated at that level (and further
> >> >> > > > nested) are also equally empty (i.e. Buffer length is zero).
> >> >> > > >
> >> >> > > > I understand this concept slightly interferes with RecordBatch's `length`
> >> >> > > > field, and that many implementations use that length to resize the
> >> >> > > > root-level FieldNodes. The use-case I have in mind has different logical
> >> >> > > > lengths per field node; current implementations require sending a
> >> >> > > > RecordBatch length of the max length across all root level field nodes. I
> >> >> > > > believe this requires a copy of data whenever a field node is too short; I
> >> >> > > > don't know if there is a decent solution to this slight inefficiency. I am
> >> >> > > > bringing it up because if "skipping a field node when it is empty" is a
> >> >> > > > feature, then we may not want to allocate space for those nodes given that
> >> >> > > > the record batch length will likely be greater than zero.
> >> >> > > >
> >> >> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <we...@gmail.com> wrote:
> >> >> > > >
> >> >> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <ap...@lidavidm.me> wrote:
> >> >> > > > > >
> >> >> > > > > > From the Flatbuffers internals doc[1] it appears they are the same:
> >> >> > > > > "Strings are simply a vector of bytes, and are always null-terminated."
> >> >> > > > >
> >> >> > > > > I see. I took a look at flatbuffers.h, and it appears that changing
> >> >> > > > > this field from string to [byte] would be backward-compatible and
> >> >> > > > > forward-compatible except with code that expects a null terminator.
> >> >> > > > > This is something we could discuss separately if there were enough
> >> >> > > > > interest.
> >> >> > > > >
> >> >> > > > > > [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html
> >> >> > > > > >
> >> >> > > > > > -David
> >> >> > > > > >
> >> >> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> >> >> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> >> >> > > > emkornfield@gmail.com>
> >> >> > > > > wrote:
> >> >> > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > > > > Right, I had wanted to focus the discussion on Flight as I think
> >> >> > > > > schema
> >> >> > > > > > > > > evolution or multiplexing streams (more so the latter) is a
> >> >> > > > > property of the
> >> >> > > > > > > > > transport and not the stream format itself. If we are leaning
> >> >> > > > > towards just
> >> >> > > > > > > > > schema evolution then maybe it makes sense to discuss it for the
> >> >> > > > > IPC stream
> >> >> > > > > > > > > format and leverage that in Flight. I'd be interested in what
> >> >> > > > > others think.
> >> >> > > > > > > >
> >> >> > > > > > > > I tend to agree, I think stream multiplexing is likely a transport
> >> >> > > > > level
> >> >> > > > > > > > issue.  IMO I think schema evolution should be consistent with the
> >> >> > > > > IPC
> >> >> > > > > > > > stream format  and flight.
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > > Nate: it may be worth starting a separate discussion about more
> >> >> > > > > general
> >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why key-value
> >> >> > > > > metadata was
> >> >> > > > > > > > > chosen/if opaque bytes were considered in the past.
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > I think  this was an unfortunate design of the key value metadata
> >> >> > > > in
> >> >> > > > > > > > Schema.fbs, but I don't think I was around when this decision was
> >> >> > > > > made.
> >> >> > > > > > >
> >> >> > > > > > > I agree that it's unfortunate that we did not use [ byte ] instead of
> >> >> > > > > > > string for the value in the KeyValue metadata — I think this was more
> >> >> > > > > > > of an oversight than a deliberate choice (e.g. it was not our intent
> >> >> > > > > > > to require binary data to be base64-encoded — this is something that
> >> >> > > > > > > we have to do when encoding binary data in Thrift KeyValue metadata
> >> >> > > > > > > for Parquet, for example). Is the binary representation of [byte]
> >> >> > > > > > > different from string?
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > >
> >> >> > > > > > > > Side Question: Why isn't the IPC stream format a series of the
> >> >> > > > flight
> >> >> > > > > > > > > protobufs?
> >> >> > > > > > > >
> >> >> > > > > > > > In addition to what David said, protobufs can't be read directly
> >> >> > > > > from a
> >> >> > > > > > > > memory-mapped file (they need decoding).  This was one of the
> >> >> > > > design
> >> >> > > > > > > > considerations of using flatbuffers and IPC Stream/File format.
> >> >> > > > > > > >
> >> >> > > > > > > > I was thinking Micah's comment is more that whatever we do, it
> >> >> > > > > should be
> >> >> > > > > > > > > clearly specified and edge cases should be considered, especially
> >> >> > > > > if we
> >> >> > > > > > > > > might want to 'backport' this into the stream format later.
> >> >> > > > > > > >
> >> >> > > > > > > >
> >> >> > > > > > > > Yes, for dictionaries we just need to be careful to define
> >> >> > > > semantics
> >> >> > > > > and
> >> >> > > > > > > > ensure implementations are validating them with regards to
> >> >> > > > > dictionaries.
> >> >> > > > > > > > There likely isn't any need to change current implementations
> >> >> > > > though.
> >> >> > > > > > > >
> >> >> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <li...@apache.org>
> >> >> > > > > wrote:
> >> >> > > > > > > >
> >> >> > > > > > > > > Right, I had wanted to focus the discussion on Flight as I think
> >> >> > > > > schema
> >> >> > > > > > > > > evolution or multiplexing streams (more so the latter) is a
> >> >> > > > > property of the
> >> >> > > > > > > > > transport and not the stream format itself. If we are leaning
> >> >> > > > > towards just
> >> >> > > > > > > > > schema evolution then maybe it makes sense to discuss it for the
> >> >> > > > > IPC stream
> >> >> > > > > > > > > format and leverage that in Flight. I'd be interested in what
> >> >> > > > > others think.
> >> >> > > > > > > > >
> >> >> > > > > > > > > Especially if we are looking at multiplexing streams - I would
> >> >> > > > > wonder if
> >> >> > > > > > > > > that's actually better served by making it easier to implement
> >> >> > > > > using the
> >> >> > > > > > > > > Flight implementation as it stands (by managing concurrent RPC
> >> >> > > > > calls and/or
> >> >> > > > > > > > > performing the union-of-structs encoding trick for you), instead
> >> >> > > > > of having
> >> >> > > > > > > > > to change the protocol.
> >> >> > > > > > > > >
> >> >> > > > > > > > > Nate: it may be worth starting a separate discussion about more
> >> >> > > > > general
> >> >> > > > > > > > > metadata in the IPC message. I'm not aware of why key-value
> >> >> > > > > metadata was
> >> >> > > > > > > > > chosen/if opaque bytes were considered in the past. Off the top
> >> >> > > > of
> >> >> > > > > my head
> >> >> > > > > > > > > if it's for on-disk storage and fully application-defined it may
> >> >> > > > > make sense
> >> >> > > > > > > > > to store as a separate file alongside the Arrow file (indexed by
> >> >> > > > > record
> >> >> > > > > > > > > batch index) where you can take advantage of whatever format is
> >> >> > > > > most
> >> >> > > > > > > > > suitable.
> >> >> > > > > > > > >
> >> >> > > > > > > > > -David
> >> >> > > > > > > > >
> >> >> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
> >> >> > > > > > > > > > Hi guys,
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial suggestion was
> >> >> > > > to
> >> >> > > > > add this
> >> >> > > > > > > > > > feature starting from the IPC(I moved initial write up steps to
> >> >> > > > > the
> >> >> > > > > > > > > bottom
> >> >> > > > > > > > > > of the doc). Afterwards David suggested focusing on Flight and
> >> >> > > > > that's how
> >> >> > > > > > > > > > we ended up with the protobufs change in the proposal. This
> >> >> > > > > being said I
> >> >> > > > > > > > > do
> >> >> > > > > > > > > > think that the place where this should be impemented is a good
> >> >> > > > > question
> >> >> > > > > > > > > on
> >> >> > > > > > > > > > its own. Maybe it makes sense to have this kind of a feature in
> >> >> > > > > IPC and
> >> >> > > > > > > > > > somehow use it in Flight, maybe not.
> >> >> > > > > > > > > > 2. The point about dictionaries deserves a dedicated section in
> >> >> > > > > the
> >> >> > > > > > > > > > proposal. Nate and David brought it up and shared some
> >> >> > > > insights.
> >> >> > > > > I'll try
> >> >> > > > > > > > > > to aggregate them and we can continue the discussion form
> >> >> > > > there.
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > Cheers,
> >> >> > > > > > > > > > Gosh
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> >> >> > > > > > > > > natebauernfeind@deephaven.io>
> >> >> > > > > > > > > > wrote:
> >> >> > > > > > > > > >
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > > > makes it more difficult to bring schema evolution back
> >> >> > > > > into the
> >> >> > > > > > > > > > > > > > IPC Stream format (i.e. it would live only in flight)
> >> >> > > > > > > > > > > > >
> >> >> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer structures not the
> >> >> > > > > > > > > protobufs.
> >> >> > > > > > > > > > > Can
> >> >> > > > > > > > > > > > > you help me understand how difficult it would be to bring
> >> >> > > > > the
> >> >> > > > > > > > > > > `schema_id`
> >> >> > > > > > > > > > > > > approach to the IPC stream format?
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > I thought we were talking solely about the Flight Protobuf
> >> >> > > > > > > > > definitions -
> >> >> > > > > > > > > > > > not the Flatbuffers (and the Google doc at least only talks
> >> >> > > > > about the
> >> >> > > > > > > > > > > > Protobufs).
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > I somehow missed that schema_id is being added to protobuf in
> >> >> > > > > the
> >> >> > > > > > > > > document.
> >> >> > > > > > > > > > > It feels to me that the schema_id is a property that would
> >> >> > > > > ideally only
> >> >> > > > > > > > > > > apply to the RecordBatch. I better understand Micah's
> >> >> > > > > dictionary
> >> >> > > > > > > > > concerns,
> >> >> > > > > > > > > > > now, too.
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > > Side Question: Why isn't the IPC stream format a series of
> >> >> > > > > the flight
> >> >> > > > > > > > > > > > > protobufs? It's a real shame that there is no standard
> >> >> > > > way
> >> >> > > > > to
> >> >> > > > > > > > > > > > > capture/replay a stream with app_metadata. (Obviously
> >> >> > > > > ignoring the
> >> >> > > > > > > > > > > > > annoyances around protobuf wrapping flatbuffers.)
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > > > The IPC format was defined long before Flight, and Flight's
> >> >> > > > > > > > > app_metadata
> >> >> > > > > > > > > > > > was added after Flight's initial definition. Note an IPC
> >> >> > > > > message does
> >> >> > > > > > > > > > > have
> >> >> > > > > > > > > > > > a provision for key-value metadata, though I think APIs for
> >> >> > > > > that are
> >> >> > > > > > > > > not
> >> >> > > > > > > > > > > > fully exposed. (See ARROW-6940:
> >> >> > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and
> >> >> > > > > despite my
> >> >> > > > > > > > > comments
> >> >> > > > > > > > > > > > there perhaps we need to unify or at least consider how
> >> >> > > > > Flight's
> >> >> > > > > > > > > > > > app_metadata relates to the IPC message custom_metadata.
> >> >> > > > Also
> >> >> > > > > > > > > perhaps see
> >> >> > > > > > > > > > > > ARROW-1059.)
> >> >> > > > > > > > > > > >
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > KeyValue unfortunately is string to string. In flatbuffer
> >> >> > > > > strings are
> >> >> > > > > > > > > only
> >> >> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other hand is
> >> >> > > > > opaque
> >> >> > > > > > > > > bytes.
> >> >> > > > > > > > > > > The latter is a bit more useful.
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > > > --
> >> >> > > > > > > > > > >
> >> >> > > > > > > > > >
> >> >> > > > > > > > >
> >> >> > > > > > >
> >> >> > > > >
> >> >> > > >
> >> >> > > >
> >> >> > > > --
> >> >> > > >

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Wes McKinney <we...@gmail.com>.
Ah, that's definitely a no-go then (I believe we verify messages
unconditionally in C++). That's unfortunate (and I feel responsible
for missing this, but I suppose we had a lot of opportunities to fix
it prior to the 1.0.0 format version) — so to have actual binary
values (which was the intention in the first place for the metadata)
we would need to add a new metadata field.

On Tue, Aug 10, 2021 at 6:53 AM Micah Kornfield <em...@gmail.com> wrote:
>
> One issue with changing it to byte is it would effectively break any reader that is validating flatbuffer data, because flatbuffers verifies null termination [1].  While this might comply with forward compatibility guarantees it seems like a pretty large blast radius.
>
> [1] https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L2457
>
> On Mon, Jul 12, 2021 at 12:38 PM Wes McKinney <we...@gmail.com> wrote:
>>
>> pyarrow at least treats the KeyValue values as binary and not UTF-8.
>>
>> On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield <em...@gmail.com> wrote:
>> >
>> > I think other languages (e.g. java, python) might make more of distinction between utf-8 compatible strings and raw bytes.  For python it might be less of a concern if the c++ wrapper already makes the value field look like a bytes field
>> >
>> > On Sunday, July 11, 2021, Wes McKinney <we...@gmail.com> wrote:
>> >>
>> >> We could certainly "upgrade" KeyValue to have a binary value field
>> >> everywhere KeyValue is used, but there is some risk of code in the
>> >> wild expecting there to be a null terminator after the string data.
>> >> The Flatbuffers-generated accessor APIs do not depend on the existence
>> >> of the null terminator, though. Not ideal, but I would not be thrilled
>> >> about adding an extra [ BinaryKeyValue ] everyplace we currently have
>> >> [ KeyValue ].
>> >>
>> >> That said, I doubt that we have any endogenous forward compatibility
>> >> problems related to this in Apache Arrow-maintained libraries, the
>> >> risk would come from users who are interacting with the Flatbuffers
>> >> data manually / without using one of our libraries. We could implement
>> >> the changes and run a set of forward compatibility integration tests
>> >> to see if anyone of our released libraries have an issue.
>> >>
>> >> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <we...@gmail.com> wrote:
>> >> >
>> >> > The cost of an empty vector in Flatbuffers appears to be 4 bytes.
>> >> >
>> >> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <em...@gmail.com> wrote:
>> >> > >
>> >> > > Retitling and forking the discussion to talk about key value pairs.
>> >> > >
>> >> > > What is the byte cost of an empty list?  Another option would be to
>> >> > > introduce a new BinaryKeyValue table and add binary metadata.
>> >> > >
>> >> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
>> >> > > natebauernfeind@deephaven.io> wrote:
>> >> > >
>> >> > > > Deephaven and I are very supportive of "upgrading" the value half of the kv
>> >> > > > pair to a byte vector. What is the best way to find out if there is
>> >> > > > sufficient interest?
>> >> > > >
>> >> > > >
>> >> > > > I've been stewing on the ideas here around schema evolution, and I realize
>> >> > > > the specific feature I am missing is the ability to encode that a field
>> >> > > > (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
>> >> > > > empty/has-no-data in O(0) cost (yes; for free).
>> >> > > >
>> >> > > > Might there be interest in adding a "field_id" to the FieldNode (which is
>> >> > > > encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible
>> >> > > > upgrade (by either keying off of 0, or explicitly set the field default to
>> >> > > > -1) which would allow the sender to "skip" fields that have 1) FieldNode
>> >> > > > length of zero, and 2) all Buffer's associated at that level (and further
>> >> > > > nested) are also equally empty (i.e. Buffer length is zero).
>> >> > > >
>> >> > > > I understand this concept slightly interferes with RecordBatch's `length`
>> >> > > > field, and that many implementations use that length to resize the
>> >> > > > root-level FieldNodes. The use-case I have in mind has different logical
>> >> > > > lengths per field node; current implementations require sending a
>> >> > > > RecordBatch length of the max length across all root level field nodes. I
>> >> > > > believe this requires a copy of data whenever a field node is too short; I
>> >> > > > don't know if there is a decent solution to this slight inefficiency. I am
>> >> > > > bringing it up because if "skipping a field node when it is empty" is a
>> >> > > > feature, then we may not want to allocate space for those nodes given that
>> >> > > > the record batch length will likely be greater than zero.
>> >> > > >
>> >> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <we...@gmail.com> wrote:
>> >> > > >
>> >> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <ap...@lidavidm.me> wrote:
>> >> > > > > >
>> >> > > > > > From the Flatbuffers internals doc[1] it appears they are the same:
>> >> > > > > "Strings are simply a vector of bytes, and are always null-terminated."
>> >> > > > >
>> >> > > > > I see. I took a look at flatbuffers.h, and it appears that changing
>> >> > > > > this field from string to [byte] would be backward-compatible and
>> >> > > > > forward-compatible except with code that expects a null terminator.
>> >> > > > > This is something we could discuss separately if there were enough
>> >> > > > > interest.
>> >> > > > >
>> >> > > > > > [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html
>> >> > > > > >
>> >> > > > > > -David
>> >> > > > > >
>> >> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
>> >> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
>> >> > > > emkornfield@gmail.com>
>> >> > > > > wrote:
>> >> > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > > > > Right, I had wanted to focus the discussion on Flight as I think
>> >> > > > > schema
>> >> > > > > > > > > evolution or multiplexing streams (more so the latter) is a
>> >> > > > > property of the
>> >> > > > > > > > > transport and not the stream format itself. If we are leaning
>> >> > > > > towards just
>> >> > > > > > > > > schema evolution then maybe it makes sense to discuss it for the
>> >> > > > > IPC stream
>> >> > > > > > > > > format and leverage that in Flight. I'd be interested in what
>> >> > > > > others think.
>> >> > > > > > > >
>> >> > > > > > > > I tend to agree, I think stream multiplexing is likely a transport
>> >> > > > > level
>> >> > > > > > > > issue.  IMO I think schema evolution should be consistent with the
>> >> > > > > IPC
>> >> > > > > > > > stream format  and flight.
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > > Nate: it may be worth starting a separate discussion about more
>> >> > > > > general
>> >> > > > > > > > > metadata in the IPC message. I'm not aware of why key-value
>> >> > > > > metadata was
>> >> > > > > > > > > chosen/if opaque bytes were considered in the past.
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > I think  this was an unfortunate design of the key value metadata
>> >> > > > in
>> >> > > > > > > > Schema.fbs, but I don't think I was around when this decision was
>> >> > > > > made.
>> >> > > > > > >
>> >> > > > > > > I agree that it's unfortunate that we did not use [ byte ] instead of
>> >> > > > > > > string for the value in the KeyValue metadata — I think this was more
>> >> > > > > > > of an oversight than a deliberate choice (e.g. it was not our intent
>> >> > > > > > > to require binary data to be base64-encoded — this is something that
>> >> > > > > > > we have to do when encoding binary data in Thrift KeyValue metadata
>> >> > > > > > > for Parquet, for example). Is the binary representation of [byte]
>> >> > > > > > > different from string?
>> >> > > > > > >
>> >> > > > > > >
>> >> > > > > > >
>> >> > > > > > > > Side Question: Why isn't the IPC stream format a series of the
>> >> > > > flight
>> >> > > > > > > > > protobufs?
>> >> > > > > > > >
>> >> > > > > > > > In addition to what David said, protobufs can't be read directly
>> >> > > > > from a
>> >> > > > > > > > memory-mapped file (they need decoding).  This was one of the
>> >> > > > design
>> >> > > > > > > > considerations of using flatbuffers and IPC Stream/File format.
>> >> > > > > > > >
>> >> > > > > > > > I was thinking Micah's comment is more that whatever we do, it
>> >> > > > > should be
>> >> > > > > > > > > clearly specified and edge cases should be considered, especially
>> >> > > > > if we
>> >> > > > > > > > > might want to 'backport' this into the stream format later.
>> >> > > > > > > >
>> >> > > > > > > >
>> >> > > > > > > > Yes, for dictionaries we just need to be careful to define
>> >> > > > semantics
>> >> > > > > and
>> >> > > > > > > > ensure implementations are validating them with regards to
>> >> > > > > dictionaries.
>> >> > > > > > > > There likely isn't any need to change current implementations
>> >> > > > though.
>> >> > > > > > > >
>> >> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <li...@apache.org>
>> >> > > > > wrote:
>> >> > > > > > > >
>> >> > > > > > > > > Right, I had wanted to focus the discussion on Flight as I think
>> >> > > > > schema
>> >> > > > > > > > > evolution or multiplexing streams (more so the latter) is a
>> >> > > > > property of the
>> >> > > > > > > > > transport and not the stream format itself. If we are leaning
>> >> > > > > towards just
>> >> > > > > > > > > schema evolution then maybe it makes sense to discuss it for the
>> >> > > > > IPC stream
>> >> > > > > > > > > format and leverage that in Flight. I'd be interested in what
>> >> > > > > others think.
>> >> > > > > > > > >
>> >> > > > > > > > > Especially if we are looking at multiplexing streams - I would
>> >> > > > > wonder if
>> >> > > > > > > > > that's actually better served by making it easier to implement
>> >> > > > > using the
>> >> > > > > > > > > Flight implementation as it stands (by managing concurrent RPC
>> >> > > > > calls and/or
>> >> > > > > > > > > performing the union-of-structs encoding trick for you), instead
>> >> > > > > of having
>> >> > > > > > > > > to change the protocol.
>> >> > > > > > > > >
>> >> > > > > > > > > Nate: it may be worth starting a separate discussion about more
>> >> > > > > general
>> >> > > > > > > > > metadata in the IPC message. I'm not aware of why key-value
>> >> > > > > metadata was
>> >> > > > > > > > > chosen/if opaque bytes were considered in the past. Off the top
>> >> > > > of
>> >> > > > > my head
>> >> > > > > > > > > if it's for on-disk storage and fully application-defined it may
>> >> > > > > make sense
>> >> > > > > > > > > to store as a separate file alongside the Arrow file (indexed by
>> >> > > > > record
>> >> > > > > > > > > batch index) where you can take advantage of whatever format is
>> >> > > > > most
>> >> > > > > > > > > suitable.
>> >> > > > > > > > >
>> >> > > > > > > > > -David
>> >> > > > > > > > >
>> >> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
>> >> > > > > > > > > > Hi guys,
>> >> > > > > > > > > >
>> >> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial suggestion was
>> >> > > > to
>> >> > > > > add this
>> >> > > > > > > > > > feature starting from the IPC(I moved initial write up steps to
>> >> > > > > the
>> >> > > > > > > > > bottom
>> >> > > > > > > > > > of the doc). Afterwards David suggested focusing on Flight and
>> >> > > > > that's how
>> >> > > > > > > > > > we ended up with the protobufs change in the proposal. This
>> >> > > > > being said I
>> >> > > > > > > > > do
>> >> > > > > > > > > > think that the place where this should be impemented is a good
>> >> > > > > question
>> >> > > > > > > > > on
>> >> > > > > > > > > > its own. Maybe it makes sense to have this kind of a feature in
>> >> > > > > IPC and
>> >> > > > > > > > > > somehow use it in Flight, maybe not.
>> >> > > > > > > > > > 2. The point about dictionaries deserves a dedicated section in
>> >> > > > > the
>> >> > > > > > > > > > proposal. Nate and David brought it up and shared some
>> >> > > > insights.
>> >> > > > > I'll try
>> >> > > > > > > > > > to aggregate them and we can continue the discussion form
>> >> > > > there.
>> >> > > > > > > > > >
>> >> > > > > > > > > > Cheers,
>> >> > > > > > > > > > Gosh
>> >> > > > > > > > > >
>> >> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
>> >> > > > > > > > > natebauernfeind@deephaven.io>
>> >> > > > > > > > > > wrote:
>> >> > > > > > > > > >
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > > > makes it more difficult to bring schema evolution back
>> >> > > > > into the
>> >> > > > > > > > > > > > > > IPC Stream format (i.e. it would live only in flight)
>> >> > > > > > > > > > > > >
>> >> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer structures not the
>> >> > > > > > > > > protobufs.
>> >> > > > > > > > > > > Can
>> >> > > > > > > > > > > > > you help me understand how difficult it would be to bring
>> >> > > > > the
>> >> > > > > > > > > > > `schema_id`
>> >> > > > > > > > > > > > > approach to the IPC stream format?
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > I thought we were talking solely about the Flight Protobuf
>> >> > > > > > > > > definitions -
>> >> > > > > > > > > > > > not the Flatbuffers (and the Google doc at least only talks
>> >> > > > > about the
>> >> > > > > > > > > > > > Protobufs).
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > I somehow missed that schema_id is being added to protobuf in
>> >> > > > > the
>> >> > > > > > > > > document.
>> >> > > > > > > > > > > It feels to me that the schema_id is a property that would
>> >> > > > > ideally only
>> >> > > > > > > > > > > apply to the RecordBatch. I better understand Micah's
>> >> > > > > dictionary
>> >> > > > > > > > > concerns,
>> >> > > > > > > > > > > now, too.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > > Side Question: Why isn't the IPC stream format a series of
>> >> > > > > the flight
>> >> > > > > > > > > > > > > protobufs? It's a real shame that there is no standard
>> >> > > > way
>> >> > > > > to
>> >> > > > > > > > > > > > > capture/replay a stream with app_metadata. (Obviously
>> >> > > > > ignoring the
>> >> > > > > > > > > > > > > annoyances around protobuf wrapping flatbuffers.)
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > > > The IPC format was defined long before Flight, and Flight's
>> >> > > > > > > > > app_metadata
>> >> > > > > > > > > > > > was added after Flight's initial definition. Note an IPC
>> >> > > > > message does
>> >> > > > > > > > > > > have
>> >> > > > > > > > > > > > a provision for key-value metadata, though I think APIs for
>> >> > > > > that are
>> >> > > > > > > > > not
>> >> > > > > > > > > > > > fully exposed. (See ARROW-6940:
>> >> > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and
>> >> > > > > despite my
>> >> > > > > > > > > comments
>> >> > > > > > > > > > > > there perhaps we need to unify or at least consider how
>> >> > > > > Flight's
>> >> > > > > > > > > > > > app_metadata relates to the IPC message custom_metadata.
>> >> > > > Also
>> >> > > > > > > > > perhaps see
>> >> > > > > > > > > > > > ARROW-1059.)
>> >> > > > > > > > > > > >
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > KeyValue unfortunately is string to string. In flatbuffer
>> >> > > > > strings are
>> >> > > > > > > > > only
>> >> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other hand is
>> >> > > > > opaque
>> >> > > > > > > > > bytes.
>> >> > > > > > > > > > > The latter is a bit more useful.
>> >> > > > > > > > > > >
>> >> > > > > > > > > > > --
>> >> > > > > > > > > > >
>> >> > > > > > > > > >
>> >> > > > > > > > >
>> >> > > > > > >
>> >> > > > >
>> >> > > >
>> >> > > >
>> >> > > > --
>> >> > > >

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Micah Kornfield <em...@gmail.com>.
One issue with changing it to byte is it would effectively break any reader
that is validating flatbuffer data, because flatbuffers verifies null
termination [1].  While this might comply with forward compatibility
guarantees it seems like a pretty large blast radius.

[1]
https://github.com/google/flatbuffers/blob/master/include/flatbuffers/flatbuffers.h#L2457

On Mon, Jul 12, 2021 at 12:38 PM Wes McKinney <we...@gmail.com> wrote:

> pyarrow at least treats the KeyValue values as binary and not UTF-8.
>
> On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > I think other languages (e.g. java, python) might make more of
> distinction between utf-8 compatible strings and raw bytes.  For python it
> might be less of a concern if the c++ wrapper already makes the value field
> look like a bytes field
> >
> > On Sunday, July 11, 2021, Wes McKinney <we...@gmail.com> wrote:
> >>
> >> We could certainly "upgrade" KeyValue to have a binary value field
> >> everywhere KeyValue is used, but there is some risk of code in the
> >> wild expecting there to be a null terminator after the string data.
> >> The Flatbuffers-generated accessor APIs do not depend on the existence
> >> of the null terminator, though. Not ideal, but I would not be thrilled
> >> about adding an extra [ BinaryKeyValue ] everyplace we currently have
> >> [ KeyValue ].
> >>
> >> That said, I doubt that we have any endogenous forward compatibility
> >> problems related to this in Apache Arrow-maintained libraries, the
> >> risk would come from users who are interacting with the Flatbuffers
> >> data manually / without using one of our libraries. We could implement
> >> the changes and run a set of forward compatibility integration tests
> >> to see if anyone of our released libraries have an issue.
> >>
> >> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <we...@gmail.com>
> wrote:
> >> >
> >> > The cost of an empty vector in Flatbuffers appears to be 4 bytes.
> >> >
> >> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >> > >
> >> > > Retitling and forking the discussion to talk about key value pairs.
> >> > >
> >> > > What is the byte cost of an empty list?  Another option would be to
> >> > > introduce a new BinaryKeyValue table and add binary metadata.
> >> > >
> >> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
> >> > > natebauernfeind@deephaven.io> wrote:
> >> > >
> >> > > > Deephaven and I are very supportive of "upgrading" the value half
> of the kv
> >> > > > pair to a byte vector. What is the best way to find out if there
> is
> >> > > > sufficient interest?
> >> > > >
> >> > > >
> >> > > > I've been stewing on the ideas here around schema evolution, and
> I realize
> >> > > > the specific feature I am missing is the ability to encode that a
> field
> >> > > > (i.e. its FieldNode and accompanying Buffers in the RecordBatch)
> is
> >> > > > empty/has-no-data in O(0) cost (yes; for free).
> >> > > >
> >> > > > Might there be interest in adding a "field_id" to the FieldNode
> (which is
> >> > > > encoded on the RecordBatch flatbuffer)? I see a simple
> forward-compatible
> >> > > > upgrade (by either keying off of 0, or explicitly set the field
> default to
> >> > > > -1) which would allow the sender to "skip" fields that have 1)
> FieldNode
> >> > > > length of zero, and 2) all Buffer's associated at that level (and
> further
> >> > > > nested) are also equally empty (i.e. Buffer length is zero).
> >> > > >
> >> > > > I understand this concept slightly interferes with RecordBatch's
> `length`
> >> > > > field, and that many implementations use that length to resize the
> >> > > > root-level FieldNodes. The use-case I have in mind has different
> logical
> >> > > > lengths per field node; current implementations require sending a
> >> > > > RecordBatch length of the max length across all root level field
> nodes. I
> >> > > > believe this requires a copy of data whenever a field node is too
> short; I
> >> > > > don't know if there is a decent solution to this slight
> inefficiency. I am
> >> > > > bringing it up because if "skipping a field node when it is
> empty" is a
> >> > > > feature, then we may not want to allocate space for those nodes
> given that
> >> > > > the record batch length will likely be greater than zero.
> >> > > >
> >> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <we...@gmail.com>
> wrote:
> >> > > >
> >> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <ap...@lidavidm.me>
> wrote:
> >> > > > > >
> >> > > > > > From the Flatbuffers internals doc[1] it appears they are the
> same:
> >> > > > > "Strings are simply a vector of bytes, and are always
> null-terminated."
> >> > > > >
> >> > > > > I see. I took a look at flatbuffers.h, and it appears that
> changing
> >> > > > > this field from string to [byte] would be backward-compatible
> and
> >> > > > > forward-compatible except with code that expects a null
> terminator.
> >> > > > > This is something we could discuss separately if there were
> enough
> >> > > > > interest.
> >> > > > >
> >> > > > > > [1]:
> https://google.github.io/flatbuffers/flatbuffers_internals.html
> >> > > > > >
> >> > > > > > -David
> >> > > > > >
> >> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> >> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> >> > > > emkornfield@gmail.com>
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > Right, I had wanted to focus the discussion on Flight
> as I think
> >> > > > > schema
> >> > > > > > > > > evolution or multiplexing streams (more so the latter)
> is a
> >> > > > > property of the
> >> > > > > > > > > transport and not the stream format itself. If we are
> leaning
> >> > > > > towards just
> >> > > > > > > > > schema evolution then maybe it makes sense to discuss
> it for the
> >> > > > > IPC stream
> >> > > > > > > > > format and leverage that in Flight. I'd be interested
> in what
> >> > > > > others think.
> >> > > > > > > >
> >> > > > > > > > I tend to agree, I think stream multiplexing is likely a
> transport
> >> > > > > level
> >> > > > > > > > issue.  IMO I think schema evolution should be consistent
> with the
> >> > > > > IPC
> >> > > > > > > > stream format  and flight.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > > Nate: it may be worth starting a separate discussion
> about more
> >> > > > > general
> >> > > > > > > > > metadata in the IPC message. I'm not aware of why
> key-value
> >> > > > > metadata was
> >> > > > > > > > > chosen/if opaque bytes were considered in the past.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > I think  this was an unfortunate design of the key value
> metadata
> >> > > > in
> >> > > > > > > > Schema.fbs, but I don't think I was around when this
> decision was
> >> > > > > made.
> >> > > > > > >
> >> > > > > > > I agree that it's unfortunate that we did not use [ byte ]
> instead of
> >> > > > > > > string for the value in the KeyValue metadata — I think
> this was more
> >> > > > > > > of an oversight than a deliberate choice (e.g. it was not
> our intent
> >> > > > > > > to require binary data to be base64-encoded — this is
> something that
> >> > > > > > > we have to do when encoding binary data in Thrift KeyValue
> metadata
> >> > > > > > > for Parquet, for example). Is the binary representation of
> [byte]
> >> > > > > > > different from string?
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > > Side Question: Why isn't the IPC stream format a series
> of the
> >> > > > flight
> >> > > > > > > > > protobufs?
> >> > > > > > > >
> >> > > > > > > > In addition to what David said, protobufs can't be read
> directly
> >> > > > > from a
> >> > > > > > > > memory-mapped file (they need decoding).  This was one of
> the
> >> > > > design
> >> > > > > > > > considerations of using flatbuffers and IPC Stream/File
> format.
> >> > > > > > > >
> >> > > > > > > > I was thinking Micah's comment is more that whatever we
> do, it
> >> > > > > should be
> >> > > > > > > > > clearly specified and edge cases should be considered,
> especially
> >> > > > > if we
> >> > > > > > > > > might want to 'backport' this into the stream format
> later.
> >> > > > > > > >
> >> > > > > > > >
> >> > > > > > > > Yes, for dictionaries we just need to be careful to define
> >> > > > semantics
> >> > > > > and
> >> > > > > > > > ensure implementations are validating them with regards to
> >> > > > > dictionaries.
> >> > > > > > > > There likely isn't any need to change current
> implementations
> >> > > > though.
> >> > > > > > > >
> >> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <
> lidavidm@apache.org>
> >> > > > > wrote:
> >> > > > > > > >
> >> > > > > > > > > Right, I had wanted to focus the discussion on Flight
> as I think
> >> > > > > schema
> >> > > > > > > > > evolution or multiplexing streams (more so the latter)
> is a
> >> > > > > property of the
> >> > > > > > > > > transport and not the stream format itself. If we are
> leaning
> >> > > > > towards just
> >> > > > > > > > > schema evolution then maybe it makes sense to discuss
> it for the
> >> > > > > IPC stream
> >> > > > > > > > > format and leverage that in Flight. I'd be interested
> in what
> >> > > > > others think.
> >> > > > > > > > >
> >> > > > > > > > > Especially if we are looking at multiplexing streams -
> I would
> >> > > > > wonder if
> >> > > > > > > > > that's actually better served by making it easier to
> implement
> >> > > > > using the
> >> > > > > > > > > Flight implementation as it stands (by managing
> concurrent RPC
> >> > > > > calls and/or
> >> > > > > > > > > performing the union-of-structs encoding trick for
> you), instead
> >> > > > > of having
> >> > > > > > > > > to change the protocol.
> >> > > > > > > > >
> >> > > > > > > > > Nate: it may be worth starting a separate discussion
> about more
> >> > > > > general
> >> > > > > > > > > metadata in the IPC message. I'm not aware of why
> key-value
> >> > > > > metadata was
> >> > > > > > > > > chosen/if opaque bytes were considered in the past. Off
> the top
> >> > > > of
> >> > > > > my head
> >> > > > > > > > > if it's for on-disk storage and fully
> application-defined it may
> >> > > > > make sense
> >> > > > > > > > > to store as a separate file alongside the Arrow file
> (indexed by
> >> > > > > record
> >> > > > > > > > > batch index) where you can take advantage of whatever
> format is
> >> > > > > most
> >> > > > > > > > > suitable.
> >> > > > > > > > >
> >> > > > > > > > > -David
> >> > > > > > > > >
> >> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
> >> > > > > > > > > > Hi guys,
> >> > > > > > > > > >
> >> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial
> suggestion was
> >> > > > to
> >> > > > > add this
> >> > > > > > > > > > feature starting from the IPC(I moved initial write
> up steps to
> >> > > > > the
> >> > > > > > > > > bottom
> >> > > > > > > > > > of the doc). Afterwards David suggested focusing on
> Flight and
> >> > > > > that's how
> >> > > > > > > > > > we ended up with the protobufs change in the
> proposal. This
> >> > > > > being said I
> >> > > > > > > > > do
> >> > > > > > > > > > think that the place where this should be impemented
> is a good
> >> > > > > question
> >> > > > > > > > > on
> >> > > > > > > > > > its own. Maybe it makes sense to have this kind of a
> feature in
> >> > > > > IPC and
> >> > > > > > > > > > somehow use it in Flight, maybe not.
> >> > > > > > > > > > 2. The point about dictionaries deserves a dedicated
> section in
> >> > > > > the
> >> > > > > > > > > > proposal. Nate and David brought it up and shared some
> >> > > > insights.
> >> > > > > I'll try
> >> > > > > > > > > > to aggregate them and we can continue the discussion
> form
> >> > > > there.
> >> > > > > > > > > >
> >> > > > > > > > > > Cheers,
> >> > > > > > > > > > Gosh
> >> > > > > > > > > >
> >> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> >> > > > > > > > > natebauernfeind@deephaven.io>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > > > makes it more difficult to bring schema
> evolution back
> >> > > > > into the
> >> > > > > > > > > > > > > > IPC Stream format (i.e. it would live only in
> flight)
> >> > > > > > > > > > > > >
> >> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer
> structures not the
> >> > > > > > > > > protobufs.
> >> > > > > > > > > > > Can
> >> > > > > > > > > > > > > you help me understand how difficult it would
> be to bring
> >> > > > > the
> >> > > > > > > > > > > `schema_id`
> >> > > > > > > > > > > > > approach to the IPC stream format?
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > I thought we were talking solely about the Flight
> Protobuf
> >> > > > > > > > > definitions -
> >> > > > > > > > > > > > not the Flatbuffers (and the Google doc at least
> only talks
> >> > > > > about the
> >> > > > > > > > > > > > Protobufs).
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > I somehow missed that schema_id is being added to
> protobuf in
> >> > > > > the
> >> > > > > > > > > document.
> >> > > > > > > > > > > It feels to me that the schema_id is a property
> that would
> >> > > > > ideally only
> >> > > > > > > > > > > apply to the RecordBatch. I better understand
> Micah's
> >> > > > > dictionary
> >> > > > > > > > > concerns,
> >> > > > > > > > > > > now, too.
> >> > > > > > > > > > >
> >> > > > > > > > > > > > Side Question: Why isn't the IPC stream format a
> series of
> >> > > > > the flight
> >> > > > > > > > > > > > > protobufs? It's a real shame that there is no
> standard
> >> > > > way
> >> > > > > to
> >> > > > > > > > > > > > > capture/replay a stream with app_metadata.
> (Obviously
> >> > > > > ignoring the
> >> > > > > > > > > > > > > annoyances around protobuf wrapping
> flatbuffers.)
> >> > > > > > > > > > > >
> >> > > > > > > > > > > > The IPC format was defined long before Flight,
> and Flight's
> >> > > > > > > > > app_metadata
> >> > > > > > > > > > > > was added after Flight's initial definition. Note
> an IPC
> >> > > > > message does
> >> > > > > > > > > > > have
> >> > > > > > > > > > > > a provision for key-value metadata, though I
> think APIs for
> >> > > > > that are
> >> > > > > > > > > not
> >> > > > > > > > > > > > fully exposed. (See ARROW-6940:
> >> > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940
> and
> >> > > > > despite my
> >> > > > > > > > > comments
> >> > > > > > > > > > > > there perhaps we need to unify or at least
> consider how
> >> > > > > Flight's
> >> > > > > > > > > > > > app_metadata relates to the IPC message
> custom_metadata.
> >> > > > Also
> >> > > > > > > > > perhaps see
> >> > > > > > > > > > > > ARROW-1059.)
> >> > > > > > > > > > > >
> >> > > > > > > > > > >
> >> > > > > > > > > > > KeyValue unfortunately is string to string. In
> flatbuffer
> >> > > > > strings are
> >> > > > > > > > > only
> >> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other
> hand is
> >> > > > > opaque
> >> > > > > > > > > bytes.
> >> > > > > > > > > > > The latter is a bit more useful.
> >> > > > > > > > > > >
> >> > > > > > > > > > > --
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > >
> >> > > > >
> >> > > >
> >> > > >
> >> > > > --
> >> > > >
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Wes McKinney <we...@gmail.com>.
pyarrow at least treats the KeyValue values as binary and not UTF-8.

On Sun, Jul 11, 2021 at 9:40 PM Micah Kornfield <em...@gmail.com> wrote:
>
> I think other languages (e.g. java, python) might make more of distinction between utf-8 compatible strings and raw bytes.  For python it might be less of a concern if the c++ wrapper already makes the value field look like a bytes field
>
> On Sunday, July 11, 2021, Wes McKinney <we...@gmail.com> wrote:
>>
>> We could certainly "upgrade" KeyValue to have a binary value field
>> everywhere KeyValue is used, but there is some risk of code in the
>> wild expecting there to be a null terminator after the string data.
>> The Flatbuffers-generated accessor APIs do not depend on the existence
>> of the null terminator, though. Not ideal, but I would not be thrilled
>> about adding an extra [ BinaryKeyValue ] everyplace we currently have
>> [ KeyValue ].
>>
>> That said, I doubt that we have any endogenous forward compatibility
>> problems related to this in Apache Arrow-maintained libraries, the
>> risk would come from users who are interacting with the Flatbuffers
>> data manually / without using one of our libraries. We could implement
>> the changes and run a set of forward compatibility integration tests
>> to see if anyone of our released libraries have an issue.
>>
>> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <we...@gmail.com> wrote:
>> >
>> > The cost of an empty vector in Flatbuffers appears to be 4 bytes.
>> >
>> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <em...@gmail.com> wrote:
>> > >
>> > > Retitling and forking the discussion to talk about key value pairs.
>> > >
>> > > What is the byte cost of an empty list?  Another option would be to
>> > > introduce a new BinaryKeyValue table and add binary metadata.
>> > >
>> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
>> > > natebauernfeind@deephaven.io> wrote:
>> > >
>> > > > Deephaven and I are very supportive of "upgrading" the value half of the kv
>> > > > pair to a byte vector. What is the best way to find out if there is
>> > > > sufficient interest?
>> > > >
>> > > >
>> > > > I've been stewing on the ideas here around schema evolution, and I realize
>> > > > the specific feature I am missing is the ability to encode that a field
>> > > > (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
>> > > > empty/has-no-data in O(0) cost (yes; for free).
>> > > >
>> > > > Might there be interest in adding a "field_id" to the FieldNode (which is
>> > > > encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible
>> > > > upgrade (by either keying off of 0, or explicitly set the field default to
>> > > > -1) which would allow the sender to "skip" fields that have 1) FieldNode
>> > > > length of zero, and 2) all Buffer's associated at that level (and further
>> > > > nested) are also equally empty (i.e. Buffer length is zero).
>> > > >
>> > > > I understand this concept slightly interferes with RecordBatch's `length`
>> > > > field, and that many implementations use that length to resize the
>> > > > root-level FieldNodes. The use-case I have in mind has different logical
>> > > > lengths per field node; current implementations require sending a
>> > > > RecordBatch length of the max length across all root level field nodes. I
>> > > > believe this requires a copy of data whenever a field node is too short; I
>> > > > don't know if there is a decent solution to this slight inefficiency. I am
>> > > > bringing it up because if "skipping a field node when it is empty" is a
>> > > > feature, then we may not want to allocate space for those nodes given that
>> > > > the record batch length will likely be greater than zero.
>> > > >
>> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <we...@gmail.com> wrote:
>> > > >
>> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <ap...@lidavidm.me> wrote:
>> > > > > >
>> > > > > > From the Flatbuffers internals doc[1] it appears they are the same:
>> > > > > "Strings are simply a vector of bytes, and are always null-terminated."
>> > > > >
>> > > > > I see. I took a look at flatbuffers.h, and it appears that changing
>> > > > > this field from string to [byte] would be backward-compatible and
>> > > > > forward-compatible except with code that expects a null terminator.
>> > > > > This is something we could discuss separately if there were enough
>> > > > > interest.
>> > > > >
>> > > > > > [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html
>> > > > > >
>> > > > > > -David
>> > > > > >
>> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
>> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
>> > > > emkornfield@gmail.com>
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > >
>> > > > > > > > > Right, I had wanted to focus the discussion on Flight as I think
>> > > > > schema
>> > > > > > > > > evolution or multiplexing streams (more so the latter) is a
>> > > > > property of the
>> > > > > > > > > transport and not the stream format itself. If we are leaning
>> > > > > towards just
>> > > > > > > > > schema evolution then maybe it makes sense to discuss it for the
>> > > > > IPC stream
>> > > > > > > > > format and leverage that in Flight. I'd be interested in what
>> > > > > others think.
>> > > > > > > >
>> > > > > > > > I tend to agree, I think stream multiplexing is likely a transport
>> > > > > level
>> > > > > > > > issue.  IMO I think schema evolution should be consistent with the
>> > > > > IPC
>> > > > > > > > stream format  and flight.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > > Nate: it may be worth starting a separate discussion about more
>> > > > > general
>> > > > > > > > > metadata in the IPC message. I'm not aware of why key-value
>> > > > > metadata was
>> > > > > > > > > chosen/if opaque bytes were considered in the past.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > I think  this was an unfortunate design of the key value metadata
>> > > > in
>> > > > > > > > Schema.fbs, but I don't think I was around when this decision was
>> > > > > made.
>> > > > > > >
>> > > > > > > I agree that it's unfortunate that we did not use [ byte ] instead of
>> > > > > > > string for the value in the KeyValue metadata — I think this was more
>> > > > > > > of an oversight than a deliberate choice (e.g. it was not our intent
>> > > > > > > to require binary data to be base64-encoded — this is something that
>> > > > > > > we have to do when encoding binary data in Thrift KeyValue metadata
>> > > > > > > for Parquet, for example). Is the binary representation of [byte]
>> > > > > > > different from string?
>> > > > > > >
>> > > > > > >
>> > > > > > >
>> > > > > > > > Side Question: Why isn't the IPC stream format a series of the
>> > > > flight
>> > > > > > > > > protobufs?
>> > > > > > > >
>> > > > > > > > In addition to what David said, protobufs can't be read directly
>> > > > > from a
>> > > > > > > > memory-mapped file (they need decoding).  This was one of the
>> > > > design
>> > > > > > > > considerations of using flatbuffers and IPC Stream/File format.
>> > > > > > > >
>> > > > > > > > I was thinking Micah's comment is more that whatever we do, it
>> > > > > should be
>> > > > > > > > > clearly specified and edge cases should be considered, especially
>> > > > > if we
>> > > > > > > > > might want to 'backport' this into the stream format later.
>> > > > > > > >
>> > > > > > > >
>> > > > > > > > Yes, for dictionaries we just need to be careful to define
>> > > > semantics
>> > > > > and
>> > > > > > > > ensure implementations are validating them with regards to
>> > > > > dictionaries.
>> > > > > > > > There likely isn't any need to change current implementations
>> > > > though.
>> > > > > > > >
>> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <li...@apache.org>
>> > > > > wrote:
>> > > > > > > >
>> > > > > > > > > Right, I had wanted to focus the discussion on Flight as I think
>> > > > > schema
>> > > > > > > > > evolution or multiplexing streams (more so the latter) is a
>> > > > > property of the
>> > > > > > > > > transport and not the stream format itself. If we are leaning
>> > > > > towards just
>> > > > > > > > > schema evolution then maybe it makes sense to discuss it for the
>> > > > > IPC stream
>> > > > > > > > > format and leverage that in Flight. I'd be interested in what
>> > > > > others think.
>> > > > > > > > >
>> > > > > > > > > Especially if we are looking at multiplexing streams - I would
>> > > > > wonder if
>> > > > > > > > > that's actually better served by making it easier to implement
>> > > > > using the
>> > > > > > > > > Flight implementation as it stands (by managing concurrent RPC
>> > > > > calls and/or
>> > > > > > > > > performing the union-of-structs encoding trick for you), instead
>> > > > > of having
>> > > > > > > > > to change the protocol.
>> > > > > > > > >
>> > > > > > > > > Nate: it may be worth starting a separate discussion about more
>> > > > > general
>> > > > > > > > > metadata in the IPC message. I'm not aware of why key-value
>> > > > > metadata was
>> > > > > > > > > chosen/if opaque bytes were considered in the past. Off the top
>> > > > of
>> > > > > my head
>> > > > > > > > > if it's for on-disk storage and fully application-defined it may
>> > > > > make sense
>> > > > > > > > > to store as a separate file alongside the Arrow file (indexed by
>> > > > > record
>> > > > > > > > > batch index) where you can take advantage of whatever format is
>> > > > > most
>> > > > > > > > > suitable.
>> > > > > > > > >
>> > > > > > > > > -David
>> > > > > > > > >
>> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
>> > > > > > > > > > Hi guys,
>> > > > > > > > > >
>> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial suggestion was
>> > > > to
>> > > > > add this
>> > > > > > > > > > feature starting from the IPC(I moved initial write up steps to
>> > > > > the
>> > > > > > > > > bottom
>> > > > > > > > > > of the doc). Afterwards David suggested focusing on Flight and
>> > > > > that's how
>> > > > > > > > > > we ended up with the protobufs change in the proposal. This
>> > > > > being said I
>> > > > > > > > > do
>> > > > > > > > > > think that the place where this should be impemented is a good
>> > > > > question
>> > > > > > > > > on
>> > > > > > > > > > its own. Maybe it makes sense to have this kind of a feature in
>> > > > > IPC and
>> > > > > > > > > > somehow use it in Flight, maybe not.
>> > > > > > > > > > 2. The point about dictionaries deserves a dedicated section in
>> > > > > the
>> > > > > > > > > > proposal. Nate and David brought it up and shared some
>> > > > insights.
>> > > > > I'll try
>> > > > > > > > > > to aggregate them and we can continue the discussion form
>> > > > there.
>> > > > > > > > > >
>> > > > > > > > > > Cheers,
>> > > > > > > > > > Gosh
>> > > > > > > > > >
>> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
>> > > > > > > > > natebauernfeind@deephaven.io>
>> > > > > > > > > > wrote:
>> > > > > > > > > >
>> > > > > > > > > > > >
>> > > > > > > > > > > > > > makes it more difficult to bring schema evolution back
>> > > > > into the
>> > > > > > > > > > > > > > IPC Stream format (i.e. it would live only in flight)
>> > > > > > > > > > > > >
>> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer structures not the
>> > > > > > > > > protobufs.
>> > > > > > > > > > > Can
>> > > > > > > > > > > > > you help me understand how difficult it would be to bring
>> > > > > the
>> > > > > > > > > > > `schema_id`
>> > > > > > > > > > > > > approach to the IPC stream format?
>> > > > > > > > > > > >
>> > > > > > > > > > > > I thought we were talking solely about the Flight Protobuf
>> > > > > > > > > definitions -
>> > > > > > > > > > > > not the Flatbuffers (and the Google doc at least only talks
>> > > > > about the
>> > > > > > > > > > > > Protobufs).
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > I somehow missed that schema_id is being added to protobuf in
>> > > > > the
>> > > > > > > > > document.
>> > > > > > > > > > > It feels to me that the schema_id is a property that would
>> > > > > ideally only
>> > > > > > > > > > > apply to the RecordBatch. I better understand Micah's
>> > > > > dictionary
>> > > > > > > > > concerns,
>> > > > > > > > > > > now, too.
>> > > > > > > > > > >
>> > > > > > > > > > > > Side Question: Why isn't the IPC stream format a series of
>> > > > > the flight
>> > > > > > > > > > > > > protobufs? It's a real shame that there is no standard
>> > > > way
>> > > > > to
>> > > > > > > > > > > > > capture/replay a stream with app_metadata. (Obviously
>> > > > > ignoring the
>> > > > > > > > > > > > > annoyances around protobuf wrapping flatbuffers.)
>> > > > > > > > > > > >
>> > > > > > > > > > > > The IPC format was defined long before Flight, and Flight's
>> > > > > > > > > app_metadata
>> > > > > > > > > > > > was added after Flight's initial definition. Note an IPC
>> > > > > message does
>> > > > > > > > > > > have
>> > > > > > > > > > > > a provision for key-value metadata, though I think APIs for
>> > > > > that are
>> > > > > > > > > not
>> > > > > > > > > > > > fully exposed. (See ARROW-6940:
>> > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and
>> > > > > despite my
>> > > > > > > > > comments
>> > > > > > > > > > > > there perhaps we need to unify or at least consider how
>> > > > > Flight's
>> > > > > > > > > > > > app_metadata relates to the IPC message custom_metadata.
>> > > > Also
>> > > > > > > > > perhaps see
>> > > > > > > > > > > > ARROW-1059.)
>> > > > > > > > > > > >
>> > > > > > > > > > >
>> > > > > > > > > > > KeyValue unfortunately is string to string. In flatbuffer
>> > > > > strings are
>> > > > > > > > > only
>> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other hand is
>> > > > > opaque
>> > > > > > > > > bytes.
>> > > > > > > > > > > The latter is a bit more useful.
>> > > > > > > > > > >
>> > > > > > > > > > > --
>> > > > > > > > > > >
>> > > > > > > > > >
>> > > > > > > > >
>> > > > > > >
>> > > > >
>> > > >
>> > > >
>> > > > --
>> > > >

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Micah Kornfield <em...@gmail.com>.
I think other languages (e.g. java, python) might make more of distinction
between utf-8 compatible strings and raw bytes.  For python it might be
less of a concern if the c++ wrapper already makes the value field look
like a bytes field

On Sunday, July 11, 2021, Wes McKinney <we...@gmail.com> wrote:

> We could certainly "upgrade" KeyValue to have a binary value field
> everywhere KeyValue is used, but there is some risk of code in the
> wild expecting there to be a null terminator after the string data.
> The Flatbuffers-generated accessor APIs do not depend on the existence
> of the null terminator, though. Not ideal, but I would not be thrilled
> about adding an extra [ BinaryKeyValue ] everyplace we currently have
> [ KeyValue ].
>
> That said, I doubt that we have any endogenous forward compatibility
> problems related to this in Apache Arrow-maintained libraries, the
> risk would come from users who are interacting with the Flatbuffers
> data manually / without using one of our libraries. We could implement
> the changes and run a set of forward compatibility integration tests
> to see if anyone of our released libraries have an issue.
>
> On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > The cost of an empty vector in Flatbuffers appears to be 4 bytes.
> >
> > On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <em...@gmail.com>
> wrote:
> > >
> > > Retitling and forking the discussion to talk about key value pairs.
> > >
> > > What is the byte cost of an empty list?  Another option would be to
> > > introduce a new BinaryKeyValue table and add binary metadata.
> > >
> > > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
> > > natebauernfeind@deephaven.io> wrote:
> > >
> > > > Deephaven and I are very supportive of "upgrading" the value half of
> the kv
> > > > pair to a byte vector. What is the best way to find out if there is
> > > > sufficient interest?
> > > >
> > > >
> > > > I've been stewing on the ideas here around schema evolution, and I
> realize
> > > > the specific feature I am missing is the ability to encode that a
> field
> > > > (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
> > > > empty/has-no-data in O(0) cost (yes; for free).
> > > >
> > > > Might there be interest in adding a "field_id" to the FieldNode
> (which is
> > > > encoded on the RecordBatch flatbuffer)? I see a simple
> forward-compatible
> > > > upgrade (by either keying off of 0, or explicitly set the field
> default to
> > > > -1) which would allow the sender to "skip" fields that have 1)
> FieldNode
> > > > length of zero, and 2) all Buffer's associated at that level (and
> further
> > > > nested) are also equally empty (i.e. Buffer length is zero).
> > > >
> > > > I understand this concept slightly interferes with RecordBatch's
> `length`
> > > > field, and that many implementations use that length to resize the
> > > > root-level FieldNodes. The use-case I have in mind has different
> logical
> > > > lengths per field node; current implementations require sending a
> > > > RecordBatch length of the max length across all root level field
> nodes. I
> > > > believe this requires a copy of data whenever a field node is too
> short; I
> > > > don't know if there is a decent solution to this slight
> inefficiency. I am
> > > > bringing it up because if "skipping a field node when it is empty"
> is a
> > > > feature, then we may not want to allocate space for those nodes
> given that
> > > > the record batch length will likely be greater than zero.
> > > >
> > > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <we...@gmail.com>
> wrote:
> > > >
> > > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <ap...@lidavidm.me>
> wrote:
> > > > > >
> > > > > > From the Flatbuffers internals doc[1] it appears they are the
> same:
> > > > > "Strings are simply a vector of bytes, and are always
> null-terminated."
> > > > >
> > > > > I see. I took a look at flatbuffers.h, and it appears that changing
> > > > > this field from string to [byte] would be backward-compatible and
> > > > > forward-compatible except with code that expects a null terminator.
> > > > > This is something we could discuss separately if there were enough
> > > > > interest.
> > > > >
> > > > > > [1]: https://google.github.io/flatbuffers/flatbuffers_
> internals.html
> > > > > >
> > > > > > -David
> > > > > >
> > > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> > > > emkornfield@gmail.com>
> > > > > wrote:
> > > > > > > >
> > > > > > > > >
> > > > > > > > > Right, I had wanted to focus the discussion on Flight as I
> think
> > > > > schema
> > > > > > > > > evolution or multiplexing streams (more so the latter) is a
> > > > > property of the
> > > > > > > > > transport and not the stream format itself. If we are
> leaning
> > > > > towards just
> > > > > > > > > schema evolution then maybe it makes sense to discuss it
> for the
> > > > > IPC stream
> > > > > > > > > format and leverage that in Flight. I'd be interested in
> what
> > > > > others think.
> > > > > > > >
> > > > > > > > I tend to agree, I think stream multiplexing is likely a
> transport
> > > > > level
> > > > > > > > issue.  IMO I think schema evolution should be consistent
> with the
> > > > > IPC
> > > > > > > > stream format  and flight.
> > > > > > > >
> > > > > > > >
> > > > > > > > > Nate: it may be worth starting a separate discussion about
> more
> > > > > general
> > > > > > > > > metadata in the IPC message. I'm not aware of why key-value
> > > > > metadata was
> > > > > > > > > chosen/if opaque bytes were considered in the past.
> > > > > > > >
> > > > > > > >
> > > > > > > > I think  this was an unfortunate design of the key value
> metadata
> > > > in
> > > > > > > > Schema.fbs, but I don't think I was around when this
> decision was
> > > > > made.
> > > > > > >
> > > > > > > I agree that it's unfortunate that we did not use [ byte ]
> instead of
> > > > > > > string for the value in the KeyValue metadata — I think this
> was more
> > > > > > > of an oversight than a deliberate choice (e.g. it was not our
> intent
> > > > > > > to require binary data to be base64-encoded — this is
> something that
> > > > > > > we have to do when encoding binary data in Thrift KeyValue
> metadata
> > > > > > > for Parquet, for example). Is the binary representation of
> [byte]
> > > > > > > different from string?
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > Side Question: Why isn't the IPC stream format a series of
> the
> > > > flight
> > > > > > > > > protobufs?
> > > > > > > >
> > > > > > > > In addition to what David said, protobufs can't be read
> directly
> > > > > from a
> > > > > > > > memory-mapped file (they need decoding).  This was one of the
> > > > design
> > > > > > > > considerations of using flatbuffers and IPC Stream/File
> format.
> > > > > > > >
> > > > > > > > I was thinking Micah's comment is more that whatever we do,
> it
> > > > > should be
> > > > > > > > > clearly specified and edge cases should be considered,
> especially
> > > > > if we
> > > > > > > > > might want to 'backport' this into the stream format later.
> > > > > > > >
> > > > > > > >
> > > > > > > > Yes, for dictionaries we just need to be careful to define
> > > > semantics
> > > > > and
> > > > > > > > ensure implementations are validating them with regards to
> > > > > dictionaries.
> > > > > > > > There likely isn't any need to change current implementations
> > > > though.
> > > > > > > >
> > > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <
> lidavidm@apache.org>
> > > > > wrote:
> > > > > > > >
> > > > > > > > > Right, I had wanted to focus the discussion on Flight as I
> think
> > > > > schema
> > > > > > > > > evolution or multiplexing streams (more so the latter) is a
> > > > > property of the
> > > > > > > > > transport and not the stream format itself. If we are
> leaning
> > > > > towards just
> > > > > > > > > schema evolution then maybe it makes sense to discuss it
> for the
> > > > > IPC stream
> > > > > > > > > format and leverage that in Flight. I'd be interested in
> what
> > > > > others think.
> > > > > > > > >
> > > > > > > > > Especially if we are looking at multiplexing streams - I
> would
> > > > > wonder if
> > > > > > > > > that's actually better served by making it easier to
> implement
> > > > > using the
> > > > > > > > > Flight implementation as it stands (by managing concurrent
> RPC
> > > > > calls and/or
> > > > > > > > > performing the union-of-structs encoding trick for you),
> instead
> > > > > of having
> > > > > > > > > to change the protocol.
> > > > > > > > >
> > > > > > > > > Nate: it may be worth starting a separate discussion about
> more
> > > > > general
> > > > > > > > > metadata in the IPC message. I'm not aware of why key-value
> > > > > metadata was
> > > > > > > > > chosen/if opaque bytes were considered in the past. Off
> the top
> > > > of
> > > > > my head
> > > > > > > > > if it's for on-disk storage and fully application-defined
> it may
> > > > > make sense
> > > > > > > > > to store as a separate file alongside the Arrow file
> (indexed by
> > > > > record
> > > > > > > > > batch index) where you can take advantage of whatever
> format is
> > > > > most
> > > > > > > > > suitable.
> > > > > > > > >
> > > > > > > > > -David
> > > > > > > > >
> > > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
> > > > > > > > > > Hi guys,
> > > > > > > > > >
> > > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial
> suggestion was
> > > > to
> > > > > add this
> > > > > > > > > > feature starting from the IPC(I moved initial write up
> steps to
> > > > > the
> > > > > > > > > bottom
> > > > > > > > > > of the doc). Afterwards David suggested focusing on
> Flight and
> > > > > that's how
> > > > > > > > > > we ended up with the protobufs change in the proposal.
> This
> > > > > being said I
> > > > > > > > > do
> > > > > > > > > > think that the place where this should be impemented is
> a good
> > > > > question
> > > > > > > > > on
> > > > > > > > > > its own. Maybe it makes sense to have this kind of a
> feature in
> > > > > IPC and
> > > > > > > > > > somehow use it in Flight, maybe not.
> > > > > > > > > > 2. The point about dictionaries deserves a dedicated
> section in
> > > > > the
> > > > > > > > > > proposal. Nate and David brought it up and shared some
> > > > insights.
> > > > > I'll try
> > > > > > > > > > to aggregate them and we can continue the discussion form
> > > > there.
> > > > > > > > > >
> > > > > > > > > > Cheers,
> > > > > > > > > > Gosh
> > > > > > > > > >
> > > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> > > > > > > > > natebauernfeind@deephaven.io>
> > > > > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > >
> > > > > > > > > > > > > > makes it more difficult to bring schema
> evolution back
> > > > > into the
> > > > > > > > > > > > > > IPC Stream format (i.e. it would live only in
> flight)
> > > > > > > > > > > > >
> > > > > > > > > > > > > Gosh's proposal extends the flatbuffer structures
> not the
> > > > > > > > > protobufs.
> > > > > > > > > > > Can
> > > > > > > > > > > > > you help me understand how difficult it would be
> to bring
> > > > > the
> > > > > > > > > > > `schema_id`
> > > > > > > > > > > > > approach to the IPC stream format?
> > > > > > > > > > > >
> > > > > > > > > > > > I thought we were talking solely about the Flight
> Protobuf
> > > > > > > > > definitions -
> > > > > > > > > > > > not the Flatbuffers (and the Google doc at least
> only talks
> > > > > about the
> > > > > > > > > > > > Protobufs).
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > I somehow missed that schema_id is being added to
> protobuf in
> > > > > the
> > > > > > > > > document.
> > > > > > > > > > > It feels to me that the schema_id is a property that
> would
> > > > > ideally only
> > > > > > > > > > > apply to the RecordBatch. I better understand Micah's
> > > > > dictionary
> > > > > > > > > concerns,
> > > > > > > > > > > now, too.
> > > > > > > > > > >
> > > > > > > > > > > > Side Question: Why isn't the IPC stream format a
> series of
> > > > > the flight
> > > > > > > > > > > > > protobufs? It's a real shame that there is no
> standard
> > > > way
> > > > > to
> > > > > > > > > > > > > capture/replay a stream with app_metadata.
> (Obviously
> > > > > ignoring the
> > > > > > > > > > > > > annoyances around protobuf wrapping flatbuffers.)
> > > > > > > > > > > >
> > > > > > > > > > > > The IPC format was defined long before Flight, and
> Flight's
> > > > > > > > > app_metadata
> > > > > > > > > > > > was added after Flight's initial definition. Note an
> IPC
> > > > > message does
> > > > > > > > > > > have
> > > > > > > > > > > > a provision for key-value metadata, though I think
> APIs for
> > > > > that are
> > > > > > > > > not
> > > > > > > > > > > > fully exposed. (See ARROW-6940:
> > > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and
> > > > > despite my
> > > > > > > > > comments
> > > > > > > > > > > > there perhaps we need to unify or at least consider
> how
> > > > > Flight's
> > > > > > > > > > > > app_metadata relates to the IPC message
> custom_metadata.
> > > > Also
> > > > > > > > > perhaps see
> > > > > > > > > > > > ARROW-1059.)
> > > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > KeyValue unfortunately is string to string. In
> flatbuffer
> > > > > strings are
> > > > > > > > > only
> > > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other
> hand is
> > > > > opaque
> > > > > > > > > bytes.
> > > > > > > > > > > The latter is a bit more useful.
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > >
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Wes McKinney <we...@gmail.com>.
We could certainly "upgrade" KeyValue to have a binary value field
everywhere KeyValue is used, but there is some risk of code in the
wild expecting there to be a null terminator after the string data.
The Flatbuffers-generated accessor APIs do not depend on the existence
of the null terminator, though. Not ideal, but I would not be thrilled
about adding an extra [ BinaryKeyValue ] everyplace we currently have
[ KeyValue ].

That said, I doubt that we have any endogenous forward compatibility
problems related to this in Apache Arrow-maintained libraries, the
risk would come from users who are interacting with the Flatbuffers
data manually / without using one of our libraries. We could implement
the changes and run a set of forward compatibility integration tests
to see if anyone of our released libraries have an issue.

On Fri, Jul 9, 2021 at 11:33 AM Wes McKinney <we...@gmail.com> wrote:
>
> The cost of an empty vector in Flatbuffers appears to be 4 bytes.
>
> On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <em...@gmail.com> wrote:
> >
> > Retitling and forking the discussion to talk about key value pairs.
> >
> > What is the byte cost of an empty list?  Another option would be to
> > introduce a new BinaryKeyValue table and add binary metadata.
> >
> > On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
> > natebauernfeind@deephaven.io> wrote:
> >
> > > Deephaven and I are very supportive of "upgrading" the value half of the kv
> > > pair to a byte vector. What is the best way to find out if there is
> > > sufficient interest?
> > >
> > >
> > > I've been stewing on the ideas here around schema evolution, and I realize
> > > the specific feature I am missing is the ability to encode that a field
> > > (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
> > > empty/has-no-data in O(0) cost (yes; for free).
> > >
> > > Might there be interest in adding a "field_id" to the FieldNode (which is
> > > encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible
> > > upgrade (by either keying off of 0, or explicitly set the field default to
> > > -1) which would allow the sender to "skip" fields that have 1) FieldNode
> > > length of zero, and 2) all Buffer's associated at that level (and further
> > > nested) are also equally empty (i.e. Buffer length is zero).
> > >
> > > I understand this concept slightly interferes with RecordBatch's `length`
> > > field, and that many implementations use that length to resize the
> > > root-level FieldNodes. The use-case I have in mind has different logical
> > > lengths per field node; current implementations require sending a
> > > RecordBatch length of the max length across all root level field nodes. I
> > > believe this requires a copy of data whenever a field node is too short; I
> > > don't know if there is a decent solution to this slight inefficiency. I am
> > > bringing it up because if "skipping a field node when it is empty" is a
> > > feature, then we may not want to allocate space for those nodes given that
> > > the record batch length will likely be greater than zero.
> > >
> > > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > On Wed, Jul 7, 2021 at 2:53 PM David Li <ap...@lidavidm.me> wrote:
> > > > >
> > > > > From the Flatbuffers internals doc[1] it appears they are the same:
> > > > "Strings are simply a vector of bytes, and are always null-terminated."
> > > >
> > > > I see. I took a look at flatbuffers.h, and it appears that changing
> > > > this field from string to [byte] would be backward-compatible and
> > > > forward-compatible except with code that expects a null terminator.
> > > > This is something we could discuss separately if there were enough
> > > > interest.
> > > >
> > > > > [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html
> > > > >
> > > > > -David
> > > > >
> > > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> > > emkornfield@gmail.com>
> > > > wrote:
> > > > > > >
> > > > > > > >
> > > > > > > > Right, I had wanted to focus the discussion on Flight as I think
> > > > schema
> > > > > > > > evolution or multiplexing streams (more so the latter) is a
> > > > property of the
> > > > > > > > transport and not the stream format itself. If we are leaning
> > > > towards just
> > > > > > > > schema evolution then maybe it makes sense to discuss it for the
> > > > IPC stream
> > > > > > > > format and leverage that in Flight. I'd be interested in what
> > > > others think.
> > > > > > >
> > > > > > > I tend to agree, I think stream multiplexing is likely a transport
> > > > level
> > > > > > > issue.  IMO I think schema evolution should be consistent with the
> > > > IPC
> > > > > > > stream format  and flight.
> > > > > > >
> > > > > > >
> > > > > > > > Nate: it may be worth starting a separate discussion about more
> > > > general
> > > > > > > > metadata in the IPC message. I'm not aware of why key-value
> > > > metadata was
> > > > > > > > chosen/if opaque bytes were considered in the past.
> > > > > > >
> > > > > > >
> > > > > > > I think  this was an unfortunate design of the key value metadata
> > > in
> > > > > > > Schema.fbs, but I don't think I was around when this decision was
> > > > made.
> > > > > >
> > > > > > I agree that it's unfortunate that we did not use [ byte ] instead of
> > > > > > string for the value in the KeyValue metadata — I think this was more
> > > > > > of an oversight than a deliberate choice (e.g. it was not our intent
> > > > > > to require binary data to be base64-encoded — this is something that
> > > > > > we have to do when encoding binary data in Thrift KeyValue metadata
> > > > > > for Parquet, for example). Is the binary representation of [byte]
> > > > > > different from string?
> > > > > >
> > > > > >
> > > > > >
> > > > > > > Side Question: Why isn't the IPC stream format a series of the
> > > flight
> > > > > > > > protobufs?
> > > > > > >
> > > > > > > In addition to what David said, protobufs can't be read directly
> > > > from a
> > > > > > > memory-mapped file (they need decoding).  This was one of the
> > > design
> > > > > > > considerations of using flatbuffers and IPC Stream/File format.
> > > > > > >
> > > > > > > I was thinking Micah's comment is more that whatever we do, it
> > > > should be
> > > > > > > > clearly specified and edge cases should be considered, especially
> > > > if we
> > > > > > > > might want to 'backport' this into the stream format later.
> > > > > > >
> > > > > > >
> > > > > > > Yes, for dictionaries we just need to be careful to define
> > > semantics
> > > > and
> > > > > > > ensure implementations are validating them with regards to
> > > > dictionaries.
> > > > > > > There likely isn't any need to change current implementations
> > > though.
> > > > > > >
> > > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <li...@apache.org>
> > > > wrote:
> > > > > > >
> > > > > > > > Right, I had wanted to focus the discussion on Flight as I think
> > > > schema
> > > > > > > > evolution or multiplexing streams (more so the latter) is a
> > > > property of the
> > > > > > > > transport and not the stream format itself. If we are leaning
> > > > towards just
> > > > > > > > schema evolution then maybe it makes sense to discuss it for the
> > > > IPC stream
> > > > > > > > format and leverage that in Flight. I'd be interested in what
> > > > others think.
> > > > > > > >
> > > > > > > > Especially if we are looking at multiplexing streams - I would
> > > > wonder if
> > > > > > > > that's actually better served by making it easier to implement
> > > > using the
> > > > > > > > Flight implementation as it stands (by managing concurrent RPC
> > > > calls and/or
> > > > > > > > performing the union-of-structs encoding trick for you), instead
> > > > of having
> > > > > > > > to change the protocol.
> > > > > > > >
> > > > > > > > Nate: it may be worth starting a separate discussion about more
> > > > general
> > > > > > > > metadata in the IPC message. I'm not aware of why key-value
> > > > metadata was
> > > > > > > > chosen/if opaque bytes were considered in the past. Off the top
> > > of
> > > > my head
> > > > > > > > if it's for on-disk storage and fully application-defined it may
> > > > make sense
> > > > > > > > to store as a separate file alongside the Arrow file (indexed by
> > > > record
> > > > > > > > batch index) where you can take advantage of whatever format is
> > > > most
> > > > > > > > suitable.
> > > > > > > >
> > > > > > > > -David
> > > > > > > >
> > > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
> > > > > > > > > Hi guys,
> > > > > > > > >
> > > > > > > > > 1. Regarding IPC vs Flight: in fact my initial suggestion was
> > > to
> > > > add this
> > > > > > > > > feature starting from the IPC(I moved initial write up steps to
> > > > the
> > > > > > > > bottom
> > > > > > > > > of the doc). Afterwards David suggested focusing on Flight and
> > > > that's how
> > > > > > > > > we ended up with the protobufs change in the proposal. This
> > > > being said I
> > > > > > > > do
> > > > > > > > > think that the place where this should be impemented is a good
> > > > question
> > > > > > > > on
> > > > > > > > > its own. Maybe it makes sense to have this kind of a feature in
> > > > IPC and
> > > > > > > > > somehow use it in Flight, maybe not.
> > > > > > > > > 2. The point about dictionaries deserves a dedicated section in
> > > > the
> > > > > > > > > proposal. Nate and David brought it up and shared some
> > > insights.
> > > > I'll try
> > > > > > > > > to aggregate them and we can continue the discussion form
> > > there.
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Gosh
> > > > > > > > >
> > > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> > > > > > > > natebauernfeind@deephaven.io>
> > > > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > > > makes it more difficult to bring schema evolution back
> > > > into the
> > > > > > > > > > > > > IPC Stream format (i.e. it would live only in flight)
> > > > > > > > > > > >
> > > > > > > > > > > > Gosh's proposal extends the flatbuffer structures not the
> > > > > > > > protobufs.
> > > > > > > > > > Can
> > > > > > > > > > > > you help me understand how difficult it would be to bring
> > > > the
> > > > > > > > > > `schema_id`
> > > > > > > > > > > > approach to the IPC stream format?
> > > > > > > > > > >
> > > > > > > > > > > I thought we were talking solely about the Flight Protobuf
> > > > > > > > definitions -
> > > > > > > > > > > not the Flatbuffers (and the Google doc at least only talks
> > > > about the
> > > > > > > > > > > Protobufs).
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > I somehow missed that schema_id is being added to protobuf in
> > > > the
> > > > > > > > document.
> > > > > > > > > > It feels to me that the schema_id is a property that would
> > > > ideally only
> > > > > > > > > > apply to the RecordBatch. I better understand Micah's
> > > > dictionary
> > > > > > > > concerns,
> > > > > > > > > > now, too.
> > > > > > > > > >
> > > > > > > > > > > Side Question: Why isn't the IPC stream format a series of
> > > > the flight
> > > > > > > > > > > > protobufs? It's a real shame that there is no standard
> > > way
> > > > to
> > > > > > > > > > > > capture/replay a stream with app_metadata. (Obviously
> > > > ignoring the
> > > > > > > > > > > > annoyances around protobuf wrapping flatbuffers.)
> > > > > > > > > > >
> > > > > > > > > > > The IPC format was defined long before Flight, and Flight's
> > > > > > > > app_metadata
> > > > > > > > > > > was added after Flight's initial definition. Note an IPC
> > > > message does
> > > > > > > > > > have
> > > > > > > > > > > a provision for key-value metadata, though I think APIs for
> > > > that are
> > > > > > > > not
> > > > > > > > > > > fully exposed. (See ARROW-6940:
> > > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and
> > > > despite my
> > > > > > > > comments
> > > > > > > > > > > there perhaps we need to unify or at least consider how
> > > > Flight's
> > > > > > > > > > > app_metadata relates to the IPC message custom_metadata.
> > > Also
> > > > > > > > perhaps see
> > > > > > > > > > > ARROW-1059.)
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > KeyValue unfortunately is string to string. In flatbuffer
> > > > strings are
> > > > > > > > only
> > > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other hand is
> > > > opaque
> > > > > > > > bytes.
> > > > > > > > > > The latter is a bit more useful.
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > >
> > > >
> > >
> > >
> > > --
> > >

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Wes McKinney <we...@gmail.com>.
The cost of an empty vector in Flatbuffers appears to be 4 bytes.

On Wed, Jul 7, 2021 at 5:50 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Retitling and forking the discussion to talk about key value pairs.
>
> What is the byte cost of an empty list?  Another option would be to
> introduce a new BinaryKeyValue table and add binary metadata.
>
> On Wed, Jul 7, 2021 at 8:32 AM Nate Bauernfeind <
> natebauernfeind@deephaven.io> wrote:
>
> > Deephaven and I are very supportive of "upgrading" the value half of the kv
> > pair to a byte vector. What is the best way to find out if there is
> > sufficient interest?
> >
> >
> > I've been stewing on the ideas here around schema evolution, and I realize
> > the specific feature I am missing is the ability to encode that a field
> > (i.e. its FieldNode and accompanying Buffers in the RecordBatch) is
> > empty/has-no-data in O(0) cost (yes; for free).
> >
> > Might there be interest in adding a "field_id" to the FieldNode (which is
> > encoded on the RecordBatch flatbuffer)? I see a simple forward-compatible
> > upgrade (by either keying off of 0, or explicitly set the field default to
> > -1) which would allow the sender to "skip" fields that have 1) FieldNode
> > length of zero, and 2) all Buffer's associated at that level (and further
> > nested) are also equally empty (i.e. Buffer length is zero).
> >
> > I understand this concept slightly interferes with RecordBatch's `length`
> > field, and that many implementations use that length to resize the
> > root-level FieldNodes. The use-case I have in mind has different logical
> > lengths per field node; current implementations require sending a
> > RecordBatch length of the max length across all root level field nodes. I
> > believe this requires a copy of data whenever a field node is too short; I
> > don't know if there is a decent solution to this slight inefficiency. I am
> > bringing it up because if "skipping a field node when it is empty" is a
> > feature, then we may not want to allocate space for those nodes given that
> > the record batch length will likely be greater than zero.
> >
> > On Wed, Jul 7, 2021 at 8:12 AM Wes McKinney <we...@gmail.com> wrote:
> >
> > > On Wed, Jul 7, 2021 at 2:53 PM David Li <ap...@lidavidm.me> wrote:
> > > >
> > > > From the Flatbuffers internals doc[1] it appears they are the same:
> > > "Strings are simply a vector of bytes, and are always null-terminated."
> > >
> > > I see. I took a look at flatbuffers.h, and it appears that changing
> > > this field from string to [byte] would be backward-compatible and
> > > forward-compatible except with code that expects a null terminator.
> > > This is something we could discuss separately if there were enough
> > > interest.
> > >
> > > > [1]: https://google.github.io/flatbuffers/flatbuffers_internals.html
> > > >
> > > > -David
> > > >
> > > > On Wed, Jul 7, 2021, at 05:08, Wes McKinney wrote:
> > > > > On Tue, Jul 6, 2021 at 6:33 PM Micah Kornfield <
> > emkornfield@gmail.com>
> > > wrote:
> > > > > >
> > > > > > >
> > > > > > > Right, I had wanted to focus the discussion on Flight as I think
> > > schema
> > > > > > > evolution or multiplexing streams (more so the latter) is a
> > > property of the
> > > > > > > transport and not the stream format itself. If we are leaning
> > > towards just
> > > > > > > schema evolution then maybe it makes sense to discuss it for the
> > > IPC stream
> > > > > > > format and leverage that in Flight. I'd be interested in what
> > > others think.
> > > > > >
> > > > > > I tend to agree, I think stream multiplexing is likely a transport
> > > level
> > > > > > issue.  IMO I think schema evolution should be consistent with the
> > > IPC
> > > > > > stream format  and flight.
> > > > > >
> > > > > >
> > > > > > > Nate: it may be worth starting a separate discussion about more
> > > general
> > > > > > > metadata in the IPC message. I'm not aware of why key-value
> > > metadata was
> > > > > > > chosen/if opaque bytes were considered in the past.
> > > > > >
> > > > > >
> > > > > > I think  this was an unfortunate design of the key value metadata
> > in
> > > > > > Schema.fbs, but I don't think I was around when this decision was
> > > made.
> > > > >
> > > > > I agree that it's unfortunate that we did not use [ byte ] instead of
> > > > > string for the value in the KeyValue metadata — I think this was more
> > > > > of an oversight than a deliberate choice (e.g. it was not our intent
> > > > > to require binary data to be base64-encoded — this is something that
> > > > > we have to do when encoding binary data in Thrift KeyValue metadata
> > > > > for Parquet, for example). Is the binary representation of [byte]
> > > > > different from string?
> > > > >
> > > > >
> > > > >
> > > > > > Side Question: Why isn't the IPC stream format a series of the
> > flight
> > > > > > > protobufs?
> > > > > >
> > > > > > In addition to what David said, protobufs can't be read directly
> > > from a
> > > > > > memory-mapped file (they need decoding).  This was one of the
> > design
> > > > > > considerations of using flatbuffers and IPC Stream/File format.
> > > > > >
> > > > > > I was thinking Micah's comment is more that whatever we do, it
> > > should be
> > > > > > > clearly specified and edge cases should be considered, especially
> > > if we
> > > > > > > might want to 'backport' this into the stream format later.
> > > > > >
> > > > > >
> > > > > > Yes, for dictionaries we just need to be careful to define
> > semantics
> > > and
> > > > > > ensure implementations are validating them with regards to
> > > dictionaries.
> > > > > > There likely isn't any need to change current implementations
> > though.
> > > > > >
> > > > > > On Mon, Jun 28, 2021 at 1:25 PM David Li <li...@apache.org>
> > > wrote:
> > > > > >
> > > > > > > Right, I had wanted to focus the discussion on Flight as I think
> > > schema
> > > > > > > evolution or multiplexing streams (more so the latter) is a
> > > property of the
> > > > > > > transport and not the stream format itself. If we are leaning
> > > towards just
> > > > > > > schema evolution then maybe it makes sense to discuss it for the
> > > IPC stream
> > > > > > > format and leverage that in Flight. I'd be interested in what
> > > others think.
> > > > > > >
> > > > > > > Especially if we are looking at multiplexing streams - I would
> > > wonder if
> > > > > > > that's actually better served by making it easier to implement
> > > using the
> > > > > > > Flight implementation as it stands (by managing concurrent RPC
> > > calls and/or
> > > > > > > performing the union-of-structs encoding trick for you), instead
> > > of having
> > > > > > > to change the protocol.
> > > > > > >
> > > > > > > Nate: it may be worth starting a separate discussion about more
> > > general
> > > > > > > metadata in the IPC message. I'm not aware of why key-value
> > > metadata was
> > > > > > > chosen/if opaque bytes were considered in the past. Off the top
> > of
> > > my head
> > > > > > > if it's for on-disk storage and fully application-defined it may
> > > make sense
> > > > > > > to store as a separate file alongside the Arrow file (indexed by
> > > record
> > > > > > > batch index) where you can take advantage of whatever format is
> > > most
> > > > > > > suitable.
> > > > > > >
> > > > > > > -David
> > > > > > >
> > > > > > > On Sun, Jun 27, 2021, at 07:50, Gosh Arzumanyan wrote:
> > > > > > > > Hi guys,
> > > > > > > >
> > > > > > > > 1. Regarding IPC vs Flight: in fact my initial suggestion was
> > to
> > > add this
> > > > > > > > feature starting from the IPC(I moved initial write up steps to
> > > the
> > > > > > > bottom
> > > > > > > > of the doc). Afterwards David suggested focusing on Flight and
> > > that's how
> > > > > > > > we ended up with the protobufs change in the proposal. This
> > > being said I
> > > > > > > do
> > > > > > > > think that the place where this should be impemented is a good
> > > question
> > > > > > > on
> > > > > > > > its own. Maybe it makes sense to have this kind of a feature in
> > > IPC and
> > > > > > > > somehow use it in Flight, maybe not.
> > > > > > > > 2. The point about dictionaries deserves a dedicated section in
> > > the
> > > > > > > > proposal. Nate and David brought it up and shared some
> > insights.
> > > I'll try
> > > > > > > > to aggregate them and we can continue the discussion form
> > there.
> > > > > > > >
> > > > > > > > Cheers,
> > > > > > > > Gosh
> > > > > > > >
> > > > > > > > On Sat., 26 Jun. 2021, 17:26 Nate Bauernfeind, <
> > > > > > > natebauernfeind@deephaven.io>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > > makes it more difficult to bring schema evolution back
> > > into the
> > > > > > > > > > > > IPC Stream format (i.e. it would live only in flight)
> > > > > > > > > > >
> > > > > > > > > > > Gosh's proposal extends the flatbuffer structures not the
> > > > > > > protobufs.
> > > > > > > > > Can
> > > > > > > > > > > you help me understand how difficult it would be to bring
> > > the
> > > > > > > > > `schema_id`
> > > > > > > > > > > approach to the IPC stream format?
> > > > > > > > > >
> > > > > > > > > > I thought we were talking solely about the Flight Protobuf
> > > > > > > definitions -
> > > > > > > > > > not the Flatbuffers (and the Google doc at least only talks
> > > about the
> > > > > > > > > > Protobufs).
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > I somehow missed that schema_id is being added to protobuf in
> > > the
> > > > > > > document.
> > > > > > > > > It feels to me that the schema_id is a property that would
> > > ideally only
> > > > > > > > > apply to the RecordBatch. I better understand Micah's
> > > dictionary
> > > > > > > concerns,
> > > > > > > > > now, too.
> > > > > > > > >
> > > > > > > > > > Side Question: Why isn't the IPC stream format a series of
> > > the flight
> > > > > > > > > > > protobufs? It's a real shame that there is no standard
> > way
> > > to
> > > > > > > > > > > capture/replay a stream with app_metadata. (Obviously
> > > ignoring the
> > > > > > > > > > > annoyances around protobuf wrapping flatbuffers.)
> > > > > > > > > >
> > > > > > > > > > The IPC format was defined long before Flight, and Flight's
> > > > > > > app_metadata
> > > > > > > > > > was added after Flight's initial definition. Note an IPC
> > > message does
> > > > > > > > > have
> > > > > > > > > > a provision for key-value metadata, though I think APIs for
> > > that are
> > > > > > > not
> > > > > > > > > > fully exposed. (See ARROW-6940:
> > > > > > > > > > https://issues.apache.org/jira/browse/ARROW-6940 and
> > > despite my
> > > > > > > comments
> > > > > > > > > > there perhaps we need to unify or at least consider how
> > > Flight's
> > > > > > > > > > app_metadata relates to the IPC message custom_metadata.
> > Also
> > > > > > > perhaps see
> > > > > > > > > > ARROW-1059.)
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > KeyValue unfortunately is string to string. In flatbuffer
> > > strings are
> > > > > > > only
> > > > > > > > > UTF-8 or 7-bit ASCII. The app_metadata on the other hand is
> > > opaque
> > > > > > > bytes.
> > > > > > > > > The latter is a bit more useful.
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > >
> > >
> >
> >
> > --
> >