You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@arrow.apache.org by Micah Kornfield <em...@gmail.com> on 2021/09/02 05:01:05 UTC

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

It still sounds like adding a new type might be the safest approach (and
marking the old type as discouraged).

On Mon, Aug 23, 2021 at 11:18 AM David Li <li...@apache.org> wrote:

> I believe so.
>
> The encoding of a string in Flatbuffers is [byte] with a null terminator
> not included in the length, so old files should still be readable (they
> would simply not see the terminator anymore). And conversely, continuing to
> write the null terminator means new files should still be readable by
> existing implementations, so long as they don't have binary metadata (but
> we could increment the metadata version before allowing that). There still
> remains the issue of non-UTF8 data inside the "string" value; on that
> point, this is a non-answer (or I guess, it canonicalizes the current
> unintentional behavior).
>
> It would complicate the Java API as that currently works with Strings, but
> that will be an issue for any attempt to add binary metadata.
>
> -David
>
> On Mon, Aug 23, 2021, at 13:40, Antoine Pitrou wrote:
> >
> > Le 23/08/2021 à 17:52, David Li a écrit :
> > > Another way forward might be to relax the value type to [byte], but
> also require implementations to null-terminate binary values regardless.
> The C++ Flatbuffers implementation does this already [1] (though not the
> Java one [2]). Old implementations validating UTF8-ness would still be
> unable to read anything with binary metadata (which is the case they're
> already in), but implementations just validating for null-termination would
> continue to work. And again, I believe this is in-spec for Flatbuffers
> since string lengths don't include the null terminator.
> > >
> > > [1]:
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
> > > [2]:
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
> > > Though it seems we could emulate C++ here by calling addByte(0)
> ourselves before createByteVector.
> >
> > Would that allow for both forwards and backwards compatibility when the
> > metadata is valid UTF8?
> >
> > Regards
> >
> > Antoine.
> >
> >
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Phillip Cloud <cp...@gmail.com>.
I don't think memcpy is feasible. The bytes may be different for different
languages.

Many languages' compilers reorder struct fields and pad structs for
efficiency reasons so the bytes in metadata coming from language X may be
meaningless to language Y.


On Wed, Feb 9, 2022, 08:41 Dewey Dunnington <de...@voltrondata.com> wrote:

> I'm new to this discussion and Arrow generally and appreciate Joris
> bringing this up. While the forward- and backward-compatability of this is
> over my head, I think that it is important to provide a path for extension
> types to have serialized metadata as binary because the alternatives are
> (in my opinion) not great and might lead to extension developers inventing
> their own key/value encoding.
>
> With binary on the table there are options like protobuf, flatbuffers, or
> memcpy of a struct that extension type implementations can use depending on
> the complexity of the extension metadata that needs to be communicated and
> whether or not the metadata is intended to be used in compiled code (i.e.,
> are those values needed to do anything useful with the bytes in the
> buffers?).
>
> Without binary on the table, an extension type implementation has to vendor
> in a JSON reader/writer, vendor in a base64 encoder/decoder, or invent some
> custom serialization that's easier to parse than either of those.
>
> Cheers,
>
> -dewey
>
> On Fri, Feb 4, 2022 at 7:06 AM Joris Van den Bossche <
> jorisvandenbossche@gmail.com> wrote:
>
> > Reviving this thread, I don't think anything happened in the meantime on
> > this topic?
> >
> > From rereading the thread, it seems David mentioned two possible ideas:
> > - A new [byte] binary_value field in the existing KeyValue type, next to
> > the existing string value field. And if you have valid string metadata,
> the
> > same value can be put in both fields by reusing the offset
> > - Relax the existing string value field to [byte], but also require
> > implementations to null-terminate binary values regardless to ensure
> > compatibility for old readers
> >  - And in addition to those options, there is also the safest option of
> > adding a new BinaryKeyValue type and add that next to KeyValue everywhere
> > this is used.
> >
> > And we need to further assess which option is viable given the
> > forward/backward compatibility constraints?
> >
> > I want to point out though that in pyarrow we already _do_ use binary
> > metadata for extension types for several years. The built-in
> > PyExtensionType class uses a pickle dump as the serialized metadata (so
> > which is put in the ARROW:extension:metadata key in the key-value
> metadata
> > of the Field), and the bytes from a pickle dump are certainly not valid
> > UTF-8 I think.
> > I don't know how much this is used in the wild. In pandas, our arrow
> > extension types don't use this pickle-based PyExtensionType but define
> > their own serialization which currently returns a valid UTF-8 string. But
> > for example the Ray project seems to use PyExtensionType (
> >
> >
> https://docs.ray.io/en/latest/_modules/ray/data/extensions/tensor_extension.html#ArrowTensorType
> > ).
> >
> > Joris
> >
> > On Thu, 2 Sept 2021 at 07:01, Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> > > It still sounds like adding a new type might be the safest approach
> (and
> > > marking the old type as discouraged).
> > >
> > > On Mon, Aug 23, 2021 at 11:18 AM David Li <li...@apache.org> wrote:
> > >
> > > > I believe so.
> > > >
> > > > The encoding of a string in Flatbuffers is [byte] with a null
> > terminator
> > > > not included in the length, so old files should still be readable
> (they
> > > > would simply not see the terminator anymore). And conversely,
> > continuing
> > > to
> > > > write the null terminator means new files should still be readable by
> > > > existing implementations, so long as they don't have binary metadata
> > (but
> > > > we could increment the metadata version before allowing that). There
> > > still
> > > > remains the issue of non-UTF8 data inside the "string" value; on that
> > > > point, this is a non-answer (or I guess, it canonicalizes the current
> > > > unintentional behavior).
> > > >
> > > > It would complicate the Java API as that currently works with
> Strings,
> > > but
> > > > that will be an issue for any attempt to add binary metadata.
> > > >
> > > > -David
> > > >
> > > > On Mon, Aug 23, 2021, at 13:40, Antoine Pitrou wrote:
> > > > >
> > > > > Le 23/08/2021 à 17:52, David Li a écrit :
> > > > > > Another way forward might be to relax the value type to [byte],
> but
> > > > also require implementations to null-terminate binary values
> > regardless.
> > > > The C++ Flatbuffers implementation does this already [1] (though not
> > the
> > > > Java one [2]). Old implementations validating UTF8-ness would still
> be
> > > > unable to read anything with binary metadata (which is the case
> they're
> > > > already in), but implementations just validating for null-termination
> > > would
> > > > continue to work. And again, I believe this is in-spec for
> Flatbuffers
> > > > since string lengths don't include the null terminator.
> > > > > >
> > > > > > [1]:
> > > >
> > >
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
> > > > > > [2]:
> > > >
> > >
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
> > > > > > Though it seems we could emulate C++ here by calling addByte(0)
> > > > ourselves before createByteVector.
> > > > >
> > > > > Would that allow for both forwards and backwards compatibility when
> > the
> > > > > metadata is valid UTF8?
> > > > >
> > > > > Regards
> > > > >
> > > > > Antoine.
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Dewey Dunnington <de...@voltrondata.com>.
I'm new to this discussion and Arrow generally and appreciate Joris
bringing this up. While the forward- and backward-compatability of this is
over my head, I think that it is important to provide a path for extension
types to have serialized metadata as binary because the alternatives are
(in my opinion) not great and might lead to extension developers inventing
their own key/value encoding.

With binary on the table there are options like protobuf, flatbuffers, or
memcpy of a struct that extension type implementations can use depending on
the complexity of the extension metadata that needs to be communicated and
whether or not the metadata is intended to be used in compiled code (i.e.,
are those values needed to do anything useful with the bytes in the
buffers?).

Without binary on the table, an extension type implementation has to vendor
in a JSON reader/writer, vendor in a base64 encoder/decoder, or invent some
custom serialization that's easier to parse than either of those.

Cheers,

-dewey

On Fri, Feb 4, 2022 at 7:06 AM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Reviving this thread, I don't think anything happened in the meantime on
> this topic?
>
> From rereading the thread, it seems David mentioned two possible ideas:
> - A new [byte] binary_value field in the existing KeyValue type, next to
> the existing string value field. And if you have valid string metadata, the
> same value can be put in both fields by reusing the offset
> - Relax the existing string value field to [byte], but also require
> implementations to null-terminate binary values regardless to ensure
> compatibility for old readers
>  - And in addition to those options, there is also the safest option of
> adding a new BinaryKeyValue type and add that next to KeyValue everywhere
> this is used.
>
> And we need to further assess which option is viable given the
> forward/backward compatibility constraints?
>
> I want to point out though that in pyarrow we already _do_ use binary
> metadata for extension types for several years. The built-in
> PyExtensionType class uses a pickle dump as the serialized metadata (so
> which is put in the ARROW:extension:metadata key in the key-value metadata
> of the Field), and the bytes from a pickle dump are certainly not valid
> UTF-8 I think.
> I don't know how much this is used in the wild. In pandas, our arrow
> extension types don't use this pickle-based PyExtensionType but define
> their own serialization which currently returns a valid UTF-8 string. But
> for example the Ray project seems to use PyExtensionType (
>
> https://docs.ray.io/en/latest/_modules/ray/data/extensions/tensor_extension.html#ArrowTensorType
> ).
>
> Joris
>
> On Thu, 2 Sept 2021 at 07:01, Micah Kornfield <em...@gmail.com>
> wrote:
>
> > It still sounds like adding a new type might be the safest approach (and
> > marking the old type as discouraged).
> >
> > On Mon, Aug 23, 2021 at 11:18 AM David Li <li...@apache.org> wrote:
> >
> > > I believe so.
> > >
> > > The encoding of a string in Flatbuffers is [byte] with a null
> terminator
> > > not included in the length, so old files should still be readable (they
> > > would simply not see the terminator anymore). And conversely,
> continuing
> > to
> > > write the null terminator means new files should still be readable by
> > > existing implementations, so long as they don't have binary metadata
> (but
> > > we could increment the metadata version before allowing that). There
> > still
> > > remains the issue of non-UTF8 data inside the "string" value; on that
> > > point, this is a non-answer (or I guess, it canonicalizes the current
> > > unintentional behavior).
> > >
> > > It would complicate the Java API as that currently works with Strings,
> > but
> > > that will be an issue for any attempt to add binary metadata.
> > >
> > > -David
> > >
> > > On Mon, Aug 23, 2021, at 13:40, Antoine Pitrou wrote:
> > > >
> > > > Le 23/08/2021 à 17:52, David Li a écrit :
> > > > > Another way forward might be to relax the value type to [byte], but
> > > also require implementations to null-terminate binary values
> regardless.
> > > The C++ Flatbuffers implementation does this already [1] (though not
> the
> > > Java one [2]). Old implementations validating UTF8-ness would still be
> > > unable to read anything with binary metadata (which is the case they're
> > > already in), but implementations just validating for null-termination
> > would
> > > continue to work. And again, I believe this is in-spec for Flatbuffers
> > > since string lengths don't include the null terminator.
> > > > >
> > > > > [1]:
> > >
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
> > > > > [2]:
> > >
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
> > > > > Though it seems we could emulate C++ here by calling addByte(0)
> > > ourselves before createByteVector.
> > > >
> > > > Would that allow for both forwards and backwards compatibility when
> the
> > > > metadata is valid UTF8?
> > > >
> > > > Regards
> > > >
> > > > Antoine.
> > > >
> > > >
> > >
> >
>

Re: [DISCUSS] Binary Values in Key value pairs WAS: Re: [INFO_REQUEST][FLIGHT] - Dynamic schema changes in ArrowFlight streams

Posted by Joris Van den Bossche <jo...@gmail.com>.
Reviving this thread, I don't think anything happened in the meantime on
this topic?

From rereading the thread, it seems David mentioned two possible ideas:
- A new [byte] binary_value field in the existing KeyValue type, next to
the existing string value field. And if you have valid string metadata, the
same value can be put in both fields by reusing the offset
- Relax the existing string value field to [byte], but also require
implementations to null-terminate binary values regardless to ensure
compatibility for old readers
 - And in addition to those options, there is also the safest option of
adding a new BinaryKeyValue type and add that next to KeyValue everywhere
this is used.

And we need to further assess which option is viable given the
forward/backward compatibility constraints?

I want to point out though that in pyarrow we already _do_ use binary
metadata for extension types for several years. The built-in
PyExtensionType class uses a pickle dump as the serialized metadata (so
which is put in the ARROW:extension:metadata key in the key-value metadata
of the Field), and the bytes from a pickle dump are certainly not valid
UTF-8 I think.
I don't know how much this is used in the wild. In pandas, our arrow
extension types don't use this pickle-based PyExtensionType but define
their own serialization which currently returns a valid UTF-8 string. But
for example the Ray project seems to use PyExtensionType (
https://docs.ray.io/en/latest/_modules/ray/data/extensions/tensor_extension.html#ArrowTensorType
).

Joris

On Thu, 2 Sept 2021 at 07:01, Micah Kornfield <em...@gmail.com> wrote:

> It still sounds like adding a new type might be the safest approach (and
> marking the old type as discouraged).
>
> On Mon, Aug 23, 2021 at 11:18 AM David Li <li...@apache.org> wrote:
>
> > I believe so.
> >
> > The encoding of a string in Flatbuffers is [byte] with a null terminator
> > not included in the length, so old files should still be readable (they
> > would simply not see the terminator anymore). And conversely, continuing
> to
> > write the null terminator means new files should still be readable by
> > existing implementations, so long as they don't have binary metadata (but
> > we could increment the metadata version before allowing that). There
> still
> > remains the issue of non-UTF8 data inside the "string" value; on that
> > point, this is a non-answer (or I guess, it canonicalizes the current
> > unintentional behavior).
> >
> > It would complicate the Java API as that currently works with Strings,
> but
> > that will be an issue for any attempt to add binary metadata.
> >
> > -David
> >
> > On Mon, Aug 23, 2021, at 13:40, Antoine Pitrou wrote:
> > >
> > > Le 23/08/2021 à 17:52, David Li a écrit :
> > > > Another way forward might be to relax the value type to [byte], but
> > also require implementations to null-terminate binary values regardless.
> > The C++ Flatbuffers implementation does this already [1] (though not the
> > Java one [2]). Old implementations validating UTF8-ness would still be
> > unable to read anything with binary metadata (which is the case they're
> > already in), but implementations just validating for null-termination
> would
> > continue to work. And again, I believe this is in-spec for Flatbuffers
> > since string lengths don't include the null terminator.
> > > >
> > > > [1]:
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/include/flatbuffers/flatbuffers.h#L1612-L1670
> > > > [2]:
> >
> https://github.com/google/flatbuffers/blob/273f6084e55285e26ff8b4cdd55a9c0e2d9a48b7/java/com/google/flatbuffers/FlatBufferBuilder.java#L594-L637
> > > > Though it seems we could emulate C++ here by calling addByte(0)
> > ourselves before createByteVector.
> > >
> > > Would that allow for both forwards and backwards compatibility when the
> > > metadata is valid UTF8?
> > >
> > > Regards
> > >
> > > Antoine.
> > >
> > >
> >
>