Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2019/06/03 19:07:20 UTC

Re: [DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol

hi Micah,

I have just updated my PR per your comments with more examples of
extension types.

https://github.com/apache/arrow/pull/4332

Are there more comments about this? I can start a vote in a couple of
days absent further opinions.

Can someone volunteer to review David's Java PR? I would like to move
this along so we have a chance of having working extension types in
the 0.14 release. A number of people are also interested in bridging
between pandas's ExtensionArray facility (for custom DataFrame column
types [1]) and Arrow's ExtensionType.

Thanks
Wes

[1]: https://pandas.pydata.org/pandas-docs/stable/development/extending.html

On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <em...@gmail.com> wrote:
>
> Hi Wes,
> As I said, I think this approach looks good. What I'm looking for is a little more documentation and examples of how additional types would be handled. I think Tensor would be a good example. We also had questions about INET addresses previously, so maybe that would be another good illustrative example. Providing examples of serialized metadata in the docs would also be useful (clarifying that these are opaque binary blobs that will be passed along to extension type factories?).
>
> In this regard, I think it might be good to provide further recommendations for naming extension types. What do you think about recommending that organizations/projects namespace them according to some convention, so that there aren't conflicts and extensions can be shared?
>
> Thanks,
> Micah
>
>
>
> On Sat, May 18, 2019 at 12:00 PM Wes McKinney <we...@gmail.com> wrote:
>>
>>
>>
>> On Sat, May 18, 2019, 1:58 PM Wes McKinney <we...@gmail.com> wrote:
>>>
>>> Hi Micah,
>>>
>>> The use cases I'm aware of are mostly coming from proprietary applications. My idea was for the extension metadata to be as unobtrusive as possible. The only alternative as I see it would be to have an Extension value in the Type union which would be more intrusive to applications handling data for which they have no special handling. That doesn't seem desirable if there are alternatives.
>>
>>
>> The other (3rd) option would be to add an extra member to Field. This is also a bit more intrusive than having fields in the custom_metadata dictionary.
>>
>>>
>>> As an immediate use case we could use extension types to embed Tensor values in Binary arrays.
>>>
>>> Wes
>>>
>>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield <em...@gmail.com> wrote:
>>>>
>>>> Hi Wes,
>>>> This approach seems reasonable to me.  I'm a little concerned we haven't
>>>> validated many use-cases against the approach (but I don't see any obvious
>>>> flaws).
>>>>
>>>> Thanks,
>>>> Micah
>>>>
>>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <we...@gmail.com> wrote:
>>>>
>>>> > As Micah brought up, as part of this we would like to formalize the
>>>> > use of "ARROW:" as a reserved metadata key prefix. This is similar to
>>>> > Apache Avro which uses "avro." as a reserved prefix [1]. If someone
>>>> > has a different idea about what the prefix should be, I'm open to other
>>>> > ideas.
>>>> >
>>>> > [1] : https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files
>>>> >
>>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney <we...@gmail.com> wrote:
>>>> > >
>>>> > > hi folks,
>>>> > >
>>>> > > In a prior mailing list thread from February [1] I brought up some
>>>> > > work I'd done in C++ to create an API to define custom data types that
>>>> > > can be embedded in built-in Arrow logical types. These are serialized
>>>> > > through IPC by adding special fields to the `custom_metadata` member
>>>> > > of Field in the Flatbuffers metadata [2]. The idea is that if an
>>>> > > implementation does not understand the custom type, then they can
>>>> > > still interact with the underlying data if need be, or pass on the
>>>> > > extension metadata in subsequent IPC messages.
>>>> > >
>>>> > > David Li has put up a WIP PR to implement this for Java [4], so to
>>>> > > help the project move forward I think it's a good time to formalize
>>>> > > this, and if there are disagreements to hash them out now. I have just
>>>> > > opened a PR to the Arrow specification documents [3] that describes
>>>> > > the current state of C++ and also the WIP Java PR.
>>>> > >
>>>> > > Any thoughts about this? If there is consensus about this approach,
>>>> > > then I can hold a vote.
>>>> > >
>>>> > > Thanks
>>>> > > Wes
>>>> > >
>>>> > > [1]:
>>>> > https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
>>>> > > [2]: https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
>>>> > > [3]: https://github.com/apache/arrow/pull/4332
>>>> > > [4]: https://github.com/apache/arrow/pull/4251
>>>> >

Re: [DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol

Posted by Wes McKinney <we...@gmail.com>.
Great, thanks Jacques. I'll kick off a vote thread so we can hopefully
get this approved.

On Fri, Jun 7, 2019 at 3:27 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> I'm good with this. The consistent separator is a good improvement.
>
> On Thu, Jun 6, 2019 at 1:06 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hey Jacques,
> >
> > On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau <ja...@apache.org> wrote:
> > >
> > > Thanks for pushing this along. I think it is important. Sorry I'm coming
> > > late to the conversation. Couple thoughts:
> > >
> > > - Should we reconsider having this be an independent optional field as
> > > opposed to overloading custom_metadata? It avoids having the weird
> > > string prefixing behavior
> >
> > This is one option that we've discussed. The downside of this is that
> > it becomes another piece of metadata that Arrow implementations need
> > to mind when they are passing through IPC messages. The idea is that
> > "dumb" readers can simply ignore the metadata but pass it along in a
> > subsequent message. For example, suppose a simplistic data
> > service/microservice that evaluates a filter against record batches
> > coming through. There might be columns with extension types that come
> > through that the service does not recognize.
> >
> > In some implementations the custom_metadata member is preserved in
> > schemas and survives IPC round trips, but this is a feature that IMHO
> > should be implemented consistently in all Arrow implementations. For
> > example, I believe that Java drops the custom_metadata as soon as the
> > IPC protocol is parsed.
> >
> > Admittedly, this is not a huge issue, so if you had an extra member of
> > Field like
> >
> > table ExtensionType {
> >   name: string
> >   metadata: string
> > }
> >
> > ...
> >
> > table Field {
> >   ...
> >   custom_type : ExtensionType
> > }
> >
> > then that would work, too. It's more obtrusive to implementations as
> > readers that do not recognize a type should still mind this metadata
> > and pass it along in subsequent messages. If we embed in
> > custom_metadata then this happens automatically (assuming that
> > custom_metadata is preserved...)
> >
> > > - I'd be inclined to be much more stringent about type naming. Maybe even
> > > make the name multiple parts to force the issue?
> >
> > I just updated my PR https://github.com/apache/arrow/pull/4332 to say
> > also that colon ":" is the designated namespace separator and I've
> > made the metadata keys
> >
> > ARROW:extension:name
> > ARROW:extension:metadata
> >
> > As far as the actual type name, since it's application-defined, it
> > might be better to leave this up to the developer-user. If we defined
> > any "built-in extension types" (things like UUID come to mind) we
> > might want to have a pseudo-namespace like "builtin.uuid",
> > "builtin.ipv6", etc. for these
> >
> > Let me know what you think -- it would be great to start a vote on this
> > soon.
> >
> > Thanks
> > Wes

Re: [DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol

Posted by Jacques Nadeau <ja...@apache.org>.
I'm good with this. The consistent separator is a good improvement.

On Thu, Jun 6, 2019 at 1:06 PM Wes McKinney <we...@gmail.com> wrote:

> hey Jacques,
>
> On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau <ja...@apache.org> wrote:
> >
> > Thanks for pushing this along. I think it is important. Sorry I'm coming
> > late to the conversation. Couple thoughts:
> >
> > - Should we reconsider having this be an independent optional field as
> > opposed to overloading custom_metadata? It avoids having the weird
> > string prefixing behavior
>
> This is one option that we've discussed. The downside of this is that
> it becomes another piece of metadata that Arrow implementations need
> to mind when they are passing through IPC messages. The idea is that
> "dumb" readers can simply ignore the metadata but pass it along in a
> subsequent message. For example, suppose a simplistic data
> service/microservice that evaluates a filter against record batches
> coming through. There might be columns with extension types that come
> through that the service does not recognize.
>
> In some implementations the custom_metadata member is preserved in
> schemas and survives IPC round trips, but this is a feature that IMHO
> should be implemented consistently in all Arrow implementations. For
> example, I believe that Java drops the custom_metadata as soon as the
> IPC protocol is parsed.
>
> Admittedly, this is not a huge issue, so if you had an extra member of
> Field like
>
> table ExtensionType {
>   name: string
>   metadata: string
> }
>
> ...
>
> table Field {
>   ...
>   custom_type : ExtensionType
> }
>
> then that would work, too. It's more obtrusive to implementations as
> readers that do not recognize a type should still mind this metadata
> and pass it along in subsequent messages. If we embed in
> custom_metadata then this happens automatically (assuming that
> custom_metadata is preserved...)
>
> > - I'd be inclined to be much more stringent about type naming. Maybe even
> > make the name multiple parts to force the issue?
>
> I just updated my PR https://github.com/apache/arrow/pull/4332 to say
> also that colon ":" is the designated namespace separator and I've
> made the metadata keys
>
> ARROW:extension:name
> ARROW:extension:metadata
>
> As far as the actual type name, since it's application-defined, it
> might be better to leave this up to the developer-user. If we defined
> any "built-in extension types" (things like UUID come to mind) we
> might want to have a pseudo-namespace like "builtin.uuid",
> "builtin.ipv6", etc. for these
>
> Let me know what you think -- it would be great to start a vote on this
> soon.
>
> Thanks
> Wes

Re: [DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol

Posted by Wes McKinney <we...@gmail.com>.
hey Jacques,

On Thu, Jun 6, 2019 at 12:53 PM Jacques Nadeau <ja...@apache.org> wrote:
>
> Thanks for pushing this along. I think it is important. Sorry I'm coming
> late to the conversation. Couple thoughts:
>
> - Should we reconsider having this be an independent optional field as
> opposed to overloading custom_metadata? It avoids having the weird string
> prefixing behavior

This is one option that we've discussed. The downside of this is that
it becomes another piece of metadata that Arrow implementations need
to mind when they are passing through IPC messages. The idea is that
"dumb" readers can simply ignore the metadata but pass it along in a
subsequent message. For example, suppose a simplistic data
service/microservice that evaluates a filter against record batches
coming through. There might be columns with extension types that come
through that the service does not recognize.
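
To make the pass-through scenario concrete, here is a minimal sketch in
plain Python. It does not use any Arrow library: Field, RecordBatch and
filter_batch are hypothetical stand-ins, and "myorg.uuid" is an invented
extension name. The point is only that a service which never looks at
custom_metadata can still hand it on unchanged.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for an Arrow Field: name, storage type, custom_metadata."""
    name: str
    storage_type: str
    custom_metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class RecordBatch:
    """Hypothetical stand-in for a record batch: a schema plus one value list per field."""
    schema: List[Field]
    columns: Dict[str, List]


def filter_batch(batch: RecordBatch, column: str, predicate: Callable) -> RecordBatch:
    """Keep the rows where predicate(value) is true for `column`.

    The service never inspects custom_metadata. It reuses the incoming
    schema, so any extension metadata rides along untouched.
    """
    keep = [i for i, v in enumerate(batch.columns[column]) if predicate(v)]
    columns = {name: [values[i] for i in keep] for name, values in batch.columns.items()}
    return RecordBatch(schema=batch.schema, columns=columns)


incoming = RecordBatch(
    schema=[
        Field("price", "float64"),
        # A column whose extension type this service does not recognize.
        Field("id", "fixed_size_binary[16]", {
            "ARROW:extension:name": "myorg.uuid",
            "ARROW:extension:metadata": "",
        }),
    ],
    columns={"price": [1.0, 25.0, 30.0], "id": [b"a" * 16, b"b" * 16, b"c" * 16]},
)

outgoing = filter_batch(incoming, "price", lambda p: p > 10.0)
```

The filter only touches the storage values of the "price" column; the
"id" field leaves the service with the same metadata it arrived with.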

In some implementations the custom_metadata member is preserved in
schemas and survives IPC round trips, but this is a feature that IMHO
should be implemented consistently in all Arrow implementations. For
example, I believe that Java drops the custom_metadata as soon as the
IPC protocol is parsed.

Admittedly, this is not a huge issue, so if you had an extra member of
Field like

table ExtensionType {
  name: string;
  metadata: string;
}

...

table Field {
  ...
  custom_type: ExtensionType;
}

then that would work, too. It's more obtrusive to implementations, since
readers that do not recognize a type would still have to mind this metadata
and pass it along in subsequent messages. If we embed it in
custom_metadata, then this happens automatically (assuming that
custom_metadata is preserved...).

> - I'd be inclined to be much more stringent about type naming. Maybe even
> make the name multiple parts to force the issue?

I just updated my PR https://github.com/apache/arrow/pull/4332 to say
also that colon ":" is the designated namespace separator and I've
made the metadata keys

ARROW:extension:name
ARROW:extension:metadata
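
As a rough illustration of how an application might write and read these
two keys, here is a sketch in plain Python. The helper names
(attach_extension, read_extension), the "myorg.tensor" type name and the
JSON payload are all assumptions made for the example; the only things
taken from the proposal are the two key names and the reserved "ARROW:"
prefix. Treating a missing metadata key as an empty string is also an
assumption.

```python
from typing import Dict, Optional, Tuple

RESERVED_PREFIX = "ARROW:"
EXTENSION_NAME_KEY = "ARROW:extension:name"
EXTENSION_METADATA_KEY = "ARROW:extension:metadata"


def attach_extension(user_metadata: Dict[str, str], name: str, serialized: str) -> Dict[str, str]:
    """Return a new metadata dict with the two extension keys added.

    Application-defined keys are rejected if they use the reserved prefix.
    """
    for key in user_metadata:
        if key.startswith(RESERVED_PREFIX):
            raise ValueError(f"key {key!r} uses the reserved prefix {RESERVED_PREFIX!r}")
    result = dict(user_metadata)
    result[EXTENSION_NAME_KEY] = name
    result[EXTENSION_METADATA_KEY] = serialized
    return result


def read_extension(metadata: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (name, serialized metadata) for an extension field, else None."""
    name = metadata.get(EXTENSION_NAME_KEY)
    if name is None:
        return None
    # Assumption: an absent metadata key is treated as an empty payload.
    return name, metadata.get(EXTENSION_METADATA_KEY, "")


# The payload is opaque to Arrow; JSON is used here only for readability.
tagged = attach_extension({"unit": "n/a"}, "myorg.tensor", '{"shape": [2, 3]}')
```
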

As far as the actual type name, since it's application-defined, it
might be better to leave this up to the developer-user. If we defined
any "built-in extension types" (things like UUID come to mind) we
might want to have a pseudo-namespace like "builtin.uuid",
"builtin.ipv6", etc. for these

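One way to picture how application-defined names and a "builtin."
pseudo-namespace could coexist is a name-to-factory registry with a
fallback to the storage type. The sketch below is plain Python, not the
C++ or Java API; register_extension, resolve_type and the string type
descriptions are hypothetical, and "myorg.tensor" is an invented name.

```python
from typing import Callable, Dict

# Hypothetical registry: extension type name -> factory that rebuilds the
# application-level type from the storage type and serialized metadata.
_REGISTRY: Dict[str, Callable[[str, str], str]] = {}


def register_extension(name: str, factory: Callable[[str, str], str]) -> None:
    """Register a factory, refusing to overwrite an existing name."""
    if name in _REGISTRY:
        raise KeyError(f"extension type {name!r} is already registered")
    _REGISTRY[name] = factory


def resolve_type(storage_type: str, custom_metadata: Dict[str, str]) -> str:
    """Return a description of the logical type for a field.

    Unknown or absent extension names fall back to the storage type.
    """
    name = custom_metadata.get("ARROW:extension:name")
    factory = _REGISTRY.get(name) if name is not None else None
    if factory is None:
        return storage_type
    serialized = custom_metadata.get("ARROW:extension:metadata", "")
    return factory(storage_type, serialized)


register_extension("builtin.uuid", lambda storage, serialized: f"uuid<{storage}>")

known = resolve_type("fixed_size_binary[16]", {"ARROW:extension:name": "builtin.uuid"})
unknown = resolve_type("binary", {"ARROW:extension:name": "myorg.tensor"})
```

Rejecting duplicate registrations is one simple guard against the name
conflicts raised earlier in the thread; a namespacing convention would
make such collisions less likely in the first place.
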
Let me know what you think -- it would be great to start a vote on this soon.

Thanks
Wes


Re: [DISCUSS] Formalizing "extension type" metadata in the Arrow binary protocol

Posted by Jacques Nadeau <ja...@apache.org>.
Thanks for pushing this along. I think it is important. Sorry I'm coming
late to the conversation. Couple thoughts:

- Should we reconsider having this be an independent optional field as
opposed to overloading custom_metadata? It avoids having the weird string
prefixing behavior
- I'd be inclined to be much more stringent about type naming. Maybe even
make the name multiple parts to force the issue?

On Mon, Jun 3, 2019 at 12:08 PM Wes McKinney <we...@gmail.com> wrote:

> hi Micah,
>
> I have just updated my PR per your comments with more examples of
> extension types.
>
> https://github.com/apache/arrow/pull/4332
>
> Are there more comments about this? I can start a vote in a couple of
> days absent further opinions.
>
> Can someone volunteer to review David's Java PR? I would like to move
> this along so we have a chance of having working extension types in
> the 0.14 release. A number of people are also interested in bridging
> between pandas's ExtensionArray facility (for custom DataFrame column
> types [1]) and Arrow's ExtensionType
>
> Thanks
> Wes
>
> [1]:
> https://pandas.pydata.org/pandas-docs/stable/development/extending.html
>
> On Sat, May 18, 2019 at 6:25 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >
> > Hi Wes,
> > Like I said I think this approach looks good, I think what I'm looking
> for is a little more documentation/examples on how additional types would
> be handled.  I think Tensor would be a good example, we also had questions
> about INET addresses previously, maybe this would be another good
> illustrative example.  Providing examples of serialized metadata in the
> docs would be useful (clarifying that these are opaque binary blobs, that
> will be passed along to extension type factories?)
> >
> > In this regard, I think it might be good to provide further
> recommendations for the naming of extension types:  What do you think about
> recommending that organizations/projects namespace them according to some
> convention, so that there aren't conflicts and extensions can be shared?
> >
> > Thanks,
> > Micah
> >
> >
> >
> > On Sat, May 18, 2019 at 12:00 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>
> >>
> >>
> >> On Sat, May 18, 2019, 1:58 PM Wes McKinney <we...@gmail.com> wrote:
> >>>
> >>> Hi Micah,
> >>>
> >>> The use cases I'm aware of are mostly coming from proprietary
> applications. My idea was for the extension metadata to be as unobtrusive
> as possible. The only alternative as I see it would be to have an Extension
> value in the Type union which would be more intrusive to applications
> handling data for which they have no special handling. That doesn't seem
> desirable if there are alternatives.
> >>
> >>
> >> The other (3rd) option would be to add an extra member to Field. This
> is also a bit more intrusive than having fields in the custom_metadata
> dictionary.
> >>
> >>>
> >>> As an immediate use case we could use extension types to embed Tensor
> values in Binary arrays.
> >>>
> >>> Wes
> >>>
> >>> On Sat, May 18, 2019, 12:19 PM Micah Kornfield <em...@gmail.com>
> wrote:
> >>>>
> >>>> Hi Wes,
> >>>> This approach seems reasonable to me.  I'm a little concerned we
> haven't
> >>>> validated many use-cases against the approach (but I don't see any
> obvious
> >>>> flaws).
> >>>>
> >>>> Thanks,
> >>>> Micah
> >>>>
> >>>> On Fri, May 17, 2019 at 5:16 AM Wes McKinney <we...@gmail.com>
> wrote:
> >>>>
> >>>> > As Micah brought up, as part of this we would like to formalize the
> >>>> > use of "ARROW:" as a reserved metadata key prefix. This is similar
> to
> >>>> > Apache Avro which uses "avro." as a reserved prefix [1]. If someone
> >>>> > has a different idea about what the prefix should be I'm open to
> other
> >>>> > ideas
> >>>> >
> >>>> > [1] :
> https://avro.apache.org/docs/1.8.2/spec.html#Object+Container+Files
> >>>> >
> >>>> > On Thu, May 16, 2019 at 7:29 PM Wes McKinney <we...@gmail.com>
> wrote:
> >>>> > >
> >>>> > > hi folks,
> >>>> > >
> >>>> > > In a prior mailing list thread from February [1] I brought up some
> >>>> > > work I'd done in C++ to create an API to define custom data types
> that
> >>>> > > can be embedded in built-in Arrow logical types. These are
> serialized
> >>>> > > through IPC by adding special fields to the `custom_metadata`
> member
> >>>> > > of Field in the Flatbuffers metadata [2]. The idea is that if an
> >>>> > > implementation does not understand the custom type, then they can
> >>>> > > still interact with the underlying data if need be, or pass on the
> >>>> > > extension metadata in subsequent IPC messages.
> >>>> > >
> >>>> > > David Li has put up a WIP PR to implement this for Java [4], so to
> >>>> > > help the project move forward I think it's a good time to
> formalize
> >>>> > > this, and if there are disagreements to hash them out now. I have
> just
> >>>> > > opened a PR to the Arrow specification documents [3] that
> describes
> >>>> > > the current state of C++ and also the WIP Java PR.
> >>>> > >
> >>>> > > Any thought about this? If there is consensus about this solution
> >>>> > > approach then I can hold a vote.
> >>>> > >
> >>>> > > Thanks
> >>>> > > Wes
> >>>> > >
> >>>> > > [1]:
> >>>> >
> https://lists.apache.org/thread.html/f1fc039471a8a9c06f2f9600296a20d4eb3fda379b23685f809118ee@%3Cdev.arrow.apache.org%3E
> >>>> > > [2]:
> https://github.com/apache/arrow/blob/master/format/Schema.fbs#L291
> >>>> > > [3]: https://github.com/apache/arrow/pull/4332
> >>>> > > [4]: https://github.com/apache/arrow/pull/4251
> >>>> >
>