Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2022/08/17 16:09:18 UTC

DISCUSS: [Format] Rules and procedures for Canonical extension types

Hello all,

The Arrow format has support for extension types, but there's no 
official way to agree across implementations on well-known extension types.

This issue has come up a couple times with people wanting to implement 
support for types such as JSON or UUID in order to enable better 
interoperability with third-party systems such as Parquet or databases.

I think it's time to discuss and decide how we should progressively 
standardize some well-known, "canonical", extension types.


I would tentatively propose the following rules:

* Canonical extension types are described in a separate document under 
the format specifications directory: 
https://github.com/apache/arrow/tree/master/docs/source/format (note 
this gets turned into HTML docs by Sphinx => 
https://arrow.apache.org/docs/index.html)

* Each canonical extension type requires a separate discussion and vote 
on the mailing-list

* The specification text to be added *must* follow these requirements

1) It *must* have a well-defined name starting with "ARROW:"
2) Its parameters, if any, *must* be described in the proposal
3) Its serialization *must* be described in the proposal and should not 
require undue work or unusual software dependencies (for example, a 
trivial custom text format or JSON would be acceptable)
4) Its expected semantics *should* be described as well and any 
potential ambiguities or pain points addressed or at least mentioned

* The extension type *should* have one implementation submitted; 
preferably two if non-trivial (for example if parameterized)
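
To make requirements 1) to 3) more concrete, here is a minimal sketch of
how a parameterized extension type could be described through field
metadata. The two metadata keys are the ones the Arrow columnar format
defines for extension types; everything else is an assumption made for
illustration only: the type name "myorg.tensor", its "shape" parameter,
the helper function names, and the choice of compact JSON as the
serialization.

```python
import json

# Custom metadata keys defined by the Arrow columnar format for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_METADATA_KEY = "ARROW:extension:metadata"


def make_extension_field_metadata(name, params):
    """Build the field metadata entries for an extension type.

    Hypothetical helper: `params` is serialized as compact JSON with
    sorted keys so that the same parameters always give the same string.
    """
    if not name:
        raise ValueError("extension type name must be non-empty")
    serialized = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return {EXT_NAME_KEY: name, EXT_METADATA_KEY: serialized}


def read_extension_field_metadata(metadata):
    """Recover (name, params) from the field metadata built above."""
    name = metadata[EXT_NAME_KEY]
    # A missing or empty metadata string is treated as "no parameters".
    params = json.loads(metadata.get(EXT_METADATA_KEY) or "{}")
    return name, params


# "myorg.tensor" and its "shape" parameter are made up for this example.
meta = make_extension_field_metadata("myorg.tensor", {"shape": [2, 3]})
print(meta)
print(read_extension_field_metadata(meta))
```

A reader that does not recognize the name can ignore both keys and fall
back to the storage type, which is why a simple text or JSON
serialization keeps the cost low for implementations.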


Feel free to comment.

Regards

Antoine.

Re: DISCUSS: [Format] Rules and procedures for Canonical extension types

Posted by Pradeep Gollakota <pg...@google.com.INVALID>.
+1. The proposal looks good.

I'm happy to provide the first such document for JSON type to use as a test
vehicle.

On Wed, Aug 17, 2022 at 12:46 PM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> +1 on the overall proposal, documenting those in a central place sounds
> good to me.
>
> On Wed, 17 Aug 2022 at 18:10, Antoine Pitrou <an...@python.org> wrote:
>
> >
> > ....
> >
> > * The specification text to be added *must* follow these requirements
> >
> > 1) It *must* have a well-defined name starting with "ARROW:"
> >
>
> One remark on the specific naming convention: our documentation (
> https://arrow.apache.org/docs/format/Columnar.html#extension-types)
> currently recommends this kind of namespacing as well, but uses a
> "myorg.name_of_type" pattern as example. For the extension types that I am
> aware of (helped implementing), we followed that (for example, in pandas we
> define "pandas.interval" and "pandas.period" extension types, and in
> geoarrow
> <https://github.com/geopandas/geo-arrow-spec/blob/main/extension-types.md>
> we have "geoarrow.point", "geoarrow.polygon", etc).
> I don't have a strong opinion here, but so we can also continue using that
> pattern for the canonical types as well: "arrow.<type>" (or
> "org.apache.arrow.<type>" as mentioned during the sync meeting).
>
> Joris
>


-- 
Pradeep

Re: DISCUSS: [Format] Rules and procedures for Canonical extension types

Posted by Wes McKinney <we...@gmail.com>.
+1 to this proposal. It would be great to use the JSON type as a
crash-test dummy to work out the kinks in the process, but I think
there are meaningful benefits (Parquet round-tripping) to getting this
work under way.

On Wed, Aug 24, 2022 at 11:22 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 17/08/2022 à 18:45, Joris Van den Bossche a écrit :
> > +1 on the overall proposal, documenting those in a central place sounds
> > good to me.
> >
> > On Wed, 17 Aug 2022 at 18:10, Antoine Pitrou <an...@python.org> wrote:
> >
> >>
> >> ....
> >>
> >> * The specification text to be added *must* follow these requirements
> >>
> >> 1) It *must* have a well-defined name starting with "ARROW:"
> >>
> >
> > One remark on the specific naming convention: our documentation (
> > https://arrow.apache.org/docs/format/Columnar.html#extension-types)
> > currently recommends this kind of namespacing as well, but uses a
> > "myorg.name_of_type" pattern as example. For the extension types that I am
> > aware of (helped implementing), we followed that (for example, in pandas we
> > define "pandas.interval" and "pandas.period" extension types, and in
> > geoarrow
> > <https://github.com/geopandas/geo-arrow-spec/blob/main/extension-types.md>
> > we have "geoarrow.point", "geoarrow.polygon", etc).
> > I don't have a strong opinion here, but so we can also continue using that
> > pattern for the canonical types as well: "arrow.<type>" (or
> > "org.apache.arrow.<type>" as mentioned during the sync meeting).
>
> Point taken. I will adapt the proposal to the "org.apache.arrow."
> convention.
>
> Regards
>
> Antoine.

Re: DISCUSS: [Format] Rules and procedures for Canonical extension types

Posted by Antoine Pitrou <an...@python.org>.
Le 17/08/2022 à 18:45, Joris Van den Bossche a écrit :
> +1 on the overall proposal, documenting those in a central place sounds
> good to me.
> 
> On Wed, 17 Aug 2022 at 18:10, Antoine Pitrou <an...@python.org> wrote:
> 
>>
>> ....
>>
>> * The specification text to be added *must* follow these requirements
>>
>> 1) It *must* have a well-defined name starting with "ARROW:"
>>
> 
> One remark on the specific naming convention: our documentation (
> https://arrow.apache.org/docs/format/Columnar.html#extension-types)
> currently recommends this kind of namespacing as well, but uses a
> "myorg.name_of_type" pattern as example. For the extension types that I am
> aware of (helped implementing), we followed that (for example, in pandas we
> define "pandas.interval" and "pandas.period" extension types, and in
> geoarrow
> <https://github.com/geopandas/geo-arrow-spec/blob/main/extension-types.md>
> we have "geoarrow.point", "geoarrow.polygon", etc).
> I don't have a strong opinion here, but so we can also continue using that
> pattern for the canonical types as well: "arrow.<type>" (or
> "org.apache.arrow.<type>" as mentioned during the sync meeting).

Point taken. I will adapt the proposal to the "org.apache.arrow." 
convention.
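
As an illustration of the namespacing patterns discussed in this thread,
here is a small sketch that splits an extension type name into a
namespace and a type name, and checks whether the namespace is one of
the two candidates mentioned for canonical types. The function names and
the example names "arrow.json" and "org.apache.arrow.json" are
hypothetical; no such types have been voted on at this point.

```python
# The two namespace candidates mentioned in this thread for canonical types.
CANONICAL_NAMESPACES = ("arrow", "org.apache.arrow")


def split_extension_name(name):
    """Split 'namespace.type' into (namespace, type) on the last dot."""
    namespace, sep, type_name = name.rpartition(".")
    if not sep or not namespace or not type_name:
        raise ValueError(f"not a namespaced extension name: {name!r}")
    return namespace, type_name


def is_canonical(name):
    """Return True if the name lives under a candidate canonical namespace."""
    namespace, _ = split_extension_name(name)
    return namespace in CANONICAL_NAMESPACES


for name in ("pandas.interval", "geoarrow.point",
             "arrow.json", "org.apache.arrow.json"):
    print(name, split_extension_name(name), is_canonical(name))
```

Splitting on the last dot keeps a multi-part namespace such as
"org.apache.arrow" intact, so both candidate conventions can be handled
by the same check.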

Regards

Antoine.

Re: DISCUSS: [Format] Rules and procedures for Canonical extension types

Posted by Joris Van den Bossche <jo...@gmail.com>.
+1 on the overall proposal, documenting those in a central place sounds
good to me.

On Wed, 17 Aug 2022 at 18:10, Antoine Pitrou <an...@python.org> wrote:

>
> ....
>
> * The specification text to be added *must* follow these requirements
>
> 1) It *must* have a well-defined name starting with "ARROW:"
>

One remark on the specific naming convention: our documentation (
https://arrow.apache.org/docs/format/Columnar.html#extension-types)
currently recommends this kind of namespacing as well, but uses a
"myorg.name_of_type" pattern as an example. For the extension types that I am
aware of (and helped implement), we followed that (for example, in pandas we
define "pandas.interval" and "pandas.period" extension types, and in
geoarrow
<https://github.com/geopandas/geo-arrow-spec/blob/main/extension-types.md>
we have "geoarrow.point", "geoarrow.polygon", etc.).
I don't have a strong opinion here, but we could also continue using that
pattern for the canonical types: "arrow.<type>" (or
"org.apache.arrow.<type>" as mentioned during the sync meeting).

Joris