Posted to dev@arrow.apache.org by Antoine Pitrou <an...@python.org> on 2022/08/17 16:09:18 UTC
DISCUSS: [Format] Rules and procedures for Canonical extension types
Hello all,
The Arrow format has support for extension types, but there's no
official way to agree across implementations on well-known extension types.
This issue has come up a couple of times with people wanting to implement
support for types such as JSON or UUID in order to enable better
interoperability with third-party systems such as Parquet or databases.
I think it's time to discuss and decide how we should progressively
standardize some well-known, "canonical", extension types.
I would tentatively propose the following rules:
* Canonical extension types are described in a separate document under
the format specifications directory:
https://github.com/apache/arrow/tree/master/docs/source/format (note
this gets turned into HTML docs by Sphinx =>
https://arrow.apache.org/docs/index.html)
* Each canonical extension type requires a separate discussion and vote
on the mailing-list
* The specification text to be added *must* follow these requirements:
1) It *must* have a well-defined name starting with "ARROW:"
2) Its parameters, if any, *must* be described in the proposal
3) Its serialization *must* be described in the proposal and should not
require undue work or unusual software dependencies (for example, a
trivial custom text format or JSON would be acceptable)
4) Its expected semantics *should* be described as well and any
potential ambiguities or pain points addressed or at least mentioned
* The extension type *should* have one implementation submitted;
preferably two if non-trivial (for example if parameterized)
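
To make requirements 1) to 3) concrete, here is a minimal sketch of how a
parameterized extension type could be serialized with JSON. It relies on the
two field-metadata keys that the Arrow columnar format reserves for extension
types ("ARROW:extension:name" and "ARROW:extension:metadata"). The type name
"ARROW:example_type", the "unit" parameter and both helper functions are
assumptions made up for illustration; they are not part of this proposal or of
any Arrow library.

```python
import json

# Field-metadata keys reserved by the Arrow columnar format for extension types.
EXT_NAME_KEY = "ARROW:extension:name"
EXT_METADATA_KEY = "ARROW:extension:metadata"


def serialize_extension(name: str, params: dict) -> dict:
    """Build field metadata for an extension type (hypothetical helper)."""
    # sort_keys keeps the serialized form deterministic across implementations.
    return {
        EXT_NAME_KEY: name,
        EXT_METADATA_KEY: json.dumps(params, sort_keys=True),
    }


def deserialize_extension(field_metadata: dict) -> tuple:
    """Recover (name, params) from metadata written by serialize_extension."""
    name = field_metadata[EXT_NAME_KEY]
    # A type without parameters may carry empty or missing metadata.
    params = json.loads(field_metadata.get(EXT_METADATA_KEY) or "{}")
    return name, params


# Hypothetical type name and parameter, for illustration only.
meta = serialize_extension("ARROW:example_type", {"unit": "ms"})
name, params = deserialize_extension(meta)
```

JSON is used here only because the proposal names it as an acceptable
serialization; a trivial custom text format would satisfy the same rule.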
Feel free to comment.
Regards
Antoine.
Re: DISCUSS: [Format] Rules and procedures for Canonical extension types
Posted by Pradeep Gollakota <pg...@google.com.INVALID>.
+1. The proposal looks good.
I'm happy to provide the first such document for the JSON type to use as a test
vehicle.
On Wed, Aug 17, 2022 at 12:46 PM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:
> +1 on the overall proposal, documenting those in a central place sounds
> good to me.
>
> On Wed, 17 Aug 2022 at 18:10, Antoine Pitrou <an...@python.org> wrote:
>
> >
> > ....
> >
> > * The specification text to be added *must* follow these requirements
> >
> > 1) It *must* have a well-defined name starting with "ARROW:"
> >
>
> One remark on the specific naming convention: our documentation (
> https://arrow.apache.org/docs/format/Columnar.html#extension-types)
> currently recommends this kind of namespacing as well, but uses a
> "myorg.name_of_type" pattern as an example. For the extension types that I am
> aware of (and helped implement), we followed that (for example, in pandas we
> define "pandas.interval" and "pandas.period" extension types, and in
> geoarrow
> <https://github.com/geopandas/geo-arrow-spec/blob/main/extension-types.md>
> we have "geoarrow.point", "geoarrow.polygon", etc).
> I don't have a strong opinion here, but we could also continue using that
> pattern for the canonical types: "arrow.<type>" (or
> "org.apache.arrow.<type>" as mentioned during the sync meeting).
>
> Joris
>
--
Pradeep
Re: DISCUSS: [Format] Rules and procedures for Canonical extension types
Posted by Wes McKinney <we...@gmail.com>.
+1 to this proposal. It would be great to use the JSON type as a crash-test
dummy to work out the kinks in the process, but I think there are
meaningful benefits (Parquet round-tripping) to getting this work
under way.
On Wed, Aug 24, 2022 at 11:22 AM Antoine Pitrou <an...@python.org> wrote:
>
>
> Le 17/08/2022 à 18:45, Joris Van den Bossche a écrit :
> > +1 on the overall proposal, documenting those in a central place sounds
> > good to me.
> >
> > On Wed, 17 Aug 2022 at 18:10, Antoine Pitrou <an...@python.org> wrote:
> >
> >>
> >> ....
> >>
> >> * The specification text to be added *must* follow these requirements
> >>
> >> 1) It *must* have a well-defined name starting with "ARROW:"
> >>
> >
> > One remark on the specific naming convention: our documentation (
> > https://arrow.apache.org/docs/format/Columnar.html#extension-types)
> > currently recommends this kind of namespacing as well, but uses a
> > "myorg.name_of_type" pattern as an example. For the extension types that I am
> > aware of (and helped implement), we followed that (for example, in pandas we
> > define "pandas.interval" and "pandas.period" extension types, and in
> > geoarrow
> > <https://github.com/geopandas/geo-arrow-spec/blob/main/extension-types.md>
> > we have "geoarrow.point", "geoarrow.polygon", etc).
> > I don't have a strong opinion here, but we could also continue using that
> > pattern for the canonical types: "arrow.<type>" (or
> > "org.apache.arrow.<type>" as mentioned during the sync meeting).
>
> Point taken. I will adapt the proposal to the "org.apache.arrow."
> convention.
>
> Regards
>
> Antoine.
Re: DISCUSS: [Format] Rules and procedures for Canonical extension types
Posted by Antoine Pitrou <an...@python.org>.
Le 17/08/2022 à 18:45, Joris Van den Bossche a écrit :
> +1 on the overall proposal, documenting those in a central place sounds
> good to me.
>
> On Wed, 17 Aug 2022 at 18:10, Antoine Pitrou <an...@python.org> wrote:
>
>>
>> ....
>>
>> * The specification text to be added *must* follow these requirements
>>
>> 1) It *must* have a well-defined name starting with "ARROW:"
>>
>
> One remark on the specific naming convention: our documentation (
> https://arrow.apache.org/docs/format/Columnar.html#extension-types)
> currently recommends this kind of namespacing as well, but uses a
> "myorg.name_of_type" pattern as an example. For the extension types that I am
> aware of (and helped implement), we followed that (for example, in pandas we
> define "pandas.interval" and "pandas.period" extension types, and in
> geoarrow
> <https://github.com/geopandas/geo-arrow-spec/blob/main/extension-types.md>
> we have "geoarrow.point", "geoarrow.polygon", etc).
> I don't have a strong opinion here, but we could also continue using that
> pattern for the canonical types: "arrow.<type>" (or
> "org.apache.arrow.<type>" as mentioned during the sync meeting).
Point taken. I will adapt the proposal to the "org.apache.arrow."
convention.
Regards
Antoine.
Re: DISCUSS: [Format] Rules and procedures for Canonical extension types
Posted by Joris Van den Bossche <jo...@gmail.com>.
+1 on the overall proposal, documenting those in a central place sounds
good to me.
On Wed, 17 Aug 2022 at 18:10, Antoine Pitrou <an...@python.org> wrote:
>
> ....
>
> * The specification text to be added *must* follow these requirements
>
> 1) It *must* have a well-defined name starting with "ARROW:"
>
One remark on the specific naming convention: our documentation (
https://arrow.apache.org/docs/format/Columnar.html#extension-types)
currently recommends this kind of namespacing as well, but uses a
"myorg.name_of_type" pattern as an example. For the extension types that I am
aware of (and helped implement), we followed that (for example, in pandas we
define "pandas.interval" and "pandas.period" extension types, and in
geoarrow
<https://github.com/geopandas/geo-arrow-spec/blob/main/extension-types.md>
we have "geoarrow.point", "geoarrow.polygon", etc).
I don't have a strong opinion here, but we could also continue using that
pattern for the canonical types: "arrow.<type>" (or
"org.apache.arrow.<type>" as mentioned during the sync meeting).
Joris
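
As a rough illustration of the dotted namespacing pattern discussed in this
message, the sketch below splits an extension type name into its namespace and
checks whether it falls under a reserved canonical namespace. The helpers
`namespace_of` and `is_canonical`, and the type name "arrow.json", are
hypothetical; which prefix ends up reserved ("arrow" or "org.apache.arrow") is
left as a parameter.

```python
def namespace_of(extension_name: str) -> str:
    """Return everything before the last dot, or "" if the name has no dot."""
    prefix, sep, _ = extension_name.rpartition(".")
    return prefix if sep else ""


def is_canonical(extension_name: str, canonical_namespace: str = "arrow") -> bool:
    """Check whether a name lives directly under the canonical namespace."""
    return namespace_of(extension_name) == canonical_namespace


# Third-party names from this thread keep their own namespace.
pandas_ns = namespace_of("pandas.interval")
geoarrow_ns = namespace_of("geoarrow.point")

# Hypothetical canonical names under either candidate prefix.
short_form = is_canonical("arrow.json")
long_form = is_canonical("org.apache.arrow.json", "org.apache.arrow")
```

Matching on the full namespace rather than a plain string prefix avoids
treating an unrelated name such as "arrowish.thing" as canonical.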