Posted to dev@arrow.apache.org by Weston Pace <we...@gmail.com> on 2022/04/18 23:52:20 UTC

[DISCUSS] Policies for Substrait extensions

As we add more capability to the C++ Substrait consumer, we are
finding spots where extensions to the Substrait specification are
needed.  I'm wondering to what degree these extensions are part of the
Arrow project and to what degree they are part of a specific
implementation.  I would appreciate any guidance or opinions.

I'd like to give a few specific examples of these extensions as I
think the way we handle it may depend on the nature of the specific
extension.

1. Arrow-specific and applicable to all implementations

The Substrait spec has its own type system[1] which does not include
some Arrow types (e.g. unsigned integers).  An "extension" in this
case is mostly a (URI namespaced) name that producers and consumers
can agree on (e.g.
https://arrow.apache.org/substrait/v1/types.yaml#uint8).  In the
future there is the potential for some additional metadata to
accompany each type (e.g. a way to express that types are variations
of existing types) but this hasn't yet been well defined.
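
As a rough illustration of the naming scheme, a consumer could split
such a reference into the URI of the extension file and the type name
that follows the "#".  This is only a sketch: the helper name
split_extension_ref is hypothetical and is not part of any Arrow or
Substrait API.

```python
from urllib.parse import urldefrag


def split_extension_ref(ref):
    """Split 'URI#name' into (uri, name).

    Hypothetical helper for illustration only; not an Arrow or Substrait API.
    """
    uri, name = urldefrag(ref)
    if not name:
        raise ValueError(f"no type name after '#' in {ref!r}")
    return uri, name


uri, name = split_extension_ref(
    "https://arrow.apache.org/substrait/v1/types.yaml#uint8"
)
print(uri)   # the extension file that producers and consumers agree on
print(name)  # the type name within that file
```

The point is that the pair (URI, name) is the whole identity of the
type; nothing else needs to be exchanged for both sides to agree.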

I think this extension, though rather simple, will be of interest to
all users of Arrow, as well as developers of Arrow implementations
(e.g. consumers), and so the impact is pretty far-reaching.  However,
given the relative simplicity, I don't know that we need to do much
beyond GitHub PRs (e.g. we don't need two implementations to adopt
this, etc.).

At the moment there is a version at [2] which I will propose be the
official implementation for the Apache Arrow project (although it
needs a tiny bit of cleanup to remove a comment reference to C++).
Assuming the discussion doesn't raise any significant concerns in the
next week or so I'll propose a vote to adopt this.

Other things could fall into this category.  For example, we may need
a file format extension for Arrow IPC files (even if [3] merges we
still would want to extend that once Substrait supports writes).  We
may also want to define sink and source relations for the Arrow C
stream interface.  For anything in this category I think we should
have a single Arrow-supported extension and vote on acceptance of the
initial implementation (as well as on criteria for making updates).

2. Non-Arrow specific features with wide support across implementations

An example here is a CSV file format extension.  CSV is an interesting
format as it is not very self-describing and will need a rather
extensive proto message (or messages) to describe how to read and
write files.  Several implementations support reading and writing
CSVs[4] and it would seem prudent that we agree on a common
definition.  However, CSV is not something Arrow has any ownership
over.  This raises a few questions:

 * Would we use "arrow" in the extension name (protobuf extensions, as
opposed to YAML extensions, don't really have a URI but they do have a
"package name")?
 * Should we vote on an "official" standard to use across
implementations or let each implementation choose their own?
 * Could it live within an Arrow repository or would it always live
outside the Arrow repos?
 * If it lived outside the Arrow repos would we include a pointer
within the Arrow repository to the voted-on standard (assuming we vote
on a standard)?

3. Implementation-specific features

A major extension category in Substrait is extension functions.
However, these are likely to vary between implementations.  It is
possible some implementations may agree on descriptions for a common
collection of functions (e.g. geoJSON) and then these could follow the
procedures in 2.

In general, I think extension functions are likely to be specific to
individual implementations.  There wouldn't need to be any vote on
these and, in some cases, the YAML may be automatically generated
(e.g. in the C++ implementation we would probably like to
automatically generate the YAML from our function registry).
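
To make the idea of generating the YAML concrete, here is a minimal
sketch.  Everything in it is an assumption for illustration: the
REGISTRY dict stands in for a real function registry (it is not the
C++ one), and the YAML layout (scalar_functions / impls / args /
return) only loosely follows the shape of Substrait's function
extension files, so the field names should be checked against the spec.

```python
# Hypothetical registry: function name -> list of (argument types, return type).
REGISTRY = {
    "add": [(["i32", "i32"], "i32"), (["fp64", "fp64"], "fp64")],
    "negate": [(["i32"], "i32")],
}


def registry_to_yaml(registry):
    """Render a registry as YAML text (the field layout is an assumption)."""
    lines = ["scalar_functions:"]
    for name in sorted(registry):  # sorted so the output is deterministic
        lines.append(f'  - name: "{name}"')
        lines.append("    impls:")
        for args, ret in registry[name]:
            if args:
                lines.append("      - args:")
                for arg in args:
                    lines.append(f"          - value: {arg}")
            else:
                lines.append("      - args: []")
            lines.append(f"        return: {ret}")
    return "\n".join(lines) + "\n"


print(registry_to_yaml(REGISTRY))
```

Generating the file this way means the YAML cannot drift from what the
implementation actually supports, which is the main attraction.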

In addition to extension functions, I think there will likely also be
some relation extensions that are specific to a given implementation.
The YAML and proto files for these extensions could live in the
implementation's code base.

 * Should we support "arrow hosted" names for these extensions (e.g.
https://arrow.apache.org/substrait/cpp/v1/function_types.yaml)?

[1] https://substrait.io/types/simple_logical_types/
[2] https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml
[3] https://github.com/substrait-io/substrait/pull/169
[4] https://arrow.apache.org/docs/status.html#third-party-data-formats

Re: [DISCUSS] Policies for Substrait extensions

Posted by Jeroen van Straten <je...@gmail.com>.
> At the moment there is a version at [2] which I will propose be the
> official implementation for the Apache Arrow project (although it
> needs a tiny bit of cleanup to remove a comment reference to C++).
> Assuming the discussion doesn't raise any significant concerns in the
> next week or so I'll propose a vote to adopt this.

AFAICT, Substrait currently lacks a specified method for referring from
one YAML extension to another. That is, types defined in one YAML file
cannot currently be used by functions defined in another. This is a
rather obvious deficiency that I believe will be solved at one point or
another, but until that time, I would propose to rename the file to
something like "extensions.yaml" rather than "types.yaml". That way, if
we do end up needing to define functions in the same file, they won't
look out of place.

> 2. Non-Arrow specific features with wide support across
> implementations

I would argue that anything that needs at least a de facto standard due
to widespread adoption should be added to the actual standard, i.e.
added to Substrait itself. CSV especially seems like a no-brainer for
this due to its ubiquity. Until something is added to the Substrait
specification, I would argue that the extensions should be
project-specific (or specific to a small subset of projects); going
through a voting process for extensions outside the scope of just
Arrow seems to me like just adding something directly to the Substrait
specification with extra steps.

Failing that, however, I would say that if Arrow handles the voting and
adoption for a particular extension, however generic, it should be
namespaced and hosted by Arrow. Perhaps the proto file could be hosted
in the same way as the YAML file, so that other projects using the
extensions don't need to pull in all of Arrow just to get to it. For
example,

https://arrow.apache.org/substrait/v1/extensions.yaml
https://arrow.apache.org/substrait/v1/extensions.proto

> Would we use "arrow" in the extension name (protobuf extensions, as
> opposed to YAML extensions, don't really have a URI but they do have
> a "package name")?

The type URLs are simply the fully-qualified protobuf message types,
and you can nest namespaces as deeply as you like. I don't have a
strong opinion as to what format should be used (anything from
"arrow.Something" to
"org.apache.arrow.substrait.v1.extensions.foo.bar.Something"
would do), but it should be sufficiently namespaced so it won't
conflict with anything we or anyone else does now or in the foreseeable
future. "arrow" is probably unique enough as a top-level namespace
(we use the same in C++, after all), but adding something like
"substrait.v1.extensions" seems like a good idea to me.
"arrow.CSVFile" could mean a lot more things than a Substrait extension
for describing the format of a CSV file, after all.
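
A small sketch of why the deeper namespace helps: the package part of
the fully-qualified name is what keeps two messages with the same
short name apart. The helper parse_type_url is hypothetical, and the
split assumes top-level (non-nested) messages; the only protobuf
convention relied on is that the fully-qualified name is the part of a
type URL after the last "/".

```python
def parse_type_url(type_url):
    """Return (package, message) for a type URL or bare fully-qualified name.

    Hypothetical helper; assumes the message is not nested in another message.
    """
    full_name = type_url.rsplit("/", 1)[-1]  # drop any URL prefix
    package, _, message = full_name.rpartition(".")
    return package, message


short = parse_type_url("arrow.CSVFile")
scoped = parse_type_url("arrow.substrait.v1.extensions.CSVFile")

# Same short message name, different packages, so no conflict.
print(short)
print(scoped)
print(short[1] == scoped[1], short[0] == scoped[0])
```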

> It is possible some implementations may agree on descriptions for a
> common collection of functions (e.g. geoJSON) and then these could
> follow the procedures in 2.

This seems to be what [1] is for, though it's a bit of a mish-mash of
things right now.

P.S. This is my first post to the ML, so, hi all! :) I've been working
on a generic validator for Substrait plans for a while now [2], and
helped with the initial implementation of the Arrow Substrait consumer.

[1] https://github.com/substrait-io/substrait/tree/main/extensions
[2] https://github.com/substrait-io/substrait/pull/155

