You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/04/19 00:59:58 UTC

[GitHub] [arrow] westonpace opened a new pull request, #12916: MINOR: [Format] Remove extraneous comment from extension_types.yaml

westonpace opened a new pull request, #12916:
URL: https://github.com/apache/arrow/pull/12916

   The point is described sufficiently in ARROW-15535 and I'd like to avoid mention of C++ in a implementation-agnostic file.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jvanstraten commented on pull request #12916: MINOR: [Format] Remove extraneous comment from extension_types.yaml

Posted by GitBox <gi...@apache.org>.
jvanstraten commented on PR #12916:
URL: https://github.com/apache/arrow/pull/12916#issuecomment-1105027656

   I put it there (and it should also be in the C++ file) such that if someone modifies one, it's relatively obvious that they should also modify the other. I'd say that the mere existence of a JIRA somewhere is rather unlikely to accomplish that goal, but that's just my opinion; if it goes against Arrow project conventions, then remove it.
   
   That said... in my mind this isn't (going to be) an implementation-agnostic file at all, it describes exactly what the Arrow C++ consumer of Substrait understands. Right now it only specifies the types, but Substrait currently doesn't specify a way for YAML files to refer to each other, so unless support for that is added, all the functions would need to go in the same YAML file, or they wouldn't be able to use those types. And supporting that is hardly trivial -- how would you determine if two generalized URIs refer to the same file?
   
   I suppose you could also argue that the YAML file defines what a generalized Arrow Substrait consumer *should* support. In that case, though, ARROW-15535 should be closed and the comments at https://github.com/apache/arrow/blob/1dccb56a8328abde60238a890f720d7cc642cc94/cpp/src/arrow/engine/substrait/extension_set.cc#L223 and https://github.com/apache/arrow/blob/1dccb56a8328abde60238a890f720d7cc642cc94/cpp/src/arrow/engine/substrait/extension_set.cc#L248 should be updated accordingly, because it makes no sense to generate something that's implementation-agnostic from a particular implementation. It'd be a lot of manual, highly error-prone work to treat the file that way, though.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] kszucs merged pull request #12916: MINOR: [Format] Remove extraneous comment from extension_types.yaml

Posted by GitBox <gi...@apache.org>.
kszucs merged PR #12916:
URL: https://github.com/apache/arrow/pull/12916


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #12916: MINOR: [Format] Remove extraneous comment from extension_types.yaml

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #12916:
URL: https://github.com/apache/arrow/pull/12916#issuecomment-1104807358

   CC @jvanstraten mind taking a look?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] jvanstraten commented on pull request #12916: MINOR: [Format] Remove extraneous comment from extension_types.yaml

Posted by GitBox <gi...@apache.org>.
jvanstraten commented on PR #12916:
URL: https://github.com/apache/arrow/pull/12916#issuecomment-1105800137

   > A single file shouldn't have different URIs referring to it.
   
   I guess I was mainly thinking from the validator's perspective, where a URI is just something it resolves to a YAML file, in which case it's not unique at all. Then I started thinking about relative references, then about awkward versioning stuff (if we always have to use the exact same URI for Arrow's types, how can we ever change something without immediately wrecking everything previously out there? And if they're versioned, what's the granularity?), and then gave up thinking about it. I do agree that we'll need inter-YAML references at one point or another, though, and mandating that two URIs must be identical for them to be treated as the same extension is probably the right approach, unless the manner of referencing things is changed considerably.
   
   Anyway, it seems like you've thought about this in terms of compatibility across implementations a lot more than I have, which is ultimately more important, and what you're saying sounds reasonable to me.
   
   > I think this file should also contain kernels for all of the standard Substrait functions (e.g. `add_uint8_uint8`).
   
   Nitpicking, but I'm not sure what you mean by kernels here (did you mean prototypes?), and the function name would be just `add` in this case, with the overload referred to in the plan as `add:u!uint8_u!uint8` (I'm really only pointing it out because this part of the spec is not very obvious. The relevant bit is [here](https://substrait.io/extensions/#function-signature-compound-names); I missed it completely until Jacques pointed it out to me on slack.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow] westonpace commented on pull request #12916: MINOR: [Format] Remove extraneous comment from extension_types.yaml

Posted by GitBox <gi...@apache.org>.
westonpace commented on PR #12916:
URL: https://github.com/apache/arrow/pull/12916#issuecomment-1105722249

   > Right now it only specifies the types, but Substrait currently doesn't specify a way for YAML files to refer to each other, so unless support for that is added
   
   I'm assuming we will figure out how to support this.  We also need to be able to go the other way so we can add kernels to existing functions.
   
   > how would you determine if two generalized URIs refer to the same file?
   
   I'm not sure I understand this point.  A single file shouldn't have different URIs referring to it.  While that might be ok for URIs in general, we are using them as namespaces.  We already have this constraint.  If a producer asks for `file:///foo/bar/types.yaml#my_type` and the consumer only knows about `file:///foo/../foo/bar/types.yaml#my_type` then there will be a problem.
   
   > I suppose you could also argue that the YAML file defines what a generalized Arrow Substrait consumer should support.
   
   That is what I am arguing.  Arrow is a spec and it has a type system.  There should be a single URI (and ideally a single file) used to namespace all types that are in the Arrow spec but not in the Substrait spec.  I think this file should also contain kernels for all of the standard Substrait functions (e.g. `add_uint8_uint8`).
   
   For example, [DuckDb supports Arrow's type system wants a way to express unsigned integers](https://github.com/substrait-io/substrait/discussions/2#discussioncomment-2559648).  It would seem a tricky balance to ensure that DuckDb and Arrow are using the exact same URIs for uint8 if they each maintained their own definitions for this.
   
   I agree this means the file would not be automatically generated from any particular implementation.  I agree this leads to a bit of tedious work and we will need "spec verification tests" to make sure we support the spec names.  However, I don't know how you can have all Arrow consumers be compatible without doing this tedious work and that seems like a worthwhile goal.
   
   >  In that case, though, [ARROW-15535](https://issues.apache.org/jira/browse/ARROW-15535) should be closed and...
   
   I disagree.  I don't think every Arrow consumer will be identical.  I think we will eventually end up with three levels of "specification"
   
   * The Substrait spec: This is the most common and smallest subset of functionality.
   * The Substrait-Arrow spec: Extends the Substrait spec with all types in the Arrow type system.  Shared across consumers & producers that are willing to support Arrow.
   * The Substrait-Arrow-C++-Impl spec: This is an experimental proving ground for new functionality.  Should be automatically generated from the code.  It will be primarily used by functional tests of the Arrow consumer but a few functions might be used by producers that are tightly coupled to Arrow (e.g. pyarrow or arrow-r) to expose new functionality early before it has gone through the specification process.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org