You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/02/25 13:43:02 UTC

[GitHub] [arrow-datafusion] carols10cents opened a new pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

carols10cents opened a new pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887


   # Which issue does this PR close?
   
   Closes #1832.
   
   # Rationale for this change
   
   It would be nice for other projects (such as influxdata/influxdb_iox) to be able to serialize DataFusion types as protobuf without needing to depend on Ballista.
   
   # What changes are included in this PR?
   
   This PR extracts a new crate, datafusion-serialization, that can serialize and deserialize DataFusion types as protocol buffers.
   
   Ballista now depends on datafusion-serialization. However, [prost isn't able to provide a way to share/import .proto files between crates](), so `ballista/rust/core/proto/ballista.proto` doesn't have a line that says `import "datafusion.proto"`. This is a problem for other crates that would want to depend on datafusion-serialization too. Some solutions I considered and ruled out:
   
   - Have the build.rs file of `ballista` (and any other crate that wants to depend on datafusion-serialization) download `datafusion.proto` from GitHub. This seems fragile and makes building `ballista` dependent on network access. This would also likely introduce mismatched version problems.
   - Symlink `datafusion-serialization/proto/datafusion.proto` into `ballista/rust/core/proto`, which would work for `ballista` but not for any crate outside of this repo.
   - Manually copy `datafusion-serialization/proto/datafusion.proto` into `ballista/rust/core/proto`, which is a chore no one wants to do and would probably also cause mismatched version problems.
   
   Not liking any of these solutions, I decided the best way is that fields in `ballista` or other crate protos that want to contain `datafusion` types should serialize them as `google::protobuf::Any`, then depend on the `datafusion-serialization` crate to handle the actual interpretation of the bytes as the Rust types.
   
   Unfortunately, [prost doesn't have built in support for this](https://github.com/tokio-rs/prost/issues/299), and [the workaround crates I've looked at](https://github.com/tokio-rs/prost/pull/336) seem to provide more functionality than is strictly needed here. So I implemented this using a `TypeUrl` trait and some functions. It's a little messy because I can't use TryFrom/TryInto because of the orphan rule, so a bunch of the ballista from/to proto code needed to be updated.
   
   If there are other solutions I haven't thought of, I'd love to hear them!
   
   There's also plenty of further refactoring that could be done, but this PR is going to be a big review in any case.
   
   # Are there any user-facing changes?
   
   I'm not sure how strictly backwards compatibility is considered for ballista.proto-- I changed the types of a bunch of the fields, which isn't backwards compatible. If you'd rather I reserve the existing field names and pick a new name for the fields with the new type, let me know and I'm happy to do so. But as-is, it would be a user-facing change for anyone using the Ballista protobuf definitions.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] carols10cents commented on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1051306880


   > Would it be worth shortening the crate name to `datafusion-serde`? Another option would be to name it `datafusion-proto`.
   
   I would be worried that `datafusion-serde` would cause people to assume the `serde` crate is involved. `datafusion-proto` seems ok!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1887: Create a `datafusion-proto` crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1063398244


   We'll just keep iterating I think


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] carols10cents commented on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1057276570


   That all sounds great! For this repo, would a symlink suffice? I think `cargo package` will still put the file in the ballista crate even if it's a symlink in the repo; I will verify.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1060972830


   FYI the https://github.com/datafusion-contrib/datafusion-substrait repo from @andygrove  may be related to this (as in maybe it eventually removes protobuf serialization). 
   
   Perhaps to plan for that eventually we could keep the serialization API operating on an opaque format (like `fn serialize(expr: Expr) -> Vec<u8>`) 🤔 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1061153712


   > That should be pretty easy to add - would you like me to do that in this PR or in a future PR?
   
   A future PR is good in my opinion


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] tustvold edited a comment on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
tustvold edited a comment on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1050884805


   > However, prost isn't able to provide a way to share/import .proto files between crates
   
   This link appears to just link back to this issue? FWIW it is fairly common for API specifications be they protobuf, OpenAPI, etc... to simply be vended into the client repositories. It's kind of gross, but it works. I guess it also gives you some notion of the version of the API that client is using... These proto specs shouldn't change in backwards incompatible ways, and so if your client is a bit out of date, it shouldn't matter by design.
   
   > types should serialize them as google::protobuf::Any
   
   My experience with `protobuf::Any` even in languages that purport to support it, e.g. Golang and python, is not great. The fact protobuf isn't self-describing makes interacting with these opaque blobs extremely frustrating, and it is extremely common for things to simply not work.
   
   To give an example of this, IOx's storage gRPC API has a `protobuf::Any` bundled in it, which prevents using tools like grpcui or grpcurl with it. In the end we had to add a custom CLI command to call these APIs in no small part because the ecosystem's support for `protobuf::Any` is so inconsistent.
   
   I dunno, perhaps there is no other option, but using `protobuf::Any` for a field that has a defined concrete type feels like a very large hammer...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] andygrove commented on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1050976389


   I like the idea of moving this to a separate crate. Would it be worth shortening the crate name to `datafusion-serde`? Another option would be to name it `datafusion-proto`. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] tustvold commented on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
tustvold commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1050884805


   > However, prost isn't able to provide a way to share/import .proto files between crates
   
   This link appears to just link back to this issue? FWIW it is fairly common for API specifications be they protobuf, OpenAPI, etc... to simply be vended into the client repositories. It's kind of gross, but it works. I guess it also gives you some notion of the version of the API that client is using... These proto specs shouldn't change in backwards incompatible ways, and so if your client is a bit out of date, it shouldn't matter by design.
   
   > types should serialize them as google::protobuf::Any
   
   My experience with `protobuf::Any` even in languages that purport to support it, e.g. Golang and python, is not great. The fact protobuf isn't self-describing makes interacting with these opaque blobs extremely frustrating, and it is extremely common for things to simply not work.
   
   To give an example of this, IOx's read filter gRPC API has a `protobuf::Any` bundled in it, which prevents using tools like grpcui or grpcurl with it. In the end we had to add a custom CLI command to call these APIs in no small part because the ecosystem's support for `protobuf::Any` is so inconsistent.
   
   I dunno, perhaps there is no other option, but using `protobuf::Any` for a field that has a defined concrete type feels like a very large hammer...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1057289057


   I think a symlink for this repo would be great


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] carols10cents commented on pull request #1887: Create a `datafusion-proto` crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1063332918


   > I wonder if it would make sense to create a PR in the [influxdata/influxdb_iox](https://github.com/influxdata/influxdb_iox) repository that uses this PR / new crate as a proof of concept to how it would be used?
   
   It does make sense, [done](https://github.com/influxdata/influxdb_iox/pull/3997)! It seems to work pretty well except for with pbjson....


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] carols10cents commented on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1050907783


   > This link appears to just link back to this issue?
   
   Oops, sorry, forgot to fill in that link. I've fixed it now-- meant to link to https://github.com/tokio-rs/prost/issues/422.
   
   > FWIW it is fairly common for API specifications be they protobuf, OpenAPI, etc... to simply be manually vended into the client repositories. It's kind of gross, but it works.
   
   Yeeeeep, this also feels gross to me, but I'm happy to change to that if maintainers would like.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] tustvold edited a comment on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
tustvold edited a comment on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1050884805


   > However, prost isn't able to provide a way to share/import .proto files between crates
   
   This link appears to just link back to this issue? FWIW it is fairly common for API specifications be they protobuf, OpenAPI, etc... to simply be manually vended into the client repositories. It's kind of gross, but it works. I guess it also gives you some notion of the version of the API that client is using... These proto specs shouldn't change in backwards incompatible ways, and so if your client is a bit out of date, it shouldn't matter by design.
   
   > types should serialize them as google::protobuf::Any
   
   My experience with `protobuf::Any` even in languages that purport to support it, e.g. Golang and python, is not great. The fact protobuf isn't self-describing makes interacting with these opaque blobs extremely frustrating, and it is extremely common for things to simply not work.
   
   To give an example of this, IOx's storage gRPC API has a `protobuf::Any` bundled in it, which prevents using tools like grpcui or grpcurl with it. In the end we had to add a custom CLI command to call these APIs in no small part because the ecosystem's support for `protobuf::Any` is so inconsistent.
   
   I dunno, perhaps there is no other option, but using `protobuf::Any` for a field that has a defined concrete type feels like a very large hammer...


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] carols10cents commented on pull request #1887: Create a new datafusion-serialization crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
carols10cents commented on pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887#issuecomment-1061144961


   @alamb @tustvold Ok, I think this is ready for re-review, all the `Any` stuff is gone :) 
   
   
   > FYI the [datafusion-contrib/datafusion-substrait](https://github.com/datafusion-contrib/datafusion-substrait) repo from @andygrove may be related to this (as in maybe it eventually removes protobuf serialization).
   > 
   > Perhaps to plan for that eventually we could keep the serialization API operating on an opaque format (like `fn serialize(expr: Expr) -> Vec<u8>`) 🤔
   
   That should be pretty easy to add - would you like me to do that in this PR or in a future PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] Jimexist merged pull request #1887: Create a `datafusion-proto` crate for datafusion protobuf serialization

Posted by GitBox <gi...@apache.org>.
Jimexist merged pull request #1887:
URL: https://github.com/apache/arrow-datafusion/pull/1887


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org