You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/25 03:09:09 UTC

[GitHub] [arrow-datafusion] tustvold opened a new pull request, #3950: Vendor Generated Protobuf Code (#3947)

tustvold opened a new pull request, #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950

   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #3538 
   Closes #3947
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   We don't want to require downstreams to install PROTOC, but we also don't want to allow the generated code to get out of sync. This is the approach we use with arrow-rs, and it appears to work fairly well. Specifically we don't bundle the .proto files in the released crate, and only generate the code if the .proto files exist.
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   
   datafusion-proto now always has a build dependency on pbjson-build. This is a fairly lightweight dependency and makes the logic significantly easier to follow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#issuecomment-1290784242

   > What do you think about this?
   
   My main concern is about making forward progress for the project, and I think not having our dependencies install `protoc` is essential for that. Both our PRs (see [3948](https://github.com/apache/arrow-datafusion/pull/3948) accomplish that goal.
   
   My secondary concern is for general project maintainability, and for that I think there's issues either way:
   
   checked in pros:
   1. it's less complex from a technical standpoint
   2. it's easier to ensure docs.rs works
   
   checked in cons:
   1. different versions of protoc will generate different output
   3. we'll need to document how to run protoc and keep it up to date
   4. formatters will create git conflicts
   5. the proto files and the checked in code can get out of sync
   6. users might edit the generated code accidentally
   
   generated pros:
   1. the "download the binary and unzip" should be transparent to most users
   2. anyone is free to use their own compiler
   3. redundant derived data is not stored in source control
   4. it is simpler from an operational perspective
   
   generated cons:
   1. it is more complex technically
   2. it requires maintenance (as does checking in)
   
   I must highlight though that I am concerned this does not resolve #3538 as that appears to be working as per my comment there.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#discussion_r1004866574


##########
.github/workflows/rust.yml:
##########
@@ -81,16 +81,6 @@ jobs:
       - uses: actions/checkout@v3
         with:
           submodules: true
-      - name: Install protobuf compiler

Review Comment:
   This doesn't appear to be necessary anymore, the version in the system package manager is recent enough (only 2 years out of date) :laughing: 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#discussion_r1003958558


##########
dev/release/rat_exclude_files.txt:
##########
@@ -129,3 +129,5 @@ Cargo.lock
 .history
 parquet-testing/*
 *rat.txt
+datafusion/proto/src/generated/pbjson.rs

Review Comment:
   There is precedent in arrow for excluding generated files - https://github.com/apache/arrow-rs/blob/master/dev/release/rat_exclude_files.txt#L20



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#issuecomment-1290826101

   I am ok with either approach, but I think I would prefer to check in the generated sources rather than have build.rs download protoc. It seems like this could be a security risk and also adds complexity. I don't want to try and support a user who hits some issue with protoc not working on their system, I am possibly just trying to avoid extra work here. :sweat_smile: 
   
   > formatters will create git conflicts
   
   We can configure rustfmt to ignore the generated files (https://rust-lang.github.io/rustfmt/?version=v1.5.1&search=#ignore). @tustvold would you be ok with adding that to this PR?
   
   > the proto files and the checked-in code can get out of sync
   
   We should have CI prevent that from happening. Would be good to include that in this PR if possible.
   
   > users might edit the generated code accidentally
   
   This is bound to happen, but the files are in a `generated` directory, so I think people will learn pretty quickly not to do this.
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] isidentical commented on pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
isidentical commented on PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#issuecomment-1290746425

   @tustvold does it make sense to also include a GH action to ensure that PRs that actually touch the source also generate the latest version of the declarations? Something like the following: 
   https://github.com/apache/arrow-datafusion/blob/7559c4425e6f32655c6d09e8ed17c9c51896472b/.github/workflows/rust.yml#L65-L67


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
avantgardnerio commented on PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#issuecomment-1290837112

   I'd approve this PR if:
   
   1. generated files have a formatter exception
   2. there's clear doc on how to run protoc in a readme somewhere


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#issuecomment-1289916875

   @avantgardnerio What do you think about this?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#discussion_r1004873121


##########
docs/source/contributor-guide/index.md:
##########
@@ -44,6 +44,28 @@ git-bash.exe
 cargo build
 ```
 
+## Protoc Installation
+
+Compiling DataFusion from sources requires an installed version of the protobuf compiler, `protoc`.
+
+On most platforms this can be installed from your system's package manager
+
+```
+$ apt install -y protobuf-compiler
+$ dnf install -y protobuf-compiler
+$ pacman -S protobuf-compiler
+$ brew install protobuf
+```
+
+You will want to verify the version installed is `3.12` or greater, which introduced support for explicit [field presence](https://github.com/protocolbuffers/protobuf/blob/v3.12.0/docs/field_presence.md). Older versions may fail to compile.

Review Comment:
   DataFusion doesn't actually use this feature, for good reason imo, but this is the only major functionality change to the core protoc functionality that may cause compatibility issues for us. More importantly 3.12 is the version currently used in CI



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
andygrove commented on PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#issuecomment-1290848178

   Can we provide a Dockerfile with protoc for generating the code?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on a diff in pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
tustvold commented on code in PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#discussion_r1004873121


##########
docs/source/contributor-guide/index.md:
##########
@@ -44,6 +44,28 @@ git-bash.exe
 cargo build
 ```
 
+## Protoc Installation
+
+Compiling DataFusion from sources requires an installed version of the protobuf compiler, `protoc`.
+
+On most platforms this can be installed from your system's package manager
+
+```
+$ apt install -y protobuf-compiler
+$ dnf install -y protobuf-compiler
+$ pacman -S protobuf-compiler
+$ brew install protobuf
+```
+
+You will want to verify the version installed is `3.12` or greater, which introduced support for explicit [field presence](https://github.com/protocolbuffers/protobuf/blob/v3.12.0/docs/field_presence.md). Older versions may fail to compile.

Review Comment:
   DataFusion doesn't actually use this feature, for good reason imo, but this is the only major functionality change to the core protoc functionality that may cause compatibility issues for us.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] ursabot commented on pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
ursabot commented on PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#issuecomment-1293090709

   Benchmark runs are scheduled for baseline = fa5cd7f30d785c5b3355e425e082a9d5a91bf567 and contender = 11c5255ac01e12e6b4607fc51bc0f358fb1c245a. 11c5255ac01e12e6b4607fc51bc0f358fb1c245a is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
   Conbench compare runs links:
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ec2-t3-xlarge-us-east-2] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/d64deda306254476907cfe28a73569e4...1318a1b03ab2407b8a9ae9b51ee60080/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on test-mac-arm] [test-mac-arm](https://conbench.ursa.dev/compare/runs/0a74e25e7c0744c4a6c7fd59378ebee6...9c1bd01712ce4481a2d950bd5dbad978/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-i9-9960x] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/ff952705cb464ecb80de16d2520a80e2...a2fbf57e7da541b294ea0c721a0d21f2/)
   [Skipped :warning: Benchmarking of arrow-datafusion-commits is not supported on ursa-thinkcentre-m75q] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/60210da92eaa4b62af37086b4ae76a75...45c7f5a860094264ac4b7df3301e2aac/)
   Buildkite builds:
   Supported benchmarks:
   ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
   test-mac-arm: Supported benchmark langs: C++, Python, R
   ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
   ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold merged pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
tustvold merged PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] tustvold commented on pull request #3950: Vendor Generated Protobuf Code (#3947)

Posted by GitBox <gi...@apache.org>.
tustvold commented on PR #3950:
URL: https://github.com/apache/arrow-datafusion/pull/3950#issuecomment-1290965838

   > generated files have a formatter exception
   
   Already in place
   
   > there's clear doc on how to run protoc in a readme somewhere
   
   This is still automatically run as part of build.rs, I will add some docs though
   
   > different versions of protoc will generate different output
   
   Protoc is only used to generate a descriptor set, not the generated Rust code which is versioned by prost-build. The output should therefore be relatively stable.
   
   > dependents don't have to install protoc
   
   I suspect dependencies by git sha may still need protoc installed - as this is effectively a source dependency I think this is fine
   
   > We should have CI prevent that from happening
   
   I will add a CI check to here and arrow-rs
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org