Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2021/07/21 17:37:05 UTC

[GitHub] [arrow-rs] carols10cents opened a new issue #589: Parquet file content is different if `~/.cargo` is in a git checkout

carols10cents opened a new issue #589:
URL: https://github.com/apache/arrow-rs/issues/589


   **Describe the bug**
   
   I check my home directory into git. My home directory contains `.cargo`, my `CARGO_HOME` directory. When I write a Parquet file, its `FileMetaData` contains:
   
   ```
   created_by: Some(
       "parquet-rs version 5.0.0 (build 3ef76a677716df403a13964a58351abe37c1754d)",
   ),
   ```
   
   That SHA is of a commit in my home directory, not in Parquet, and not in the project using Parquet.
   
   I have a test in the project that verifies the size of the Parquet file data. The test was failing for me because the content was 49 bytes too large, which is exactly the length of the extra ` (build <SHA>)` text above. I verified that in CI the test passes, and the `FileMetaData` under test contains:
   
   ```
   created_by: Some(
       "parquet-rs version 5.0.0",
   ),
   ```
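   
   As a quick check of the arithmetic, the two `created_by` strings quoted above can be compared directly. This is only an illustrative sketch; the helper name `extra_bytes` is hypothetical and not part of parquet-rs.
   
   ```rust
   // The two `created_by` values quoted in this report.
   const LOCAL: &str =
       "parquet-rs version 5.0.0 (build 3ef76a677716df403a13964a58351abe37c1754d)";
   const CI: &str = "parquet-rs version 5.0.0";
   
   /// Hypothetical helper: how many more bytes `longer` has than `shorter`.
   fn extra_bytes(longer: &str, shorter: &str) -> usize {
       longer.len() - shorter.len()
   }
   
   fn main() {
       // " (build " is 8 bytes, the SHA is 40, and the closing ")" is 1.
       println!("{}", extra_bytes(LOCAL, CI)); // prints 49
   }
   ```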
   
   **To Reproduce**
   
   - Check your home directory into git, or alternatively set `CARGO_HOME` to a directory inside a git repository.
   - Generate a Parquet file and check the metadata.
   - Observe that `created_by` contains the commit hash of the git repository that `CARGO_HOME` is in.
   
   I'm not sure it's going to be possible to create a failing test for this, given the environmental aspect. [The current test](https://github.com/apache/arrow-rs/blob/30f1b1fe8681914d0bd8fc5062338aa78f35b1f1/parquet/src/file/properties.rs#L525) only checks that the `created_by` value is the value of the `PARQUET_CREATED_BY` environment variable, but the problem is what gets into the `PARQUET_CREATED_BY` environment variable in the first place.
   
   **Expected behavior**
   
   I expected to get the exact same Parquet file content whether my home directory is checked into Git or not 🤣 
   
   **Additional context**
   
   The `PARQUET_CREATED_BY` environment variable is set [in the build script](https://github.com/apache/arrow-rs/blob/30f1b1fe8681914d0bd8fc5062338aa78f35b1f1/parquet/build.rs#L24-L27) if `git rev-parse HEAD` returns a value. Considering this is only getting set if you have a non-standard setup like I do, I think this should just be removed entirely. I'm going to prepare a PR for discussion with this solution :)
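   
   The behaviour described above can be sketched as a minimal build script. This is not the actual `parquet/build.rs`; the function names `git_head` and `created_by` and the hard-coded version are assumptions made for illustration. Cargo runs a build script with the package's source directory as its working directory, and for a registry dependency that directory lives under `CARGO_HOME`, so `git rev-parse HEAD` would presumably resolve to whatever repository encloses `CARGO_HOME`.
   
   ```rust
   use std::process::Command;
   
   /// Hypothetical helper: the trimmed output of `git rev-parse HEAD`, or `None`
   /// if git is unavailable, exits with an error, or prints nothing.
   fn git_head() -> Option<String> {
       let output = Command::new("git")
           .args(&["rev-parse", "HEAD"])
           .output()
           .ok()?;
       if !output.status.success() {
           return None;
       }
       let hash = String::from_utf8(output.stdout).ok()?.trim().to_string();
       if hash.is_empty() {
           None
       } else {
           Some(hash)
       }
   }
   
   /// Hypothetical helper: builds the `created_by` string in the two shapes
   /// quoted in this report, with and without a build hash.
   fn created_by(version: &str, git_hash: Option<&str>) -> String {
       match git_hash {
           Some(hash) => format!("parquet-rs version {} (build {})", version, hash),
           None => format!("parquet-rs version {}", version),
       }
   }
   
   fn main() {
       // Assumption: version hard-coded here; a real build script would read it
       // from the package metadata.
       let version = "5.0.0";
       let hash = git_head();
       // `cargo:rustc-env=KEY=VALUE` makes KEY visible to `env!` in the crate.
       println!(
           "cargo:rustc-env=PARQUET_CREATED_BY={}",
           created_by(version, hash.as_deref())
       );
   }
   ```
   
   With the git lookup removed, only the `None` branch remains, and the string no longer depends on the directory in which the crate happens to be built.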


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-rs] nevi-me closed issue #589: Parquet file content is different if `~/.cargo` is in a git checkout

Posted by GitBox <gi...@apache.org>.
nevi-me closed issue #589:
URL: https://github.com/apache/arrow-rs/issues/589


   

