Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/11 15:34:23 UTC
[GitHub] [arrow-rs] alamb opened a new issue, #3090: Expose `SortingColumn` when reading and writing parquet metadata
alamb opened a new issue, #3090:
URL: https://github.com/apache/arrow-rs/issues/3090
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
Storing sorted data in parquet is often a key performance technique, as it "clusters" data in ways that can make predicate evaluation and other query techniques faster.
The parquet file format contains a way to encode the sortedness of data stored there using a "SortingColumn" in the format
https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698
This is then referenced from the RowGroup metadata:
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832
However, I did not find any code to read/write this metadata yet in the parquet crate
https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard
**Describe the solution you'd like**
I would like some way to provide the `SortingColumn` list to the parquet writer when it creates `RowGroupMetaData`.
Perhaps we could add something to the `WriterProperties`
https://docs.rs/parquet/26.0.0/parquet/file/properties/struct.WriterProperties.html
Likewise, I would like a way to get the relevant `SortingColumn` list from `RowGroupMetadata`:
https://docs.rs/parquet/26.0.0/parquet/file/metadata/struct.RowGroupMetaData.html
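To make the request concrete, here is a rough sketch of what such an API could look like. The `SortingColumn` fields mirror the Thrift definition linked above (`column_idx`, `descending`, `nulls_first`). Everything else is an assumption for illustration: `WriterPropertiesSketch`, `RowGroupMetaDataSketch`, `with_sorting_columns`, and `from_writer` are hypothetical names, not existing items in the parquet crate.

```rust
/// Mirrors the fields of the Thrift `SortingColumn` struct in parquet.thrift.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    /// Index of the column within the row group.
    pub column_idx: i32,
    /// True if the column is sorted in descending order.
    pub descending: bool,
    /// True if nulls come before non-null values, otherwise nulls go last.
    pub nulls_first: bool,
}

/// Hypothetical stand-in for `WriterProperties` (not the real crate type).
#[derive(Debug, Default, Clone)]
pub struct WriterPropertiesSketch {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterPropertiesSketch {
    /// Hypothetical builder-style setter: the caller declares the sort order.
    pub fn with_sorting_columns(mut self, cols: Vec<SortingColumn>) -> Self {
        self.sorting_columns = Some(cols);
        self
    }

    /// Returns the declared sort order, if any.
    pub fn sorting_columns(&self) -> Option<&[SortingColumn]> {
        self.sorting_columns.as_deref()
    }
}

/// Hypothetical stand-in for `RowGroupMetaData` (not the real crate type).
#[derive(Debug, Clone)]
pub struct RowGroupMetaDataSketch {
    num_rows: i64,
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl RowGroupMetaDataSketch {
    /// Hypothetical: copy the declared sort order into the row group
    /// metadata when the writer closes a row group.
    pub fn from_writer(props: &WriterPropertiesSketch, num_rows: i64) -> Self {
        Self {
            num_rows,
            sorting_columns: props.sorting_columns.clone(),
        }
    }

    /// Number of rows in this row group.
    pub fn num_rows(&self) -> i64 {
        self.num_rows
    }

    /// Reader side: the sort order recorded for this row group, if any.
    pub fn sorting_columns(&self) -> Option<&[SortingColumn]> {
        self.sorting_columns.as_deref()
    }
}
```

In this sketch the writer trusts the caller's declaration and only copies it into the metadata; a reader then gets `None` when no sort order was recorded.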
**Describe alternatives you've considered**
It might be worth having the parquet writer determine automatically whether the data is sorted, which might be better than requiring the caller to verify it. However, verifying in the writer would likely incur a significant performance hit.
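As a rough illustration of where that cost would come from, the sketch below checks a single nullable `i64` column against a sort specification. It is an assumption-laden example, not parquet crate code: `compare` and `is_sorted` are hypothetical helpers, and a real writer would need type-specific comparisons and multi-column ordering. Even in this simplified form, the check performs one comparison per adjacent pair of rows in every row group.

```rust
use std::cmp::Ordering;

/// Compare two nullable values under a single-column sort specification.
/// Null placement follows `nulls_first` independently of `descending`.
fn compare(a: &Option<i64>, b: &Option<i64>, descending: bool, nulls_first: bool) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(x), Some(y)) => {
            if descending { y.cmp(x) } else { x.cmp(y) }
        }
    }
}

/// Returns true if no adjacent pair of values is out of order.
/// This is a linear scan over the whole column.
pub fn is_sorted(values: &[Option<i64>], descending: bool, nulls_first: bool) -> bool {
    values
        .windows(2)
        .all(|w| compare(&w[0], &w[1], descending, nulls_first) != Ordering::Greater)
}
```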
I also
**Additional context**
DataFusion is getting more sophisticated in its ability to track and use sortedness information (e.g. https://github.com/apache/arrow-datafusion/pull/4122). If this metadata were included in the parquet file, DataFusion might be able to take more advantage of it (TODO datafusion ticket link)
There is more discussion about this topic here https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311572149
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-rs] alamb commented on issue #3090: Expose `SortingColumn` when reading and writing parquet metadata
Posted by GitBox <gi...@apache.org>.
alamb commented on issue #3090:
URL: https://github.com/apache/arrow-rs/issues/3090#issuecomment-1314312904
It seems to have been around a while https://github.com/apache/parquet-format/commit/934da01871738a54e545a29ed52cd62d99c9e3d9
But I don't know the history.
Interesting
[GitHub] [arrow-rs] tustvold closed issue #3090: Expose `SortingColumn` when reading and writing parquet metadata
Posted by GitBox <gi...@apache.org>.
tustvold closed issue #3090: Expose `SortingColumn` when reading and writing parquet metadata
URL: https://github.com/apache/arrow-rs/issues/3090
[GitHub] [arrow-rs] askoa commented on issue #3090: Expose `SortingColumn` when reading and writing parquet metadata
Posted by GitBox <gi...@apache.org>.
askoa commented on issue #3090:
URL: https://github.com/apache/arrow-rs/issues/3090#issuecomment-1312758758
I'll attempt this one.
[GitHub] [arrow-rs] mingmwang commented on issue #3090: Expose `SortingColumn` when reading and writing parquet metadata
Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #3090:
URL: https://github.com/apache/arrow-rs/issues/3090#issuecomment-1312963245
It looks like even the Java parquet implementation does not read the sorting columns info when it reads the footer and converts the parquet metadata.
https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java