Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/11 15:34:23 UTC

[GitHub] [arrow-rs] alamb opened a new issue, #3090: Expose `SortingColumn` when reading and writing parquet metadata

alamb opened a new issue, #3090:
URL: https://github.com/apache/arrow-rs/issues/3090

   **Is your feature request related to a problem or challenge? Please describe what you are trying to do.**
   Storing sorted data in parquet is often a key performance technique, as it "clusters" data in ways that can make predicate evaluation and other query techniques faster.
   
   The parquet file format provides a way to encode the sortedness of the stored data using a `SortingColumn` structure:
   https://github.com/apache/parquet-format/blob/54e53e5d7794d383529dd30746378f19a12afd58/src/main/thrift/parquet.thrift#L685-L698
   
   A list of these is then stored in the `RowGroup` metadata:
   https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L829-L832
    
   However, I have not yet found any code in the parquet crate that reads or writes this metadata:
   https://sourcegraph.com/search?q=context:global+repo:%5Egithub%5C.com/apache/arrow-rs%24+SortingColumn&patternType=standard
   
   
   **Describe the solution you'd like**
   
   I would like some way to give the parquet writer the `SortingColumn` list when it creates `RowGroupMetaData`.
   
   Perhaps we could add something to `WriterProperties`:
   
   https://docs.rs/parquet/26.0.0/parquet/file/properties/struct.WriterProperties.html
   
   Likewise, I would like a way to get the relevant `SortingColumn` list from `RowGroupMetaData`:
   https://docs.rs/parquet/26.0.0/parquet/file/metadata/struct.RowGroupMetaData.html
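
As a rough illustration of the request above, here is a minimal, self-contained sketch of what such an API could look like. It does not use the parquet crate: `WriterPropertiesSketch`, `RowGroupMetaDataSketch`, `with_sorting_columns`, and `from_properties` are hypothetical names invented for this example. Only the three `SortingColumn` fields (`column_idx`, `descending`, `nulls_first`) are taken from parquet.thrift.

```rust
/// Mirrors the three fields of the `SortingColumn` struct in parquet.thrift.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct SortingColumn {
    /// Index of the column within the row group.
    pub column_idx: i32,
    /// True if the column is sorted in descending order.
    pub descending: bool,
    /// True if nulls come before non-null values.
    pub nulls_first: bool,
}

/// Hypothetical stand-in for `WriterProperties` (not the real type).
#[derive(Debug, Clone, Default)]
pub struct WriterPropertiesSketch {
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl WriterPropertiesSketch {
    /// Declare the sort order the caller promises the data follows.
    pub fn with_sorting_columns(mut self, cols: Vec<SortingColumn>) -> Self {
        self.sorting_columns = Some(cols);
        self
    }
}

/// Hypothetical stand-in for `RowGroupMetaData` (not the real type).
#[derive(Debug, Clone)]
pub struct RowGroupMetaDataSketch {
    num_rows: i64,
    sorting_columns: Option<Vec<SortingColumn>>,
}

impl RowGroupMetaDataSketch {
    /// What a writer might do when closing a row group:
    /// copy the declared sort order into the row group metadata.
    pub fn from_properties(props: &WriterPropertiesSketch, num_rows: i64) -> Self {
        Self {
            num_rows,
            sorting_columns: props.sorting_columns.clone(),
        }
    }

    pub fn num_rows(&self) -> i64 {
        self.num_rows
    }

    /// Reader-side accessor: `None` means no sort order was recorded.
    pub fn sorting_columns(&self) -> Option<&[SortingColumn]> {
        self.sorting_columns.as_deref()
    }
}
```

In this sketch the writer trusts the caller's declaration and simply records it, so a reader can tell "no order recorded" (`None`) apart from a recorded order.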
   
   
   **Describe alternatives you've considered**
   It might be worth having the parquet writer determine automatically whether the data is sorted, which may be better than requiring the caller to verify it. However, verifying in the writer would likely incur a significant performance hit.
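
To give a sense of the verification cost mentioned above, here is a minimal sketch of checking one nullable `i64` column against a declared order. The helper names `compare` and `is_sorted` are hypothetical, and the single-column restriction is an assumption made to keep the example short. A real writer would have to run a check like this over every declared sorting column, across all rows, with ties on earlier columns broken by later ones.

```rust
use std::cmp::Ordering;

/// Compare two nullable values under a given direction and null placement.
fn compare(a: Option<i64>, b: Option<i64>, descending: bool, nulls_first: bool) -> Ordering {
    match (a, b) {
        (None, None) => Ordering::Equal,
        (None, Some(_)) => {
            if nulls_first { Ordering::Less } else { Ordering::Greater }
        }
        (Some(_), None) => {
            if nulls_first { Ordering::Greater } else { Ordering::Less }
        }
        (Some(x), Some(y)) => {
            if descending { y.cmp(&x) } else { x.cmp(&y) }
        }
    }
}

/// Returns true if every adjacent pair of values is in the declared order.
/// This takes one comparison per row, i.e. O(n) extra work per column.
pub fn is_sorted(values: &[Option<i64>], descending: bool, nulls_first: bool) -> bool {
    values
        .windows(2)
        .all(|w| compare(w[0], w[1], descending, nulls_first) != Ordering::Greater)
}
```

Even this simple check touches every value once more on the write path, which is the overhead the alternative would add.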
   
   I also 
   
   
   **Additional context**
   DataFusion is getting more sophisticated in its ability to track and use sortedness information (e.g. https://github.com/apache/arrow-datafusion/pull/4122). If this metadata were included in the parquet file, DataFusion might be able to take more advantage of it (TODO datafusion ticket link).
   
   
   There is more discussion of this topic here: https://github.com/apache/arrow-datafusion/issues/4169#issuecomment-1311572149
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-rs] alamb commented on issue #3090: Expose `SortingColumn` when reading and writing parquet metadata

Posted by GitBox <gi...@apache.org>.
alamb commented on issue #3090:
URL: https://github.com/apache/arrow-rs/issues/3090#issuecomment-1314312904

   It seems to have been around for a while: https://github.com/apache/parquet-format/commit/934da01871738a54e545a29ed52cd62d99c9e3d9
   
   But I don't know the history.
   
   Interesting




[GitHub] [arrow-rs] tustvold closed issue #3090: Expose `SortingColumn` when reading and writing parquet metadata

Posted by GitBox <gi...@apache.org>.
tustvold closed issue #3090: Expose `SortingColumn` when reading and writing parquet metadata
URL: https://github.com/apache/arrow-rs/issues/3090




[GitHub] [arrow-rs] askoa commented on issue #3090: Expose `SortingColumn` when reading and writing parquet metadata

Posted by GitBox <gi...@apache.org>.
askoa commented on issue #3090:
URL: https://github.com/apache/arrow-rs/issues/3090#issuecomment-1312758758

   I'll attempt this one.




[GitHub] [arrow-rs] mingmwang commented on issue #3090: Expose `SortingColumn` when reading and writing parquet metadata

Posted by GitBox <gi...@apache.org>.
mingmwang commented on issue #3090:
URL: https://github.com/apache/arrow-rs/issues/3090#issuecomment-1312963245

   It looks like even the Java parquet implementation does not read the sorting column info when it reads the footer and converts the parquet metadata.
   
   https://github.com/apache/parquet-mr/blob/master/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java
   

