Posted to github@arrow.apache.org by "westonpace (via GitHub)" <gi...@apache.org> on 2023/05/08 17:24:06 UTC

[GitHub] [arrow] westonpace commented on pull request #35453: GH-35331: [Python] Expose Parquet sorting metadata

westonpace commented on PR #35453:
URL: https://github.com/apache/arrow/pull/35453#issuecomment-1538756938

   > @westonpace This PR might interest you. It occurred to me while working with this we can persist the sort order of data in Parquet and retrieve it. Makes me wonder where we could integrate it into the rest of the code base.
   
   It is interesting.  Though one challenge would be that a single file doesn't necessarily tell you about an entire dataset.  For example, if the file foo.parquet is sorted by "date" then are all files in that dataset sorted by date?  Does foo.parquet come before bar.parquet?
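
   To make the question concrete, here is a rough sketch in plain Python. It does not use any Arrow or Parquet API: each "file" is just a hypothetical in-memory list of values for the sort column, and `global_file_order` is a made-up helper name. It only illustrates that per-file sortedness is not enough; the files' value ranges must also not overlap before any file order makes the whole dataset sorted.

```python
from typing import Dict, List, Optional


def is_sorted(values: List[int]) -> bool:
    """True if the values are in non-decreasing order."""
    return all(a <= b for a, b in zip(values, values[1:]))


def global_file_order(files: Dict[str, List[int]]) -> Optional[List[str]]:
    """Return a file order whose concatenation is sorted, or None.

    `files` maps a (hypothetical) file name to the values of its sort
    column. This is an illustration, not how Arrow datasets work.
    """
    # Every file must be sorted on its own.
    if not all(is_sorted(values) for values in files.values()):
        return None

    # Empty files place no constraint on the order, so skip them.
    non_empty = {name: v for name, v in files.items() if v}

    # Candidate order: by each file's (min, max) value.
    ordered = sorted(non_empty, key=lambda n: (non_empty[n][0], non_empty[n][-1]))

    # Adjacent files must not overlap: max of one <= min of the next.
    for prev, nxt in zip(ordered, ordered[1:]):
        if non_empty[prev][-1] > non_empty[nxt][0]:
            return None
    return ordered


disjoint = {
    "foo.parquet": [20230103, 20230104],
    "bar.parquet": [20230101, 20230102],
}
overlapping = {
    "foo.parquet": [20230101, 20230104],
    "bar.parquet": [20230102, 20230103],
}
print(global_file_order(disjoint))
print(global_file_order(overlapping))
```

   In the first case bar.parquet has to come before foo.parquet; in the second both files are sorted but no ordering of them yields a sorted dataset.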
   
   That being said, we could probably hook it into exec plans pretty easily for the cases where the source dataset is a single file.
   
   There may also be some benefit in knowing that individual files are sorted (even if the dataset is not) but I'm not quite sure how exactly we'd exploit that yet.
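
   One possibility, offered purely as an illustration and not as something the comment above proposes: if every file is known to be sorted, a globally sorted stream could be produced with a k-way merge instead of a full sort. The sketch below uses Python's standard `heapq.merge` on hypothetical in-memory lists standing in for files; `merge_sorted_files` is a made-up name, and nothing here reflects how Arrow exec plans are implemented.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_sorted_files(files: Iterable[List[int]]) -> Iterator[int]:
    """Lazily merge already-sorted inputs into one sorted stream.

    Assumes each input list is sorted; heapq.merge does not check this.
    """
    return heapq.merge(*files)


# Two individually sorted "files" whose ranges overlap.
foo = [1, 4, 7]
bar = [2, 3, 9]
merged = list(merge_sorted_files([foo, bar]))
print(merged)
```

   The merge keeps only one pending value per input in its heap, which is the kind of saving that knowing the per-file sort order could allow.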


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org