Posted to github@arrow.apache.org by "suremarc (via GitHub)" <gi...@apache.org> on 2023/06/26 18:39:06 UTC

[GitHub] [arrow-datafusion] suremarc commented on issue #6672: Optimization: Avoid sort for already sorted Parquet files that do not overlap values on condition

suremarc commented on issue #6672:
URL: https://github.com/apache/arrow-datafusion/issues/6672#issuecomment-1608027489

   I have had a somewhat overlapping (no pun intended) issue: DataFusion abandons the `SortPreservingMergeStream` and does a global sort if any file group contains more than one file. DataFusion should be able to recognize that, when the files are non-overlapping, the file groups can be reordered to satisfy the required output ordering. In effect, we would partition a poset of files into a series of chains, where A < B if A and B are non-overlapping and every row in A sorts before every row in B. Each chain then becomes one file group in the physical plan, and its files are read sequentially. Using file statistics and partition columns, it should be possible to perform this analysis without reading any rows.
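A minimal sketch of the chain partitioning described above, not DataFusion's actual implementation. It assumes each file can be summarized by the minimum and maximum of a single integer sort key (as might be derived from Parquet statistics or partition columns). `FileRange` and `partition_into_chains` are hypothetical names introduced only for illustration, and the sketch treats a file whose minimum equals the previous file's maximum as still ordered after it.

```rust
/// Hypothetical per-file summary: the min and max of the sort key,
/// e.g. taken from file statistics without reading any rows.
#[derive(Debug, Clone, PartialEq)]
struct FileRange {
    name: String,
    min: i64,
    max: i64,
}

/// Greedily partition files into chains. Within a chain, each file's
/// range ends at or before the next file's range begins, so reading the
/// chain's files sequentially yields rows in sorted order (assuming each
/// file is itself sorted on the key).
fn partition_into_chains(mut files: Vec<FileRange>) -> Vec<Vec<FileRange>> {
    // Process files in order of their smallest key.
    files.sort_by_key(|f| (f.min, f.max));

    let mut chains: Vec<Vec<FileRange>> = Vec::new();
    for file in files {
        // Look for an existing chain whose last file ends at or before
        // the point where this file starts.
        let slot = chains
            .iter()
            .position(|chain| chain.last().map_or(false, |last| last.max <= file.min));

        match slot {
            // Append to the first compatible chain.
            Some(i) => chains[i].push(file),
            // Every existing chain overlaps this file: start a new chain.
            None => chains.push(vec![file]),
        }
    }
    chains
}
```

Under these assumptions, each returned chain would correspond to one file group, and the groups could then be combined by a sort-preserving merge rather than a global sort. For example, files with key ranges [0, 9], [5, 15], [16, 18], and [20, 29] would form two chains: the first, third, and fourth files in one, and the second file alone in the other.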

