Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/11/14 03:02:26 UTC

[GitHub] [arrow-datafusion] mingmwang commented on a diff in pull request #4170: Add ability to specify external sort information for ListingTables

mingmwang commented on code in PR #4170:
URL: https://github.com/apache/arrow-datafusion/pull/4170#discussion_r1021037377


##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -220,6 +225,16 @@ pub struct ListingOptions {
     /// Group files to avoid that the number of partitions exceeds
     /// this limit
     pub target_partitions: usize,
+    /// Optional pre-known sort order. Must be `SortExpr`s.
+    ///
+    /// DataFusion may take advantage of this ordering to omit sorts
+    /// or use more efficient algorithms. Currently sortedness must be
+    /// provided if it is known by some external mechanism, but may in
+    /// the future be automatically determined, for example using
+    /// parquet metadata.
+    ///
+    /// See <https://github.com/apache/arrow-datafusion/issues/4177>
+    pub file_sort_order: Option<Vec<Expr>>,

Review Comment:
   I would prefer option (1) as well.
   In the future, we can implement a new SortPreservingParquetExec that reads multiple pre-sorted parquet files in a partition and uses an efficient merge (the merge step of merge sort) to preserve the sort order. Then, at physical planning time, if the parent plan needs the sort information, the physical planner can choose SortPreservingParquetExec instead of the normal ParquetExec.
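
To make the idea concrete, here is a minimal sketch in plain Rust (standard library only, not DataFusion code). It assumes each file is modeled as an in-memory `Vec<i64>` that is already sorted ascending. `ScanKind`, `choose_scan`, and `merge_sorted` are hypothetical names used only for illustration; they are not part of this PR or of any existing DataFusion API. A real operator would merge streams of record batches rather than vectors of integers.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical stand-ins for the two scan operators discussed above.
#[derive(Debug, PartialEq)]
enum ScanKind {
    /// Like the normal ParquetExec: no ordering guarantee across files.
    Plain,
    /// Like the proposed SortPreservingParquetExec: merges sorted files.
    SortPreserving,
}

/// Hypothetical planner rule: only pay for the merge when the parent plan
/// needs the ordering and the files are declared sorted.
fn choose_scan(parent_requires_order: bool, files_are_sorted: bool) -> ScanKind {
    if parent_requires_order && files_are_sorted {
        ScanKind::SortPreserving
    } else {
        ScanKind::Plain
    }
}

/// K-way merge of inputs that are each already sorted ascending.
/// The output is sorted without re-sorting any input.
fn merge_sorted(files: Vec<Vec<i64>>) -> Vec<i64> {
    let total: usize = files.iter().map(Vec::len).sum();
    let mut out = Vec::with_capacity(total);

    // Min-heap of (next value, file index, row index); `Reverse` turns the
    // standard max-heap into a min-heap.
    let mut heap = BinaryHeap::new();
    for (file_idx, file) in files.iter().enumerate() {
        if let Some(&first) = file.first() {
            heap.push(Reverse((first, file_idx, 0usize)));
        }
    }

    // Repeatedly emit the smallest pending value, then refill the heap
    // from the file that value came from.
    while let Some(Reverse((value, file_idx, row_idx))) = heap.pop() {
        out.push(value);
        if let Some(&next) = files[file_idx].get(row_idx + 1) {
            heap.push(Reverse((next, file_idx, row_idx + 1)));
        }
    }
    out
}
```

With k files and n rows in total, this merge costs O(n log k) comparisons, whereas concatenating the files and sorting from scratch costs O(n log n). That difference is the saving the sort-preserving operator would aim for when the parent plan needs ordered input.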


