Posted to github@arrow.apache.org by "berkaysynnada (via GitHub)" <gi...@apache.org> on 2023/05/27 09:31:32 UTC

[GitHub] [arrow-datafusion] berkaysynnada opened a new pull request, #6469: Support Defining Ordering Equivalence at the Source

berkaysynnada opened a new pull request, #6469:
URL: https://github.com/apache/arrow-datafusion/pull/6469

   # Which issue does this PR close?
   
   Closes #6468.
   
   # Rationale for this change
   
   If we want to assign two or more orderings to a table with `CREATE EXTERNAL TABLE`, we currently cannot express them with multiple `WITH ORDER` clauses. With this PR, we can create such tables like this:
   ```
   CREATE EXTERNAL TABLE multiple_ordered_table (
   ...
   )
   STORED AS CSV
   WITH HEADER ROW
   WITH ORDER (a ASC, b ASC)
   WITH ORDER (c ASC)
   LOCATION '...';
   ```
   
   # What changes are included in this PR?
   
   The parser can now handle multiple `WITH ORDER` clauses. In the codebase, output orderings were previously typed as `Option<Vec<PhysicalSortExpr>>`; they are refactored to `Vec<Vec<PhysicalSortExpr>>`.
   
   # Are these changes tested?
   
   Yes, related sqllogictest (slt) tests are added, showing the use of the feature in query plans.
   
   # Are there any user-facing changes?
   
   The `ListingOptions` API is changed from
   `pub fn with_file_sort_order(mut self, file_sort_order: Option<Vec<Expr>>) -> Self` to
   `pub fn with_file_sort_order(mut self, file_sort_order: Vec<Vec<Expr>>) -> Self`. This change was necessary so that all sort orderings can be given at once.
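
   For callers that still hold a value in the old `Option<Vec<Expr>>` shape, the conversion to the new parameter type might look like the sketch below. It is only an illustration: `Expr` here is a stand-in struct so that the snippet compiles on its own, and `migrate_file_sort_order` is a hypothetical helper name, not part of DataFusion.

   ```rust
   // Stand-in for the real expression type; an assumption made so that
   // this sketch is self-contained.
   #[derive(Debug, Clone, PartialEq)]
   pub struct Expr(pub String);

   /// Convert a value written for the old `Option<Vec<Expr>>` parameter
   /// into the new `Vec<Vec<Expr>>` shape.
   ///
   /// `None` (no known ordering) becomes an empty outer vector, and
   /// `Some(order)` becomes an outer vector holding that single ordering.
   pub fn migrate_file_sort_order(old: Option<Vec<Expr>>) -> Vec<Vec<Expr>> {
       // An `Option` iterates over zero or one items, so collecting it
       // yields either `vec![]` or `vec![order]`.
       old.into_iter().collect()
   }
   ```

   Under this reading, an empty outer vector plays the role that `None` played before, so no separate "unset" state is needed.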


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #6469: Support Defining Ordering Equivalence at the Source

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #6469:
URL: https://github.com/apache/arrow-datafusion/pull/6469#issuecomment-1568841192

   I think everything looks good to me here -- thanks again @berkaysynnada and @mustafasrepo 




[GitHub] [arrow-datafusion] alamb merged pull request #6469: Support Defining Ordering Equivalence at the Source

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6469:
URL: https://github.com/apache/arrow-datafusion/pull/6469




[GitHub] [arrow-datafusion] berkaysynnada commented on a diff in pull request #6469: Support Defining Ordering Equivalence at the Source

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on code in PR #6469:
URL: https://github.com/apache/arrow-datafusion/pull/6469#discussion_r1210622599


##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -418,7 +421,7 @@ impl ListingOptions {
     ///
     /// assert_eq!(listing_options.file_sort_order, file_sort_order);
     /// ```
-    pub fn with_file_sort_order(mut self, file_sort_order: Option<Vec<Expr>>) -> Self {
+    pub fn with_file_sort_order(mut self, file_sort_order: Vec<Vec<Expr>>) -> Self {

Review Comment:
   We also considered such a design; you can check our observations [here](https://github.com/synnada-ai/arrow-datafusion/pull/107#pullrequestreview-1443036721). In summary, that API cannot provide a way to reset the sort orders, and the user needs to call it multiple times to set multiple orderings (which is not the case for other methods).
   
   We have replaced `Vec<Vec<OrderByExpr>>` and `Vec<Vec<PhysicalSortExpr>>` with `Vec<LexOrdering>`, where each inner vector describes one lexicographical ordering and the outer vector collects the alternative lexicographical orderings.
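
   As a rough illustration of how the alias reads, here is a minimal sketch with stand-in types. `SortKey` and `satisfies` are hypothetical names introduced only for this example, and the prefix check is a simplification, not the logic DataFusion actually uses.

   ```rust
   /// Stand-in for a single sort expression (column plus direction).
   #[derive(Debug, Clone, PartialEq)]
   pub struct SortKey {
       pub column: &'static str,
       pub ascending: bool,
   }

   /// One lexicographical ordering: rows are compared by the first key,
   /// ties are broken by the second key, and so on.
   pub type LexOrdering = Vec<SortKey>;

   /// Simplified check: a requirement is met if it is a prefix of at
   /// least one of the declared orderings.
   pub fn satisfies(declared: &[LexOrdering], required: &[SortKey]) -> bool {
       declared
           .iter()
           .any(|ordering| ordering.starts_with(required))
   }
   ```

   With the orderings `(a ASC, b ASC)` and `(c ASC)` declared, a requirement on `a ASC` or on `c ASC` passes this check, while a requirement on `b ASC` alone does not.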





[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6469: Support Defining Ordering Equivalence at the Source

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6469:
URL: https://github.com/apache/arrow-datafusion/pull/6469#discussion_r1210334068


##########
datafusion/core/src/datasource/listing/table.rs:
##########
@@ -418,7 +421,7 @@ impl ListingOptions {
     ///
     /// assert_eq!(listing_options.file_sort_order, file_sort_order);
     /// ```
-    pub fn with_file_sort_order(mut self, file_sort_order: Option<Vec<Expr>>) -> Self {
+    pub fn with_file_sort_order(mut self, file_sort_order: Vec<Vec<Expr>>) -> Self {

Review Comment:
   🤔  I wonder if this API would be simpler to use if it were something like
   
   ```
       /// add a new equivalent sort order
       pub fn with_file_sort_order(mut self, file_sort_order: Vec<Expr>) -> Self {
         self.file_sort_order.push(file_sort_order);
         self
       }
   ```
   
   I realize that would be different too 🤔 
   
   I am just thinking that, in general, `Vec<Vec<...>>` makes for a harder-to-understand API, because the type names don't help you and you have to figure out what the inner `Vec` represents.
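
   To make the trade-off concrete, the two shapes can be put side by side in a minimal sketch. The types are stand-ins: `Options` is not the real `ListingOptions`, and `with_additional_sort_order` is a hypothetical name used here only to keep the append-style method distinct from the replace-style one.

   ```rust
   /// Stand-in for the real expression type.
   #[derive(Debug, Clone, PartialEq)]
   pub struct Expr(pub &'static str);

   /// Stand-in for the options struct; only the field under discussion.
   #[derive(Debug, Default)]
   pub struct Options {
       pub file_sort_order: Vec<Vec<Expr>>,
   }

   impl Options {
       /// Replace-style setter: takes every ordering at once and
       /// overwrites whatever was set before.
       pub fn with_file_sort_order(mut self, file_sort_order: Vec<Vec<Expr>>) -> Self {
           self.file_sort_order = file_sort_order;
           self
       }

       /// Append-style setter: adds one more equivalent ordering and
       /// keeps the ones that are already present.
       pub fn with_additional_sort_order(mut self, order: Vec<Expr>) -> Self {
           self.file_sort_order.push(order);
           self
       }
   }
   ```

   The append style needs one call per ordering and cannot clear orderings that were set earlier, whereas the replace style can reset them by passing an empty vector.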



##########
datafusion/sql/src/parser.rs:
##########
@@ -585,7 +588,7 @@ impl<'a> DFParser<'a> {
             delimiter: Option<char>,
             file_compression_type: Option<CompressionTypeVariant>,
             table_partition_cols: Option<Vec<String>>,
-            order_exprs: Option<Vec<OrderByExpr>>,
+            order_exprs: Vec<Vec<OrderByExpr>>,

Review Comment:
   🤔  I wonder why some changes in this PR use the newly introduced `LexOrdering` while other places use `Vec<OrderByExpr>`? It would probably make the code easier to follow if one convention were used throughout.





[GitHub] [arrow-datafusion] berkaysynnada commented on a diff in pull request #6469: Support Defining Ordering Equivalence at the Source

Posted by "berkaysynnada (via GitHub)" <gi...@apache.org>.
berkaysynnada commented on code in PR #6469:
URL: https://github.com/apache/arrow-datafusion/pull/6469#discussion_r1210628201


##########
datafusion/sql/src/parser.rs:
##########
@@ -585,7 +588,7 @@ impl<'a> DFParser<'a> {
             delimiter: Option<char>,
             file_compression_type: Option<CompressionTypeVariant>,
             table_partition_cols: Option<Vec<String>>,
-            order_exprs: Option<Vec<OrderByExpr>>,
+            order_exprs: Vec<Vec<OrderByExpr>>,

Review Comment:
   I have replaced all occurrences of `Vec<Vec<OrderByExpr>>` with `Vec<LexOrdering>`. Thanks for the review.


