You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "metesynnada (via GitHub)" <gi...@apache.org> on 2023/03/16 07:37:27 UTC

[GitHub] [arrow-datafusion] metesynnada opened a new pull request, #5618: Feature/ordered by support

metesynnada opened a new pull request, #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618

   # Which issue does this PR close?
   
   Closes #5372.
   
   # Rationale for this change
   
   When working with files, specifying column orders can be highly effective. By leveraging pre-sorted files, we can avoid the need to materialize and sort the entire dataset, resulting in significant performance gains.
   
   # What changes are included in this PR?
   
   - `WITH ORDER` support on the parser.
   - Necessary adjustments on `CreateExternalTable` (parser, logical plan, proto).
   - `ListingTable` order support from SQL queries
   
   # Are these changes tested?
   
   Yes
   
   # Are there any user-facing changes?
   
   Now,
   
   ```sql
   CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV WITH ORDER (c1 ASC) LOCATION 'foo.csv'
   ```
   
   is supported.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#issuecomment-1473819515

   > [@alamb](https://github.com/alamb) I think you have use cases for parquet, does this enough to cover yours as well?
   
   Yes thank you


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#discussion_r1141983461


##########
datafusion/core/tests/sql/order.rs:
##########
@@ -67,6 +68,25 @@ async fn create_external_table_with_order() -> Result<()> {
 async fn create_external_table_with_ddl_ordered_non_cols() -> Result<()> {
     let ctx = SessionContext::new();
     let sql = "CREATE EXTERNAL TABLE dt (a_id integer, a_str string, a_bool boolean) STORED AS CSV WITH ORDER (a ASC) LOCATION 'file://path/to/table';";
-    assert!(ctx.state().create_logical_plan(sql).await.is_err());
+    match ctx.state().create_logical_plan(sql).await {
+        Ok(_) => panic!("Expecting error."),
+        Err(e) => {
+            assert_eq!(e.to_string(), "Error during planning: Column a is not in schema")
+        }
+    }

Review Comment:
   FWI in the future you can write this more concisely with `unwrap_err`:
   
   ```rust
       let e = ctx.state().create_logical_plan(sql).await.unwrap_err();
       assert_eq!(e.to_string(), "Error during planning: Column a is not in schema");
   ```



##########
docs/source/user-guide/sql/ddl.md:
##########
@@ -98,6 +98,12 @@ WITH ORDER (sort_expression1 [ASC | DESC] [NULLS { FIRST | LAST }]
          [, sort_expression2 [ASC | DESC] [NULLS { FIRST | LAST }] ...])
 ```
 
+#### Cautions When Using the WITH ORDER Clause
+
+- It's important to understand that using the `WITH ORDER` clause in the `CREATE EXTERNAL TABLE` statement only specifies the order in which the data should be read from the external file. If the data in the file is not already sorted according to the specified order, then the results may not be correct.

Review Comment:
   👍 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#discussion_r1138841500


##########
datafusion/sql/src/parser.rs:
##########
@@ -253,6 +256,49 @@ impl<'a> DFParser<'a> {
         Ok(partitions)
     }
 
+    /// Parse the ordering clause of a `CREATE EXTERNAL TABLE` SQL statement
+    pub fn parse_order_by_exprs(&mut self) -> Result<Vec<OrderByExpr>, ParserError> {
+        let mut values = vec![];
+        self.parser.expect_token(&Token::LParen)?;
+        loop {
+            values.push(self.parse_order_by_expr()?);
+            if !self.parser.consume_token(&Token::Comma) {
+                self.parser.expect_token(&Token::RParen)?;
+                return Ok(values);
+            }
+        }
+    }
+
+    /// Parse an ORDER BY sub-expression optionally followed by ASC or DESC.
+    pub fn parse_order_by_expr(&mut self) -> Result<OrderByExpr, ParserError> {
+        let expr = self.parser.parse_expr()?;
+
+        let asc = if self.parser.parse_keyword(Keyword::ASC) {
+            Some(true)
+        } else if self.parser.parse_keyword(Keyword::DESC) {
+            Some(false)
+        } else {
+            None

Review Comment:
   None is no ordering?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#discussion_r1138917137


##########
datafusion/sql/src/statement.rs:
##########
@@ -425,6 +425,34 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         }))
     }
 
+    fn build_order_by(
+        &self,
+        order_exprs: Vec<OrderByExpr>,
+        schema: &DFSchemaRef,
+    ) -> Result<Vec<datafusion_expr::Expr>> {
+        // Convert each OrderByExpr to a SortExpr:
+        let result = order_exprs
+            .into_iter()
+            .map(|e| self.order_by_to_sort_expr(e, schema))
+            .collect::<Result<Vec<_>>>()?;
+
+        // Verify that columns of all SortExprs exist in the schema:
+        for expr in result.iter() {

Review Comment:
   Imagine the `WITH ORDER (a + b ASC, c DESC)` query. First, we get the expressions, `a@ + b@` and `c@`, then for each expression, we iterate the columns to check if they are in the schema.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#discussion_r1138847512


##########
datafusion/sql/src/parser.rs:
##########
@@ -713,6 +782,79 @@ mod tests {
         });
         expect_parse_ok(sql, expected)?;
 
+        // Ordered Col
+        let sql = "CREATE EXTERNAL TABLE t(c1 int, c2 int) STORED AS CSV WITH ORDER (c1 ASC, c2 DESC NULLS FIRST) LOCATION 'foo.csv'";

Review Comment:
   only `desc nulls first` covered.
   Do you plan to cover also
   `asc nulls first`
   `asc nulls last`
   `desc null last`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#discussion_r1138911301


##########
datafusion/sql/src/parser.rs:
##########
@@ -253,6 +256,49 @@ impl<'a> DFParser<'a> {
         Ok(partitions)
     }
 
+    /// Parse the ordering clause of a `CREATE EXTERNAL TABLE` SQL statement
+    pub fn parse_order_by_exprs(&mut self) -> Result<Vec<OrderByExpr>, ParserError> {
+        let mut values = vec![];
+        self.parser.expect_token(&Token::LParen)?;
+        loop {
+            values.push(self.parse_order_by_expr()?);
+            if !self.parser.consume_token(&Token::Comma) {
+                self.parser.expect_token(&Token::RParen)?;
+                return Ok(values);
+            }
+        }
+    }
+
+    /// Parse an ORDER BY sub-expression optionally followed by ASC or DESC.
+    pub fn parse_order_by_expr(&mut self) -> Result<OrderByExpr, ParserError> {
+        let expr = self.parser.parse_expr()?;
+
+        let asc = if self.parser.parse_keyword(Keyword::ASC) {
+            Some(true)
+        } else if self.parser.parse_keyword(Keyword::DESC) {
+            Some(false)
+        } else {
+            None

Review Comment:
   Actually None means ASC by default, I handle it after returning None here since there is no order information. 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#discussion_r1138837330


##########
datafusion/sql/src/statement.rs:
##########
@@ -425,6 +425,34 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         }))
     }
 
+    fn build_order_by(
+        &self,
+        order_exprs: Vec<OrderByExpr>,
+        schema: &DFSchemaRef,
+    ) -> Result<Vec<datafusion_expr::Expr>> {
+        // Convert each OrderByExpr to a SortExpr:
+        let result = order_exprs
+            .into_iter()
+            .map(|e| self.order_by_to_sort_expr(e, schema))
+            .collect::<Result<Vec<_>>>()?;
+
+        // Verify that columns of all SortExprs exist in the schema:
+        for expr in result.iter() {

Review Comment:
   nit: do we really need to loop twice here? 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#issuecomment-1476065964

   I'll plan to merge when CI is clean


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#discussion_r1138908546


##########
datafusion/sql/src/parser.rs:
##########
@@ -713,6 +782,79 @@ mod tests {
         });
         expect_parse_ok(sql, expected)?;
 
+        // Ordered Col
+        let sql = "CREATE EXTERNAL TABLE t(c1 int, c2 int) STORED AS CSV WITH ORDER (c1 ASC, c2 DESC NULLS FIRST) LOCATION 'foo.csv'";

Review Comment:
   I can add tests, ofc.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#discussion_r1140119378


##########
datafusion/core/tests/sql/order.rs:
##########
@@ -39,3 +42,31 @@ async fn sort_with_lots_of_repetition_values() -> Result<()> {
     }
     Ok(())
 }
+
+#[tokio::test]
+async fn create_external_table_with_order() -> Result<()> {
+    let ctx = SessionContext::new();
+    let sql = "CREATE EXTERNAL TABLE dt (a_id integer, a_str string, a_bool boolean) STORED AS CSV WITH ORDER (a_id ASC) LOCATION 'file://path/to/table';";
+    if let LogicalPlan::CreateExternalTable(cmd) =
+        ctx.state().create_logical_plan(sql).await?
+    {
+        let listing_table_factory = Arc::new(ListingTableFactory::new());
+        let table_dyn = listing_table_factory.create(&ctx.state(), &cmd).await?;
+        let table = table_dyn.as_any().downcast_ref::<ListingTable>().unwrap();
+        assert_eq!(
+            &cmd.order_exprs,

Review Comment:
   I would recommend explicitly asserting the content of the order by exprs here -- as written if `cmd.order_exprs` were empty the test would still pass, I think.



##########
datafusion/sql/src/parser.rs:
##########
@@ -713,11 +782,144 @@ mod tests {
         });
         expect_parse_ok(sql, expected)?;
 
+        // Ordered Col
+        let sql = "CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV WITH ORDER (c1) LOCATION 'foo.csv'";
+        let display = None;
+        let expected = Statement::CreateExternalTable(CreateExternalTable {
+            name: "t".into(),
+            columns: vec![make_column_def("c1", DataType::Int(display))],
+            file_type: "CSV".to_string(),
+            has_header: false,
+            delimiter: ',',
+            location: "foo.csv".into(),
+            table_partition_cols: vec![],
+            order_exprs: vec![OrderByExpr {
+                expr: Identifier(Ident {
+                    value: "c1".to_owned(),
+                    quote_style: None,
+                }),
+                asc: None,
+                nulls_first: None,
+            }],
+            if_not_exists: false,
+            file_compression_type: UNCOMPRESSED,
+            options: HashMap::new(),
+        });
+        expect_parse_ok(sql, expected)?;
+
+        // Ordered Col
+        let sql = "CREATE EXTERNAL TABLE t(c1 int) STORED AS CSV WITH ORDER (c1 NULLS FIRST) LOCATION 'foo.csv'";
+        let display = None;
+        let expected = Statement::CreateExternalTable(CreateExternalTable {
+            name: "t".into(),
+            columns: vec![make_column_def("c1", DataType::Int(display))],
+            file_type: "CSV".to_string(),
+            has_header: false,
+            delimiter: ',',
+            location: "foo.csv".into(),
+            table_partition_cols: vec![],
+            order_exprs: vec![OrderByExpr {
+                expr: Identifier(Ident {
+                    value: "c1".to_owned(),
+                    quote_style: None,
+                }),
+                asc: None,
+                nulls_first: Some(true),
+            }],
+            if_not_exists: false,
+            file_compression_type: UNCOMPRESSED,
+            options: HashMap::new(),
+        });
+        expect_parse_ok(sql, expected)?;
+
+        // Ordered Col
+        let sql = "CREATE EXTERNAL TABLE t(c1 int, c2 int) STORED AS CSV WITH ORDER (c1 ASC, c2 DESC NULLS FIRST) LOCATION 'foo.csv'";
+        let display = None;
+        let expected = Statement::CreateExternalTable(CreateExternalTable {
+            name: "t".into(),
+            columns: vec![
+                make_column_def("c1", DataType::Int(display)),
+                make_column_def("c2", DataType::Int(display)),
+            ],
+            file_type: "CSV".to_string(),
+            has_header: false,
+            delimiter: ',',
+            location: "foo.csv".into(),
+            table_partition_cols: vec![],
+            order_exprs: vec![
+                OrderByExpr {
+                    expr: Identifier(Ident {
+                        value: "c1".to_owned(),
+                        quote_style: None,
+                    }),
+                    asc: Some(true),
+                    nulls_first: None,
+                },
+                OrderByExpr {
+                    expr: Identifier(Ident {
+                        value: "c2".to_owned(),
+                        quote_style: None,
+                    }),
+                    asc: Some(false),
+                    nulls_first: Some(true),

Review Comment:
   I didn't see any test for `nulls_first: Some(false)`



##########
datafusion/expr/src/logical_plan/plan.rs:
##########
@@ -1632,6 +1632,8 @@ pub struct CreateExternalTable {
     pub if_not_exists: bool,
     /// SQL used to create the table, if available
     pub definition: Option<String>,
+    /// Order expressions

Review Comment:
   ```suggestion
       /// Order expressions supplied by user.
       ///
       /// NOTE: DataFusion will assume the data is sorted in this way
       /// and does not verify it. Thus if the data is not actually sorted as
       /// described, WRONG ANSERS can be produced. 
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] comphead commented on a diff in pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "comphead (via GitHub)" <gi...@apache.org>.
comphead commented on code in PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#discussion_r1139112318


##########
datafusion/sql/src/statement.rs:
##########
@@ -425,6 +425,34 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
         }))
     }
 
+    fn build_order_by(
+        &self,
+        order_exprs: Vec<OrderByExpr>,
+        schema: &DFSchemaRef,
+    ) -> Result<Vec<datafusion_expr::Expr>> {
+        // Convert each OrderByExpr to a SortExpr:
+        let result = order_exprs
+            .into_iter()
+            .map(|e| self.order_by_to_sort_expr(e, schema))
+            .collect::<Result<Vec<_>>>()?;
+
+        // Verify that columns of all SortExprs exist in the schema:
+        for expr in result.iter() {

Review Comment:
   Right, I was thinking about we can iterate just once and return resuls if everything is ok or error if something is wrong https://doc.rust-lang.org/rust-by-example/error/iter_result.html#fail-the-entire-operation-with-collect



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#issuecomment-1473671391

   @alamb I think you have use cases for parquet, does this enough to cover yours as well?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#issuecomment-1475968441

   @alamb I think this is ready to merge unless you have additional suggestions on the document.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] metesynnada commented on a diff in pull request #5618: WITH ORDER support on CREATE EXTERNAL TABLE

Posted by "metesynnada (via GitHub)" <gi...@apache.org>.
metesynnada commented on code in PR #5618:
URL: https://github.com/apache/arrow-datafusion/pull/5618#discussion_r1138911301


##########
datafusion/sql/src/parser.rs:
##########
@@ -253,6 +256,49 @@ impl<'a> DFParser<'a> {
         Ok(partitions)
     }
 
+    /// Parse the ordering clause of a `CREATE EXTERNAL TABLE` SQL statement
+    pub fn parse_order_by_exprs(&mut self) -> Result<Vec<OrderByExpr>, ParserError> {
+        let mut values = vec![];
+        self.parser.expect_token(&Token::LParen)?;
+        loop {
+            values.push(self.parse_order_by_expr()?);
+            if !self.parser.consume_token(&Token::Comma) {
+                self.parser.expect_token(&Token::RParen)?;
+                return Ok(values);
+            }
+        }
+    }
+
+    /// Parse an ORDER BY sub-expression optionally followed by ASC or DESC.
+    pub fn parse_order_by_expr(&mut self) -> Result<OrderByExpr, ParserError> {
+        let expr = self.parser.parse_expr()?;
+
+        let asc = if self.parser.parse_keyword(Keyword::ASC) {
+            Some(true)
+        } else if self.parser.parse_keyword(Keyword::DESC) {
+            Some(false)
+        } else {
+            None

Review Comment:
   It was the default, I so kept it. It means there is no order, identical to `vec![]`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org