You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2020/11/27 16:07:49 UTC

[GitHub] [arrow] andygrove opened a new pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

andygrove opened a new pull request #8785:
URL: https://github.com/apache/arrow/pull/8785


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#issuecomment-734914984


   Got it. I understand now.
   
   I agree that there is a bug: the push down should not push down like it is doing now. 
   
   I also think that the filter should not be there, as it is an extra filter equivalent to "1 = 1".
   
   The former is definitely more important to fix than the second :) I will try to address it this weekend.
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#issuecomment-735405468


   I rebased after merging the join test and then re-enabled the test that was previously failing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#issuecomment-734910372


   With this syntax, the WHERE clause can contain both JOIN conditions and regular filter conditions. For example:
   
   ```sql
   FROM
       customer,
       orders,
       lineitem
   WHERE
       c_mktsegment = 'BUILDING'
       AND c_custkey = o_custkey
       AND l_orderkey = o_orderkey
       AND o_orderdate < date '1995-03-15'
       AND l_shipdate > date '1995-03-15'
   ```
   
   I think there are two options for addressing this.
   
   1. Have the SQL planner remove the join expressions from the filter expression
   2. Modify the filter push down logic to prevent filters being pushed through joins when they reference both join inputs
   
   I'm not sure which approach is best.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao commented on pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao commented on pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#issuecomment-734907835


   Thanks @andygrove. 
   
   I am a bit confused: why is there a filter on the plan? I thought that the logical plan of 
   
   ```
   SELECT t1_id, t1_name, t2_name FROM t1, t2 WHERE t2_id = t1_id ORDER BY t1_id
   ```
   
   would not contain a filter, as `=` in `WHERE t2_id = t1_id` represents "equal under hashing on a join" (that is already represented on the `Join: t1_id = t2_id` bit of the plan.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#issuecomment-735225696


   > @jorgecarleitao This code is basically working but I think it has exposed a bug in the filter push down logic where it tries to push a filter down through the join where the expression is left_column = right_column so it is not valid to push this through the join because it depends on the output from the join.
   
   I think another way to think about this is that "the expression refers to two relations and thus can only be evaluated when they have already been joined"
   
   One way to represent this is to track which relation each `Column` reference is to (aka with multiple tables in a SQL query, a single column name is insufficient to identify what column is being referred to).
   
   Until we add "to what what relation is this column" into the column reference itself, this kind of plan will be tricky to pull off


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] alamb commented on a change in pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#discussion_r532035179



##########
File path: rust/datafusion/src/sql/planner.rs
##########
@@ -322,10 +317,80 @@ impl<'a, S: SchemaProvider> SqlToRel<'a, S> {
             ));
         }
 
-        let plan = self.plan_tables_with_joins(&select.from)?;
+        let plans = self.plan_from_tables(&select.from)?;
 
-        // filter (also known as selection) first
-        let plan = self.filter(&plan, &select.selection)?;
+        let plan = match &select.selection {
+            Some(predicate_expr) => {
+                // build join schema
+                let mut fields = vec![];
+                for plan in &plans {
+                    let schema = plan.schema();
+                    for field in schema.fields() {
+                        fields.push(field.clone());
+                    }
+                }
+                let join_schema = Schema::new(fields);

Review comment:
       I am not sure how this code handles duplicated column names in the two inputs. TPCH helpfully prefixes id columns with the relation (e.g. `l_orderkey` and `o_orderkey`) but in many real world inputs both tables would probably have a column named `orderkey`)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#discussion_r532054884



##########
File path: rust/datafusion/src/sql/planner.rs
##########
@@ -322,10 +317,80 @@ impl<'a, S: SchemaProvider> SqlToRel<'a, S> {
             ));
         }
 
-        let plan = self.plan_tables_with_joins(&select.from)?;
+        let plans = self.plan_from_tables(&select.from)?;
 
-        // filter (also known as selection) first
-        let plan = self.filter(&plan, &select.selection)?;
+        let plan = match &select.selection {
+            Some(predicate_expr) => {
+                // build join schema
+                let mut fields = vec![];
+                for plan in &plans {
+                    let schema = plan.schema();
+                    for field in schema.fields() {
+                        fields.push(field.clone());
+                    }
+                }
+                let join_schema = Schema::new(fields);

Review comment:
       Thanks @alamb. I will add the check for duplicate column names and merge this once the predicate push-down bug is resolved.
   
   After this, I plan on working on table/relation aliases and qualified column names in queries [1].
   
   https://issues.apache.org/jira/browse/ARROW-10732




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] github-actions[bot] commented on pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
github-actions[bot] commented on pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#issuecomment-734900524


   https://issues.apache.org/jira/browse/ARROW-10729


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove edited a comment on pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax

Posted by GitBox <gi...@apache.org>.
andygrove edited a comment on pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#issuecomment-735405468


   I rebased after merging the join predicate push down fix and then re-enabled the test that was previously failing.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#issuecomment-734952829


   @jorgecarleitao I made some improvements to remove any join expressions from the filter which helped for some cases but there is still one failing test related to filter push down. No rush to look at this but just wanted to update you.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
andygrove commented on pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#issuecomment-734903494


   @jorgecarleitao This code is basically working but I think it has exposed a bug in the filter push down logic where it tries to push a filter down through the join where the expression is `left_column = right_column` so it is not valid to push this through the join because it depends on the output from the join.
   
   ```
   ---- equijoin_implicit_syntax stdout ----
   thread 'equijoin_implicit_syntax' panicked at 'Creating physical plan for 'SELECT t1_id, t1_name, t2_name FROM t1, t2 WHERE t1_id = t2_id ORDER BY t1_id': Sort: #t1_id ASC NULLS FIRST
     Projection: #t1_id, #t1_name, #t2_name
       Join: t1_id = t2_id
         Filter: #t1_id Eq #t2_id
           TableScan: t1 projection=Some([0, 1])
         Filter: #t1_id Eq #t2_id
           TableScan: t2 projection=Some([0, 1]): ArrowError(InvalidArgumentError("Unable to get field named \"t2_id\". Valid fields: [\"t1_id\", \"t1_name\"]"))', datafusion/tests/sql.rs:1243:48
   ```


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] jorgecarleitao edited a comment on pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
jorgecarleitao edited a comment on pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#issuecomment-734907835


   Thanks @andygrove. 
   
   I am a bit confused: why is there a filter on the plan? I thought that the logical plan of 
   
   ```
   SELECT t1_id, t1_name, t2_name FROM t1, t2 WHERE t2_id = t1_id ORDER BY t1_id
   ```
   
   would not contain a filter, as `=` in `WHERE t2_id = t1_id` represents "equal under hashing on a join" (that is already represented on the `Join: t1_id = t2_id` bit of the plan).


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove closed pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax

Posted by GitBox <gi...@apache.org>.
andygrove closed pull request #8785:
URL: https://github.com/apache/arrow/pull/8785


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow] andygrove commented on a change in pull request #8785: ARROW-10729: [Rust] [DataFusion] Add SQL support for JOIN using implicit syntax [WIP]

Posted by GitBox <gi...@apache.org>.
andygrove commented on a change in pull request #8785:
URL: https://github.com/apache/arrow/pull/8785#discussion_r532057124



##########
File path: rust/datafusion/src/sql/planner.rs
##########
@@ -322,10 +317,80 @@ impl<'a, S: SchemaProvider> SqlToRel<'a, S> {
             ));
         }
 
-        let plan = self.plan_tables_with_joins(&select.from)?;
+        let plans = self.plan_from_tables(&select.from)?;
 
-        // filter (also known as selection) first
-        let plan = self.filter(&plan, &select.selection)?;
+        let plan = match &select.selection {
+            Some(predicate_expr) => {
+                // build join schema
+                let mut fields = vec![];
+                for plan in &plans {
+                    let schema = plan.schema();
+                    for field in schema.fields() {
+                        fields.push(field.clone());
+                    }
+                }
+                let join_schema = Schema::new(fields);

Review comment:
       On second thoughts, I have ignored the failing test and added a link to the bug [1].
   
   I have also added the check for duplicate column names after a join.
   
   [1] https://issues.apache.org/jira/browse/ARROW-10760




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org