You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/15 08:20:45 UTC

[GitHub] [arrow-datafusion] xudong963 opened a new pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

xudong963 opened a new pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566


   # Which issue does this PR close?
   
   <!--
   We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
   -->
   
   Closes #1293 
   
    # Rationale for this change
   <!--
    Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
    Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.  
   -->
   
   # What changes are included in this PR?
   <!--
   There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
   -->
   
   # Are there any user-facing changes?
   <!--
   If there are user-facing changes then we may require documentation to be updated before approving the PR.
   -->
   
   <!--
   If there are any breaking changes to public APIs, please add the `api change` label.
   -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787711438



##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -214,8 +220,14 @@ pub struct Extension {
     pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>,
 }
 
+impl PartialEq for Extension {
+    fn eq(&self, _other: &Self) -> bool {
+        todo!()

Review comment:
       > For the future, I think you can change
   `PartialEq` isn't auto-trait,  so it can't be used as additional trait in a trait object
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1015791735


   Also, @xudong963  thank you for taking on this task. Really neat to see


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013734364


   Thank you @xudong963  -- this looks very cool. I'll try and review it carefully tomorrow πŸ‘ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1014105703


   > @xudong963 This might be a bit nitpicky but this will enter an infinite loop if multiple cross joins are required for a plan. As an example, the following explain select will never finish:
   > 
   > ```
   > 
   > ❯ create table part as select 1 as p_partkey, 2 as p_partprice, 3 as p_partprice
   > _taxadded;
   > 0 rows in set. Query took 0.036 seconds.
   > ❯ create table lineitem as select 1 as l_itemkey;
   > 0 rows in set. Query took 0.001 seconds.
   > ❯ create table supplier as select 1 as s_suppkey;
   > 0 rows in set.0 rows in set. Query took 0.001 seconds. Query took 0.001 seconds.
   > ❯ explain select * from part,supplier,lineitem where p_partprice=p_partprice_tax
   > added;
   > ```
   > 
   > In practice I'm not sure how much of a deal this is as it requires multiple cross joins with a filter that operates solely on a single table.
   
   Good catch, thanks @pjmore. I think we should process the case because we can't guarantee what kind of queries the user will write 😁


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786529381



##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                         }
                     }
                     if join_keys.is_empty() {
-                        left =
-                            LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+                        // check [1..idx] if contains current plan to avoid infinite loop
+                        if mut_plans[1..idx].contains(&mut_plans[idx])

Review comment:
       Here, we use `contains()` which requests `Logical Plan` implement `PartialEq` trait  @houqp 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787711438



##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -214,8 +220,14 @@ pub struct Extension {
     pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>,
 }
 
+impl PartialEq for Extension {
+    fn eq(&self, _other: &Self) -> bool {
+        todo!()

Review comment:
       > For the future, I think you can change
   
   `PartialEq` isn't auto-trait,  so it can't be used as an additional trait in a trait object
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1016398895


   > Yes, this is exactly what I have in mind. 
   
   Filed  https://github.com/apache/arrow-datafusion/issues/1612 to track the idea. It is a very good one @houqp πŸ‘ 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787756794



##########
File path: datafusion/src/sql/planner.rs
##########
@@ -714,33 +713,79 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 extract_possible_join_keys(&filter_expr, &mut possible_join_keys)?;
 
                 let mut all_join_keys = HashSet::new();
-                let mut left = plans[0].clone();
-                for right in plans.iter().skip(1) {
-                    let left_schema = left.schema();
-                    let right_schema = right.schema();
+
+                let mut plans = plans.into_iter();
+                let mut left = plans.next().unwrap(); // have at least one plan
+
+                // List of the plans that have not yet been joined
+                let mut remaining_plans: Vec<Option<LogicalPlan>> =
+                    plans.into_iter().map(Some).collect();
+
+                // Take from the list of remaining plans,
+                loop {
                     let mut join_keys = vec![];
-                    for (l, r) in &possible_join_keys {
-                        if left_schema.field_from_column(l).is_ok()
-                            && right_schema.field_from_column(r).is_ok()
-                        {
-                            join_keys.push((l.clone(), r.clone()));
-                        } else if left_schema.field_from_column(r).is_ok()
-                            && right_schema.field_from_column(l).is_ok()
-                        {
-                            join_keys.push((r.clone(), l.clone()));
+
+                    // Search all remaining plans for the next to
+                    // join. Prefer the first one that has a join
+                    // predicate in the predicate lists
+                    let plan_with_idx = remaining_plans.iter().enumerate().find(|(_idx, plan)| {
+                        // skip plans that have been joined already
+                        let plan = if let Some(plan) = plan {
+                            plan
+                        } else {
+                            return false;
+                        };
+
+                        // can we find a match?
+                        let left_schema = left.schema();
+                        let right_schema = plan.schema();
+                        for (l, r) in &possible_join_keys {
+                            if left_schema.field_from_column(l).is_ok()
+                                && right_schema.field_from_column(r).is_ok()
+                            {
+                                join_keys.push((l.clone(), r.clone()));
+                            } else if left_schema.field_from_column(r).is_ok()
+                                && right_schema.field_from_column(l).is_ok()
+                            {
+                                join_keys.push((r.clone(), l.clone()));
+                            }
                         }
-                    }
+                        // stop if we found join keys
+                        !join_keys.is_empty()
+                    });
+
+                    // If we did not find join keys, either there are
+                    // no more plans, or we can't find any plans that
+                    // can be joined with predicates
                     if join_keys.is_empty() {
-                        left =
-                            LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+                        assert!(plan_with_idx.is_none());

Review comment:
       Very tight!

##########
File path: datafusion/src/sql/planner.rs
##########
@@ -714,33 +713,79 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                 extract_possible_join_keys(&filter_expr, &mut possible_join_keys)?;
 
                 let mut all_join_keys = HashSet::new();
-                let mut left = plans[0].clone();
-                for right in plans.iter().skip(1) {
-                    let left_schema = left.schema();
-                    let right_schema = right.schema();
+
+                let mut plans = plans.into_iter();
+                let mut left = plans.next().unwrap(); // have at least one plan
+
+                // List of the plans that have not yet been joined
+                let mut remaining_plans: Vec<Option<LogicalPlan>> =
+                    plans.into_iter().map(Some).collect();
+
+                // Take from the list of remaining plans,
+                loop {
                     let mut join_keys = vec![];
-                    for (l, r) in &possible_join_keys {
-                        if left_schema.field_from_column(l).is_ok()
-                            && right_schema.field_from_column(r).is_ok()
-                        {
-                            join_keys.push((l.clone(), r.clone()));
-                        } else if left_schema.field_from_column(r).is_ok()
-                            && right_schema.field_from_column(l).is_ok()
-                        {
-                            join_keys.push((r.clone(), l.clone()));
+
+                    // Search all remaining plans for the next to
+                    // join. Prefer the first one that has a join
+                    // predicate in the predicate lists
+                    let plan_with_idx = remaining_plans.iter().enumerate().find(|(_idx, plan)| {

Review comment:
       Here, I use `plan_with_idx` to replace `idx`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787301598



##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                         }
                     }
                     if join_keys.is_empty() {
-                        left =
-                            LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+                        // check [1..idx] if contains current plan to avoid infinite loop
+                        if mut_plans[1..idx].contains(&mut_plans[idx])

Review comment:
       Oh never mind, just noticed @alamb already implemented a similar approach at https://github.com/xudong963/arrow-datafusion/pull/2 :D nice!




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786062776



##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -214,8 +220,14 @@ pub struct Extension {
     pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>,
 }
 
+impl PartialEq for Extension {
+    fn eq(&self, _other: &Self) -> bool {
+        todo!()

Review comment:
       I'm not sure if need to do something.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013877492


   > Could you possibly provide some tests @xudong963 ?
   
   Sure.
   
   
   > I was expecting to see code that basically applied an algebraic transformation on predicates like:
   
   The ticket doesn't do the transformation.  It does the following thing.
   
   First of all, let's see the example:
   ```
   ❯ create table part as select 1 as p_partkey;                                                                                                                                                
   0 rows in set. Query took 0.003 seconds.                                                      
   ❯ create table lineitem as select 1 as l_partkey, 2 as l_suppkey;                             
   0 rows in set. Query took 0.005 seconds.                                                      
   ❯ create table supplier as select 1 as s_suppkey;                                             
   0 rows in set. Query took 0.002 seconds.                                                                                                
   ❯ explain select * from part, supplier, lineitem where p_partkey = l_partkey and s_suppkey = l_suppkey;                                                                                      
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   --------------------------------------------------+                       
   | plan_type     | plan                                                                                                                                                                                                                         |               
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   --------------------------------------------------+  
   | logical_plan  | Projection: #part.p_partkey, #supplier.s_suppkey, #lineitem.l_partkey, #lineitem.l_suppkey                                                                                 
                                                     | 
   |               |   Join: #part.p_partkey = #lineitem.l_partkey, #supplier.s_suppkey = #lineitem.l_suppkey                                                                                   
                                                     |                                           
   |               |     CrossJoin:                                                                                                                                                                                                               |                          
   |               |       TableScan: part projection=Some([0])                                  
                                                     |                                                                                                                                          
   |               |       TableScan: supplier projection=Some([0])                                                                                                                             
                                                     |                                                                                                                                          
   |               |     TableScan: lineitem projection=Some([0, 1])      
   ```
   https://github.com/apache/arrow-datafusion/blob/6f7b2d25fb75c843efed67fbd72d09b2c2d6c2eb/datafusion/src/sql/planner.rs#L718
   In the `for` loop, at first, `left` is `part`, `right` is `supplier`, there is no `join key` between `part` and `right`, so result in `cross join` between `part` and `right`. It's heavy.
   
   In the ticket, we can push `supplier` to `mut_plans`, after inner join `part` and `lineitem`, `supplier` can inner join with them.
   ```
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                                               |
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: #part.p_partkey, #lineitem.l_partkey, #lineitem.l_suppkey, #supplier.s_suppkey                                                         |
   |               |   Join: #lineitem.l_suppkey = #supplier.s_suppkey                                                                                                  |
   |               |     Join: #part.p_partkey = #lineitem.l_partkey                                                                                                    |
   |               |       TableScan: part projection=Some([0])                                                                                                         |
   |               |       TableScan: lineitem projection=Some([0, 1])                                                                                                  |
   |               |     TableScan: supplier projection=Some([0])   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1014608623


   add tests and fix the extreme case. cc @alamb @pjmore 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013642370


   cc @houqp @alamb 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1014105703


   > @xudong963 This might be a bit nitpicky but this will enter an infinite loop if multiple cross joins are required for a plan. As an example, the following explain select will never finish:
   > 
   > ```
   > 
   > ❯ create table part as select 1 as p_partkey, 2 as p_partprice, 3 as p_partprice
   > _taxadded;
   > 0 rows in set. Query took 0.036 seconds.
   > ❯ create table lineitem as select 1 as l_itemkey;
   > 0 rows in set. Query took 0.001 seconds.
   > ❯ create table supplier as select 1 as s_suppkey;
   > 0 rows in set.0 rows in set. Query took 0.001 seconds. Query took 0.001 seconds.
   > ❯ explain select * from part,supplier,lineitem where p_partprice=p_partprice_tax
   > added;
   > ```
   > 
   > In practice I'm not sure how much of a deal this is as it requires multiple cross joins with a filter that operates solely on a single table.
   
   Good catch, I think we should process the case because we can't guarantee what kind of queries the user will write 😁


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] pjmore commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
pjmore commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1014093807


   @xudong963 This might be a bit nitpicky but this will enter an infinite loop if multiple cross joins are required for a plan. As an example, the following explain select will never finish:
   ```DataFusion CLI v5.1.0
   
   ❯ create table part as select 1 as p_partkey, 2 as p_partprice, 3 as p_partprice
   _taxadded;
   0 rows in set. Query took 0.036 seconds.
   ❯ create table lineitem as select 1 as l_itemkey;
   0 rows in set. Query took 0.001 seconds.
   ❯ create table supplier as select 1 as s_suppkey;
   0 rows in set.0 rows in set. Query took 0.001 seconds. Query took 0.001 seconds.
   ❯ explain select * from part,supplier,lineitem where p_partprice=p_partprice_tax
   added;
   ```
   In practice I'm not sure how much of a deal this is as it requires multiple cross joins with a filter that operates solely on a single table. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1016020781


   > The only thing I can think of would be to extend FilterPushdown in a way that would try and push down column = column type expressions from Filter nodes into Join nodes?
   
   Yes, this is exactly what I have in mind. With this optimizer rule, the sql planner can just naively plan these types of queries as cross joins and let the optimizer rewrite them into inner joins whenever applicable. And dataframe api users can write dumb cross join queries that run as fast as optimized inner join queries.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1015189038


   As a future follow up task, I think there is value in handling this plan rewrite in our optimizer layer so dataframe users can benefit from it as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786099463



##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                         }
                     }
                     if join_keys.is_empty() {
-                        left =
-                            LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+                        /// check [1..idx] if contains current plan to avoid infinite loop

Review comment:
       ```suggestion
                           // check [1..idx] if contains current plan to avoid infinite loop
   ```




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787300922



##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                         }
                     }
                     if join_keys.is_empty() {
-                        left =
-                            LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+                        // check [1..idx] if contains current plan to avoid infinite loop
+                        if mut_plans[1..idx].contains(&mut_plans[idx])

Review comment:
       Got it, I am thinking that we might be able to avoid this extra PartialEq and contains check overhead by breaking this up into two loops:
   
   1. extract the loop logic into its own function
   2. first loop through `plans` once, for each plan that doesn't have join key matches, we push its index in `plans` into a retry vector
   3. second loop to go through the retry vector
   
   This way, we can guarantee every plan in the original `plans` vector is evaluated at most twice without getting into an infinite loop while avoiding particleq compare and clones. What do you think?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787307010



##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                         }
                     }
                     if join_keys.is_empty() {
-                        left =
-                            LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+                        // check [1..idx] if contains current plan to avoid infinite loop
+                        if mut_plans[1..idx].contains(&mut_plans[idx])

Review comment:
       Yeah, I'll look it carefully! πŸ‘




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786527051



##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                         }
                     }
                     if join_keys.is_empty() {
-                        left =
-                            LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+                        // check [1..idx] if contains current plan to avoid infinite loop
+                        if mut_plans[1..idx].contains(&mut_plans[idx])
+                            || idx == mut_plans.len() - 1
+                        {
+                            left = LogicalPlanBuilder::from(left)
+                                .cross_join(&mut_plans[idx])?
+                                .build()?;
+                        } else {
+                            mut_plans.push(mut_plans[idx].clone());

Review comment:
       Yes, you understand exactly right πŸ‘




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013642299


   
   
   
   
   ```
   ❯ create table part as select 1 as p_partkey;                                                                                                                                                
   0 rows in set. Query took 0.003 seconds.                                                      
   ❯ create table lineitem as select 1 as l_partkey, 2 as l_suppkey;                             
   0 rows in set. Query took 0.005 seconds.                                                      
   ❯ create table supplier as select 1 as s_suppkey;                                             
   0 rows in set. Query took 0.002 seconds.                                                                                                
   ❯ explain select * from part, supplier, lineitem where p_partkey = l_partkey and s_suppkey = l_suppkey; 
   
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                                               |
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: #part.p_partkey, #lineitem.l_partkey, #lineitem.l_suppkey, #supplier.s_suppkey                                                         |
   |               |   Join: #lineitem.l_suppkey = #supplier.s_suppkey                                                                                                  |
   |               |     Join: #part.p_partkey = #lineitem.l_partkey                                                                                                    |
   |               |       TableScan: part projection=Some([0])                                                                                                         |
   |               |       TableScan: lineitem projection=Some([0, 1])                                                                                                  |
   |               |     TableScan: supplier projection=Some([0])                                                                                                       |
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1018138268


   Nice work @xudong963 @alamb !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787050608



##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -214,8 +220,14 @@ pub struct Extension {
     pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>,
 }
 
+impl PartialEq for Extension {
+    fn eq(&self, _other: &Self) -> bool {
+        todo!()

Review comment:
       If you leave `todo!()` here it means the code will assert for plans with `UserDefinedPlanNode`s and cross joins (which is likely not possible given the inputs are from the `FROM` list, and thus shoudl be `Select` (for subqueries) or `TableScan` for table references
   
   For the future, I think you can change
   
   ```rust
   #[derive(Clone)]
   pub struct Extension {
       /// The runtime extension operator
       pub node: Arc<dyn UserDefinedLogicalNode + PartialEq Send + Sync>,
   }
   ```
   
   to the following (aka force the `UserDefinedLogicalNode` to implement `PartialEq`):
   
   ```rust
   #[derive(Clone, PartialEq)]
   pub struct Extension {
       /// The runtime extension operator
       pub node: Arc<dyn UserDefinedLogicalNode + PartialEq Send + Sync>,
   }
   ```
   
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1016020781


   > The only thing I can think of would be to extend FilterPushdown in a way that would try and push down column = column type expressions from Filter nodes into Join nodes?
   
   Yes, this is exactly what I have in mind. With this optimizer rule, the sql planner can just naively plan these types of queries as cross joins and let the optimizer rewrite them into inner joins whenever applicable.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp merged pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
houqp merged pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786061330



##########
File path: datafusion/src/sql/planner.rs
##########
@@ -3818,6 +3832,31 @@ mod tests {
             \n  TableScan: public.person projection=None";
         quick_test(sql, expected);
     }
+
+    #[test]
+    fn cross_join_to_inner_join() {
+        let sql = "select person.id from person, orders, lineitem where person.id = lineitem.l_item_id and orders.o_item_id = lineitem.l_description;";
+        let expected = "Projection: #person.id\
+                                 \n  Join: #lineitem.l_description = #orders.o_item_id\
+                                 \n    Join: #person.id = #lineitem.l_item_id\
+                                 \n      TableScan: person projection=None\
+                                 \n      TableScan: lineitem projection=None\
+                                 \n    TableScan: orders projection=None";
+        quick_test(sql, expected);
+    }
+
+    #[test]
+    fn cross_join_not_to_inner_join() {
+        let sql = "select person.id from person, orders, lineitem where person.id = person.age;";

Review comment:
       here @pjmore 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1015733848


   > As a future follow up task, I think there is value in handling this plan rewrite in our optimizer layer so dataframe API users can benefit from it as well.
   
   @houqp  what do you have in mind? 
   
   With the `dataframe` API users can't create implicit joins the way you can with a sql query like `SELECT .. FROM foo, bar`;
   
   The only thing I can think of would be to extend `FilterPushdown` in a way that would try and push down `column = column` type expressions from `Filter` nodes into `Join` nodes? 
   
   So like
   
   ```
   Filter(a = b)
     Join (Cross)
       TableScan A
       TableScan B
   ```
   
   To 
   ```
     Join (Inner) exprs: {a = b}
       TableScan A
       TableScan B
   ```
   
   And 
   
   ```
   Filter(a = b)
     Join (Inner) exprs: {a2 = b2}
       TableScan A
       TableScan B
   ```
   
   ```
   Filter(a = b)
     Join (Inner) exprs: {a2 = b2 AND a = b}
       TableScan A
       TableScan B
   ```
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1016477189


   Thanks, @alamb ❀️. I look at https://github.com/xudong963/arrow-datafusion/pull/2 carefully and I think the implementation is very elegant, so I directly use it with only minor fixes. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013673068


   Currently, query 8 can't seem to pass due to `case ... when ... then ... else`, so I delete it to run the bench
   ```sql
   select
       o_year,
   from
       (
           select
               extract(year from o_orderdate) as o_year,
               l_extendedprice * (1 - l_discount) as volume,
               n2.n_name as nation
           from
               part,
               supplier,
               lineitem,
               orders,
               customer,
               nation n1,
               nation n2,
               region
           where
                   p_partkey = l_partkey
             and s_suppkey = l_suppkey
             and l_orderkey = o_orderkey
             and o_custkey = c_custkey
             and c_nationkey = n1.n_nationkey
             and n1.n_regionkey = r_regionkey
             and r_name = 'AMERICA'
             and s_nationkey = n2.n_nationkey
             and o_orderdate between date '1995-01-01' and date '1996-12-31'
             and p_type = 'ECONOMY ANODIZED STEEL'
       ) as all_nations
   group by
       o_year
   order by
       o_year;
   ```
   
   ### master
   ```
   ➜  benchmarks git:(master) cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 8 --batch-size 4096
     
   Query 8 iteration 0 took 5047.0 ms
   Query 8 iteration 1 took 4540.6 ms
   Query 8 iteration 2 took 4552.0 ms
   Query 8 avg time: 4713.19 ms
   ```
   ### xudong963:fix_cross_join
   ```
   ➜  benchmarks git:(fix_cross_join) βœ— cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 8 --batch-size 4096
   Query 8 iteration 0 took 2823.8 ms
   Query 8 iteration 1 took 2543.9 ms
   Query 8 iteration 2 took 2442.7 ms
   Query 8 avg time: 2603.48 ms
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] alamb commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1016398895


   > Yes, this is exactly what I have in mind. 
   
   Filed  https://github.com/apache/arrow-datafusion/issues/1612 to track the idea


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013673068


   Currently, query 8 can't seem to pass due to `case ... when ... then ... else`, so I delete it to run the bench
   ```sql
   select
       o_year,
   from
       (
           select
               extract(year from o_orderdate) as o_year,
               l_extendedprice * (1 - l_discount) as volume,
               n2.n_name as nation
           from
               part,
               supplier,
               lineitem,
               orders,
               customer,
               nation n1,
               nation n2,
               region
           where
                   p_partkey = l_partkey
             and s_suppkey = l_suppkey
             and l_orderkey = o_orderkey
             and o_custkey = c_custkey
             and c_nationkey = n1.n_nationkey
             and n1.n_regionkey = r_regionkey
             and r_name = 'AMERICA'
             and s_nationkey = n2.n_nationkey
             and o_orderdate between date '1995-01-01' and date '1996-12-31'
             and p_type = 'ECONOMY ANODIZED STEEL'
       ) as all_nations
   group by
       o_year
   order by
       o_year;
   ```
   
   ### master
   ```
   ➜  benchmarks git:(master) cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 8 --batch-size 4096
     
   Query 8 iteration 0 took 5047.0 ms
   Query 8 iteration 1 took 4540.6 ms
   Query 8 iteration 2 took 4552.0 ms
   Query 8 avg time: 4713.19 ms
   ```
   ### xudong963:fix_cross_join
   ```
   ➜  benchmarks git:(fix_cross_join) βœ— cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data -
   Query 8 iteration 0 took 2823.8 ms
   Query 8 iteration 1 took 2543.9 ms
   Query 8 iteration 2 took 2442.7 ms
   Query 8 avg time: 2603.48 ms
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013877492


   > Could you possibly provide some tests @xudong963 ?
   
   Sure. The resulting correctness test can be overridden by the current test. I can add a test about the logical plan.
   
   
   > I was expecting to see code that basically applied an algebraic transformation on predicates like:
   
   The ticket doesn't do the transformation.  It does the following thing.
   
   First of all, let's see the example:
   ```
   ❯ create table part as select 1 as p_partkey;                                                                                                                                                
   0 rows in set. Query took 0.003 seconds.                                                      
   ❯ create table lineitem as select 1 as l_partkey, 2 as l_suppkey;                             
   0 rows in set. Query took 0.005 seconds.                                                      
   ❯ create table supplier as select 1 as s_suppkey;                                             
   0 rows in set. Query took 0.002 seconds.                                                                                                
   ❯ explain select * from part, supplier, lineitem where p_partkey = l_partkey and s_suppkey = l_suppkey;                                                                                      
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   --------------------------------------------------+                       
   | plan_type     | plan                                                                                                                                                                                                                         |               
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
   --------------------------------------------------+  
   | logical_plan  | Projection: #part.p_partkey, #supplier.s_suppkey, #lineitem.l_partkey, #lineitem.l_suppkey                                                                                 
                                                     | 
   |               |   Join: #part.p_partkey = #lineitem.l_partkey, #supplier.s_suppkey = #lineitem.l_suppkey                                                                                   
                                                     |                                           
   |               |     CrossJoin:                                                                                                                                                                                                               |                          
   |               |       TableScan: part projection=Some([0])                                  
                                                     |                                                                                                                                          
   |               |       TableScan: supplier projection=Some([0])                                                                                                                             
                                                     |                                                                                                                                          
   |               |     TableScan: lineitem projection=Some([0, 1])      
   ```
   https://github.com/apache/arrow-datafusion/blob/6f7b2d25fb75c843efed67fbd72d09b2c2d6c2eb/datafusion/src/sql/planner.rs#L718
   In the `for` loop, at first, `left` is `part`, `right` is `supplier`, there is no `join key` between `part` and `right`, so result in `cross join` between `part` and `right`. It's heavy.
   
   In the ticket, we can push `supplier` to `mut_plans`, after inner join `part` and `lineitem`, `supplier` can inner join with them.
   ```
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
   | plan_type     | plan                                                                                                                                               |
   +---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
   | logical_plan  | Projection: #part.p_partkey, #lineitem.l_partkey, #lineitem.l_suppkey, #supplier.s_suppkey                                                         |
   |               |   Join: #lineitem.l_suppkey = #supplier.s_suppkey                                                                                                  |
   |               |     Join: #part.p_partkey = #lineitem.l_partkey                                                                                                    |
   |               |       TableScan: part projection=Some([0])                                                                                                         |
   |               |       TableScan: lineitem projection=Some([0, 1])                                                                                                  |
   |               |     TableScan: supplier projection=Some([0])   
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013642370


   cc @houqp 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1015189038


   As a future follow up task, I think there is value in handling this plan rewrite in our optimizer layer so dataframe API users can benefit from it as well.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates

Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786524099



##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
                         }
                     }
                     if join_keys.is_empty() {
-                        left =
-                            LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+                        // check [1..idx] if contains current plan to avoid infinite loop
+                        if mut_plans[1..idx].contains(&mut_plans[idx])
+                            || idx == mut_plans.len() - 1
+                        {
+                            left = LogicalPlanBuilder::from(left)
+                                .cross_join(&mut_plans[idx])?
+                                .build()?;
+                        } else {
+                            mut_plans.push(mut_plans[idx].clone());

Review comment:
       could be helpful to add a simple comment here to explain why pushing the same plan to the end helps. my understanding is we want to retry the join key matching after we go through all the plan once just in case this plan can be indirectly joined with left through other plans.

##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -214,8 +220,14 @@ pub struct Extension {
     pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>,
 }
 
+impl PartialEq for Extension {
+    fn eq(&self, _other: &Self) -> bool {
+        todo!()

Review comment:
       What's the reason for implementing PartialEq for all the plan nodes?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org