You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/01/15 08:20:45 UTC
[GitHub] [arrow-datafusion] xudong963 opened a new pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
xudong963 opened a new pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566
# Which issue does this PR close?
<!--
We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123.
-->
Closes #1293
# Rationale for this change
<!--
Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed.
Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes.
-->
# What changes are included in this PR?
<!--
There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR.
-->
# Are there any user-facing changes?
<!--
If there are user-facing changes then we may require documentation to be updated before approving the PR.
-->
<!--
If there are any breaking changes to public APIs, please add the `api change` label.
-->
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787711438
##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -214,8 +220,14 @@ pub struct Extension {
pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>,
}
+impl PartialEq for Extension {
+ fn eq(&self, _other: &Self) -> bool {
+ todo!()
Review comment:
> For the future, I think you can change
`PartialEq` isn't auto-trait, so it can't be used as additional trait in a trait object
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1015791735
Also, @xudong963 thank you for taking on this task. Really neat to see
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013734364
Thank you @xudong963 -- this looks very cool. I'll try and review it carefully tomorrow π
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1014105703
> @xudong963 This might be a bit nitpicky but this will enter an infinite loop if multiple cross joins are required for a plan. As an example, the following explain select will never finish:
>
> ```
>
> β― create table part as select 1 as p_partkey, 2 as p_partprice, 3 as p_partprice
> _taxadded;
> 0 rows in set. Query took 0.036 seconds.
> β― create table lineitem as select 1 as l_itemkey;
> 0 rows in set. Query took 0.001 seconds.
> β― create table supplier as select 1 as s_suppkey;
> 0 rows in set.0 rows in set. Query took 0.001 seconds. Query took 0.001 seconds.
> β― explain select * from part,supplier,lineitem where p_partprice=p_partprice_tax
> added;
> ```
>
> In practice I'm not sure how much of a deal this is as it requires multiple cross joins with a filter that operates solely on a single table.
Good catch, thanks @pjmore. I think we should process the case because we can't guarantee what kind of queries the user will write π
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786529381
##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
}
}
if join_keys.is_empty() {
- left =
- LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+ // check [1..idx] if contains current plan to avoid infinite loop
+ if mut_plans[1..idx].contains(&mut_plans[idx])
Review comment:
Here, we use `contains()` which requests `Logical Plan` implement `PartialEq` trait @houqp
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787711438
##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -214,8 +220,14 @@ pub struct Extension {
pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>,
}
+impl PartialEq for Extension {
+ fn eq(&self, _other: &Self) -> bool {
+ todo!()
Review comment:
> For the future, I think you can change
`PartialEq` isn't auto-trait, so it can't be used as an additional trait in a trait object
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
alamb edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1016398895
> Yes, this is exactly what I have in mind.
Filed https://github.com/apache/arrow-datafusion/issues/1612 to track the idea. It is a very good one @houqp π
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787756794
##########
File path: datafusion/src/sql/planner.rs
##########
@@ -714,33 +713,79 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
extract_possible_join_keys(&filter_expr, &mut possible_join_keys)?;
let mut all_join_keys = HashSet::new();
- let mut left = plans[0].clone();
- for right in plans.iter().skip(1) {
- let left_schema = left.schema();
- let right_schema = right.schema();
+
+ let mut plans = plans.into_iter();
+ let mut left = plans.next().unwrap(); // have at least one plan
+
+ // List of the plans that have not yet been joined
+ let mut remaining_plans: Vec<Option<LogicalPlan>> =
+ plans.into_iter().map(Some).collect();
+
+ // Take from the list of remaining plans,
+ loop {
let mut join_keys = vec![];
- for (l, r) in &possible_join_keys {
- if left_schema.field_from_column(l).is_ok()
- && right_schema.field_from_column(r).is_ok()
- {
- join_keys.push((l.clone(), r.clone()));
- } else if left_schema.field_from_column(r).is_ok()
- && right_schema.field_from_column(l).is_ok()
- {
- join_keys.push((r.clone(), l.clone()));
+
+ // Search all remaining plans for the next to
+ // join. Prefer the first one that has a join
+ // predicate in the predicate lists
+ let plan_with_idx = remaining_plans.iter().enumerate().find(|(_idx, plan)| {
+ // skip plans that have been joined already
+ let plan = if let Some(plan) = plan {
+ plan
+ } else {
+ return false;
+ };
+
+ // can we find a match?
+ let left_schema = left.schema();
+ let right_schema = plan.schema();
+ for (l, r) in &possible_join_keys {
+ if left_schema.field_from_column(l).is_ok()
+ && right_schema.field_from_column(r).is_ok()
+ {
+ join_keys.push((l.clone(), r.clone()));
+ } else if left_schema.field_from_column(r).is_ok()
+ && right_schema.field_from_column(l).is_ok()
+ {
+ join_keys.push((r.clone(), l.clone()));
+ }
}
- }
+ // stop if we found join keys
+ !join_keys.is_empty()
+ });
+
+ // If we did not find join keys, either there are
+ // no more plans, or we can't find any plans that
+ // can be joined with predicates
if join_keys.is_empty() {
- left =
- LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+ assert!(plan_with_idx.is_none());
Review comment:
Very tightοΌ
##########
File path: datafusion/src/sql/planner.rs
##########
@@ -714,33 +713,79 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
extract_possible_join_keys(&filter_expr, &mut possible_join_keys)?;
let mut all_join_keys = HashSet::new();
- let mut left = plans[0].clone();
- for right in plans.iter().skip(1) {
- let left_schema = left.schema();
- let right_schema = right.schema();
+
+ let mut plans = plans.into_iter();
+ let mut left = plans.next().unwrap(); // have at least one plan
+
+ // List of the plans that have not yet been joined
+ let mut remaining_plans: Vec<Option<LogicalPlan>> =
+ plans.into_iter().map(Some).collect();
+
+ // Take from the list of remaining plans,
+ loop {
let mut join_keys = vec![];
- for (l, r) in &possible_join_keys {
- if left_schema.field_from_column(l).is_ok()
- && right_schema.field_from_column(r).is_ok()
- {
- join_keys.push((l.clone(), r.clone()));
- } else if left_schema.field_from_column(r).is_ok()
- && right_schema.field_from_column(l).is_ok()
- {
- join_keys.push((r.clone(), l.clone()));
+
+ // Search all remaining plans for the next to
+ // join. Prefer the first one that has a join
+ // predicate in the predicate lists
+ let plan_with_idx = remaining_plans.iter().enumerate().find(|(_idx, plan)| {
Review comment:
Here, I use `plan_with_idx` to replace `idx`
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787301598
##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
}
}
if join_keys.is_empty() {
- left =
- LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+ // check [1..idx] if contains current plan to avoid infinite loop
+ if mut_plans[1..idx].contains(&mut_plans[idx])
Review comment:
Oh never mind, just noticed @alamb already implemented a similar approach at https://github.com/xudong963/arrow-datafusion/pull/2 :D nice!
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786062776
##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -214,8 +220,14 @@ pub struct Extension {
pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>,
}
+impl PartialEq for Extension {
+ fn eq(&self, _other: &Self) -> bool {
+ todo!()
Review comment:
I'm not sure if need to do something.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013877492
> Could you possibly provide some tests @xudong963 ?
Sure.
> I was expecting to see code that basically applied an algebraic transformation on predicates like:
The ticket doesn't do the transformation. It does the following thing.
First of all, let's see the example:
```
β― create table part as select 1 as p_partkey;
0 rows in set. Query took 0.003 seconds.
β― create table lineitem as select 1 as l_partkey, 2 as l_suppkey;
0 rows in set. Query took 0.005 seconds.
β― create table supplier as select 1 as s_suppkey;
0 rows in set. Query took 0.002 seconds.
β― explain select * from part, supplier, lineitem where p_partkey = l_partkey and s_suppkey = l_suppkey;
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------+
| logical_plan | Projection: #part.p_partkey, #supplier.s_suppkey, #lineitem.l_partkey, #lineitem.l_suppkey
|
| | Join: #part.p_partkey = #lineitem.l_partkey, #supplier.s_suppkey = #lineitem.l_suppkey
|
| | CrossJoin: |
| | TableScan: part projection=Some([0])
|
| | TableScan: supplier projection=Some([0])
|
| | TableScan: lineitem projection=Some([0, 1])
```
https://github.com/apache/arrow-datafusion/blob/6f7b2d25fb75c843efed67fbd72d09b2c2d6c2eb/datafusion/src/sql/planner.rs#L718
In the `for` loop, at first, `left` is `part`, `right` is `supplier`, there is no `join key` between `part` and `right`, so result in `cross join` between `part` and `right`. It's heavy.
In the ticket, we can push `supplier` to `mut_plans`, after inner join `part` and `lineitem`, `supplier` can inner join with them.
```
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: #part.p_partkey, #lineitem.l_partkey, #lineitem.l_suppkey, #supplier.s_suppkey |
| | Join: #lineitem.l_suppkey = #supplier.s_suppkey |
| | Join: #part.p_partkey = #lineitem.l_partkey |
| | TableScan: part projection=Some([0]) |
| | TableScan: lineitem projection=Some([0, 1]) |
| | TableScan: supplier projection=Some([0])
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1014608623
add tests and fix the extreme case. cc @alamb @pjmore
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013642370
cc @houqp @alamb
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1014105703
> @xudong963 This might be a bit nitpicky but this will enter an infinite loop if multiple cross joins are required for a plan. As an example, the following explain select will never finish:
>
> ```
>
> β― create table part as select 1 as p_partkey, 2 as p_partprice, 3 as p_partprice
> _taxadded;
> 0 rows in set. Query took 0.036 seconds.
> β― create table lineitem as select 1 as l_itemkey;
> 0 rows in set. Query took 0.001 seconds.
> β― create table supplier as select 1 as s_suppkey;
> 0 rows in set.0 rows in set. Query took 0.001 seconds. Query took 0.001 seconds.
> β― explain select * from part,supplier,lineitem where p_partprice=p_partprice_tax
> added;
> ```
>
> In practice I'm not sure how much of a deal this is as it requires multiple cross joins with a filter that operates solely on a single table.
Good catch, I think we should process the case because we can't guarantee what kind of queries the user will write π
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] pjmore commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
pjmore commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1014093807
@xudong963 This might be a bit nitpicky but this will enter an infinite loop if multiple cross joins are required for a plan. As an example, the following explain select will never finish:
```DataFusion CLI v5.1.0
β― create table part as select 1 as p_partkey, 2 as p_partprice, 3 as p_partprice
_taxadded;
0 rows in set. Query took 0.036 seconds.
β― create table lineitem as select 1 as l_itemkey;
0 rows in set. Query took 0.001 seconds.
β― create table supplier as select 1 as s_suppkey;
0 rows in set.0 rows in set. Query took 0.001 seconds. Query took 0.001 seconds.
β― explain select * from part,supplier,lineitem where p_partprice=p_partprice_tax
added;
```
In practice I'm not sure how much of a deal this is as it requires multiple cross joins with a filter that operates solely on a single table.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1016020781
> The only thing I can think of would be to extend FilterPushdown in a way that would try and push down column = column type expressions from Filter nodes into Join nodes?
Yes, this is exactly what I have in mind. With this optimizer rule, the sql planner can just naively plan these types of queries as cross joins and let the optimizer rewrite them into inner joins whenever applicable. And dataframe api users can write dumb cross join queries that run as fast as optimized inner join queries.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] houqp commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1015189038
As a future follow up task, I think there is value in handling this plan rewrite in our optimizer layer so dataframe users can benefit from it as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786099463
##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
}
}
if join_keys.is_empty() {
- left =
- LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+ /// check [1..idx] if contains current plan to avoid infinite loop
Review comment:
```suggestion
// check [1..idx] if contains current plan to avoid infinite loop
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787300922
##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
}
}
if join_keys.is_empty() {
- left =
- LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+ // check [1..idx] if contains current plan to avoid infinite loop
+ if mut_plans[1..idx].contains(&mut_plans[idx])
Review comment:
Got it, I am thinking that we might be able to avoid this extra PartialEq and contains check overhead by breaking this up into two loops:
1. extract the loop logic into its own function
2. first loop through `plans` once, for each plan that doesn't have join key matches, we push its index in `plans` into a retry vector
3. second loop to go through the retry vector
This way, we can guarantee every plan in the original `plans` vector is evaluated at most twice without getting into an infinite loop while avoiding particleq compare and clones. What do you think?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787307010
##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
}
}
if join_keys.is_empty() {
- left =
- LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+ // check [1..idx] if contains current plan to avoid infinite loop
+ if mut_plans[1..idx].contains(&mut_plans[idx])
Review comment:
Yeah, I'll look it carefully! π
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786527051
##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
}
}
if join_keys.is_empty() {
- left =
- LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+ // check [1..idx] if contains current plan to avoid infinite loop
+ if mut_plans[1..idx].contains(&mut_plans[idx])
+ || idx == mut_plans.len() - 1
+ {
+ left = LogicalPlanBuilder::from(left)
+ .cross_join(&mut_plans[idx])?
+ .build()?;
+ } else {
+ mut_plans.push(mut_plans[idx].clone());
Review comment:
Yes, you understand exactly right π
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013642299
```
β― create table part as select 1 as p_partkey;
0 rows in set. Query took 0.003 seconds.
β― create table lineitem as select 1 as l_partkey, 2 as l_suppkey;
0 rows in set. Query took 0.005 seconds.
β― create table supplier as select 1 as s_suppkey;
0 rows in set. Query took 0.002 seconds.
β― explain select * from part, supplier, lineitem where p_partkey = l_partkey and s_suppkey = l_suppkey;
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: #part.p_partkey, #lineitem.l_partkey, #lineitem.l_suppkey, #supplier.s_suppkey |
| | Join: #lineitem.l_suppkey = #supplier.s_suppkey |
| | Join: #part.p_partkey = #lineitem.l_partkey |
| | TableScan: part projection=Some([0]) |
| | TableScan: lineitem projection=Some([0, 1]) |
| | TableScan: supplier projection=Some([0]) |
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] houqp commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1018138268
Nice work @xudong963 @alamb !
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
alamb commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r787050608
##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -214,8 +220,14 @@ pub struct Extension {
pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>,
}
+impl PartialEq for Extension {
+ fn eq(&self, _other: &Self) -> bool {
+ todo!()
Review comment:
If you leave `todo!()` here it means the code will assert for plans with `UserDefinedPlanNode`s and cross joins (which is likely not possible given the inputs are from the `FROM` list, and thus shoudl be `Select` (for subqueries) or `TableScan` for table references
For the future, I think you can change
```rust
#[derive(Clone)]
pub struct Extension {
/// The runtime extension operator
pub node: Arc<dyn UserDefinedLogicalNode + PartialEq Send + Sync>,
}
```
to the following (aka force the `UserDefinedLogicalNode` to implement `PartialEq`):
```rust
#[derive(Clone, PartialEq)]
pub struct Extension {
/// The runtime extension operator
pub node: Arc<dyn UserDefinedLogicalNode + PartialEq Send + Sync>,
}
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] houqp commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
houqp commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1016020781
> The only thing I can think of would be to extend FilterPushdown in a way that would try and push down column = column type expressions from Filter nodes into Join nodes?
Yes, this is exactly what I have in mind. With this optimizer rule, the sql planner can just naively plan these types of queries as cross joins and let the optimizer rewrite them into inner joins whenever applicable.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] houqp merged pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
houqp merged pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786061330
##########
File path: datafusion/src/sql/planner.rs
##########
@@ -3818,6 +3832,31 @@ mod tests {
\n TableScan: public.person projection=None";
quick_test(sql, expected);
}
+
+ #[test]
+ fn cross_join_to_inner_join() {
+ let sql = "select person.id from person, orders, lineitem where person.id = lineitem.l_item_id and orders.o_item_id = lineitem.l_description;";
+ let expected = "Projection: #person.id\
+ \n Join: #lineitem.l_description = #orders.o_item_id\
+ \n Join: #person.id = #lineitem.l_item_id\
+ \n TableScan: person projection=None\
+ \n TableScan: lineitem projection=None\
+ \n TableScan: orders projection=None";
+ quick_test(sql, expected);
+ }
+
+ #[test]
+ fn cross_join_not_to_inner_join() {
+ let sql = "select person.id from person, orders, lineitem where person.id = person.age;";
Review comment:
here @pjmore
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1015733848
> As a future follow up task, I think there is value in handling this plan rewrite in our optimizer layer so dataframe API users can benefit from it as well.
@houqp what do you have in mind?
With the `dataframe` API users can't create implicit joins the way you can with a sql query like `SELECT .. FROM foo, bar`;
The only thing I can think of would be to extend `FilterPushdown` in a way that would try and push down `column = column` type expressions from `Filter` nodes into `Join` nodes?
So like
```
Filter(a = b)
Join (Cross)
TableScan A
TableScan B
```
To
```
Join (Inner) exprs: {a = b}
TableScan A
TableScan B
```
And
```
Filter(a = b)
Join (Inner) exprs: {a2 = b2}
TableScan A
TableScan B
```
```
Filter(a = b)
Join (Inner) exprs: {a2 = b2 AND a = b}
TableScan A
TableScan B
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1016477189
Thanks, @alamb β€οΈ. I look at https://github.com/xudong963/arrow-datafusion/pull/2 carefully and I think the implementation is very elegant, so I directly use it with only minor fixes.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013673068
Currently, query 8 can't seem to pass due to `case ... when ... then ... else`, so I delete it to run the bench
```sql
select
o_year,
from
(
select
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) as volume,
n2.n_name as nation
from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'AMERICA'
and s_nationkey = n2.n_nationkey
and o_orderdate between date '1995-01-01' and date '1996-12-31'
and p_type = 'ECONOMY ANODIZED STEEL'
) as all_nations
group by
o_year
order by
o_year;
```
### master
```
β benchmarks git:(master) cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 8 --batch-size 4096
Query 8 iteration 0 took 5047.0 ms
Query 8 iteration 1 took 4540.6 ms
Query 8 iteration 2 took 4552.0 ms
Query 8 avg time: 4713.19 ms
```
### xudong963:fix_cross_join
```
β benchmarks git:(fix_cross_join) β cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 8 --batch-size 4096
Query 8 iteration 0 took 2823.8 ms
Query 8 iteration 1 took 2543.9 ms
Query 8 iteration 2 took 2442.7 ms
Query 8 avg time: 2603.48 ms
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] alamb commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
alamb commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1016398895
> Yes, this is exactly what I have in mind.
Filed https://github.com/apache/arrow-datafusion/issues/1612 to track the idea
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013673068
Currently, query 8 can't seem to pass due to `case ... when ... then ... else`, so I delete it to run the bench
```sql
select
o_year,
from
(
select
extract(year from o_orderdate) as o_year,
l_extendedprice * (1 - l_discount) as volume,
n2.n_name as nation
from
part,
supplier,
lineitem,
orders,
customer,
nation n1,
nation n2,
region
where
p_partkey = l_partkey
and s_suppkey = l_suppkey
and l_orderkey = o_orderkey
and o_custkey = c_custkey
and c_nationkey = n1.n_nationkey
and n1.n_regionkey = r_regionkey
and r_name = 'AMERICA'
and s_nationkey = n2.n_nationkey
and o_orderdate between date '1995-01-01' and date '1996-12-31'
and p_type = 'ECONOMY ANODIZED STEEL'
) as all_nations
group by
o_year
order by
o_year;
```
### master
```
β benchmarks git:(master) cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data --format tbl --query 8 --batch-size 4096
Query 8 iteration 0 took 5047.0 ms
Query 8 iteration 1 took 4540.6 ms
Query 8 iteration 2 took 4552.0 ms
Query 8 avg time: 4713.19 ms
```
### xudong963:fix_cross_join
```
β benchmarks git:(fix_cross_join) β cargo run --release --bin tpch -- benchmark datafusion --iterations 3 --path ./data -
Query 8 iteration 0 took 2823.8 ms
Query 8 iteration 1 took 2543.9 ms
Query 8 iteration 2 took 2442.7 ms
Query 8 avg time: 2603.48 ms
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013877492
> Could you possibly provide some tests @xudong963 ?
Sure. The resulting correctness test can be overridden by the current test. I can add a test about the logical plan.
> I was expecting to see code that basically applied an algebraic transformation on predicates like:
The ticket doesn't do the transformation. It does the following thing.
First of all, let's see the example:
```
β― create table part as select 1 as p_partkey;
0 rows in set. Query took 0.003 seconds.
β― create table lineitem as select 1 as l_partkey, 2 as l_suppkey;
0 rows in set. Query took 0.005 seconds.
β― create table supplier as select 1 as s_suppkey;
0 rows in set. Query took 0.002 seconds.
β― explain select * from part, supplier, lineitem where p_partkey = l_partkey and s_suppkey = l_suppkey;
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
--------------------------------------------------+
| logical_plan | Projection: #part.p_partkey, #supplier.s_suppkey, #lineitem.l_partkey, #lineitem.l_suppkey
|
| | Join: #part.p_partkey = #lineitem.l_partkey, #supplier.s_suppkey = #lineitem.l_suppkey
|
| | CrossJoin: |
| | TableScan: part projection=Some([0])
|
| | TableScan: supplier projection=Some([0])
|
| | TableScan: lineitem projection=Some([0, 1])
```
https://github.com/apache/arrow-datafusion/blob/6f7b2d25fb75c843efed67fbd72d09b2c2d6c2eb/datafusion/src/sql/planner.rs#L718
In the `for` loop, at first, `left` is `part`, `right` is `supplier`, there is no `join key` between `part` and `right`, so result in `cross join` between `part` and `right`. It's heavy.
In the ticket, we can push `supplier` to `mut_plans`, after inner join `part` and `lineitem`, `supplier` can inner join with them.
```
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+----------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: #part.p_partkey, #lineitem.l_partkey, #lineitem.l_suppkey, #supplier.s_suppkey |
| | Join: #lineitem.l_suppkey = #supplier.s_suppkey |
| | Join: #part.p_partkey = #lineitem.l_partkey |
| | TableScan: part projection=Some([0]) |
| | TableScan: lineitem projection=Some([0, 1]) |
| | TableScan: supplier projection=Some([0])
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] xudong963 commented on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
xudong963 commented on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1013642370
cc @houqp
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] houqp edited a comment on pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
houqp edited a comment on pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#issuecomment-1015189038
As a future follow up task, I think there is value in handling this plan rewrite in our optimizer layer so dataframe API users can benefit from it as well.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
[GitHub] [arrow-datafusion] houqp commented on a change in pull request #1566: fix: sql planner creates cross join instead of inner join from select predicates
Posted by GitBox <gi...@apache.org>.
houqp commented on a change in pull request #1566:
URL: https://github.com/apache/arrow-datafusion/pull/1566#discussion_r786524099
##########
File path: datafusion/src/sql/planner.rs
##########
@@ -731,20 +732,33 @@ impl<'a, S: ContextProvider> SqlToRel<'a, S> {
}
}
if join_keys.is_empty() {
- left =
- LogicalPlanBuilder::from(left).cross_join(right)?.build()?;
+ // check [1..idx] if contains current plan to avoid infinite loop
+ if mut_plans[1..idx].contains(&mut_plans[idx])
+ || idx == mut_plans.len() - 1
+ {
+ left = LogicalPlanBuilder::from(left)
+ .cross_join(&mut_plans[idx])?
+ .build()?;
+ } else {
+ mut_plans.push(mut_plans[idx].clone());
Review comment:
could be helpful to add a simple comment here to explain why pushing the same plan to the end helps. my understanding is we want to retry the join key matching after we go through all the plan once just in case this plan can be indirectly joined with left through other plans.
##########
File path: datafusion/src/logical_plan/plan.rs
##########
@@ -214,8 +220,14 @@ pub struct Extension {
pub node: Arc<dyn UserDefinedLogicalNode + Send + Sync>,
}
+impl PartialEq for Extension {
+ fn eq(&self, _other: &Self) -> bool {
+ todo!()
Review comment:
What's the reason for implementing PartialEq for all the plan nodes?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org