You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "wolffcm (via GitHub)" <gi...@apache.org> on 2023/04/20 18:28:39 UTC

[GitHub] [arrow-datafusion] wolffcm opened a new pull request, #6077: fix: make simplify_expressions use a single schema for resolution

wolffcm opened a new pull request, #6077:
URL: https://github.com/apache/arrow-datafusion/pull/6077

   # Which issue does this PR close?
   
   Closes #5996.
   
   # Rationale for this change
   
   Fixing this will allow lots of expression simplification to succeed where it failed before.
   
   In IOx we would like to be able to run the logical optimizer with `skip_failed_rules` set to `false`, since some of our custom rules are required to run and produce user-facing errors.
   
   When I ran IOx tests with `skip_failed_rules` set to `false`, I found issue #5996.
   
   # What changes are included in this PR?
   
   The issue is with the `SimplifyExpressions` logical rule. As it traverses the logical plan, it includes both the output schema for the current node and the merged input schema from the node's inputs.
   
   The inclusion of two schemas causes an issue when various expression simplifiers inquire about attributes of expressions, e.g., here, which was the case for #5996:
   https://github.com/apache/arrow-datafusion/blob/011adc203e4a02a822a8f0008852b90fb7441264/datafusion/optimizer/src/simplify_expressions/context.rs#L130-L139
   
   This PR refines the schema used when simplifying to just the node's merged input schema. There is just one exception to using the child schema: for table scans with inlined predicates, those expressions need to be evaluated against the scan's output schema.
   
   # Are these changes tested?
   
   I added unit tests for both the issue with `power(f, 1)` from the issue, and also for the case of an inlined table scan.
   
   As a baseline I ran all the unit tests on `main` with `skip_failed_rules` set to false. I found just two failures:
   - `sql::select::use_between_expression_in_select_query`
   - `sql::subqueries::support_limit_subquery`
   
   (Seems like there should be more, based on #4685? Maybe I am not running all of the tests)
   
   When I run my branch wtih `skip_failed_rules` set to `false`. There was just one failure, which was already on `main`:
   - `sql::subqueries::support_limit_subquery`
   
   So this PR also allows `use_between_expression_in_select_query` to pass when skipping failed rules. It was failing in a similar way as #5996.
   
   I also ensured that the tests mentioned in #5208 (a recent bug fix that also touches on what schemas to use) all pass on my branch with `skip_failed_rules` set to `false`:
   - `right_anti_filter_push_down`
   - `right_semi_with_alias_filter`
   - `csv_query_group_by_and_having_and_where` (has inlined table predicate)
   
   # Are there any user-facing changes?
   
   No.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6077: fix: make simplify_expressions use a single schema for resolution

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6077:
URL: https://github.com/apache/arrow-datafusion/pull/6077#discussion_r1174403940


##########
datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:
##########
@@ -61,16 +63,14 @@ impl SimplifyExpressions {
         plan: &LogicalPlan,
         execution_props: &ExecutionProps,
     ) -> Result<LogicalPlan> {
-        // Pass down the `children merge schema` and `plan schema` to evaluate expression types.
-        // pass all `child schema` and `plan schema` isn't enough, because like `t1 semi join t2 on
-        // on t1.id = t2.id`, each individual schema can't contain all the columns in it.
-        let children_merge_schema = DFSchemaRef::new(merge_schema(plan.inputs()));
-        let schemas = vec![plan.schema(), &children_merge_schema];
-        let info = schemas
-            .into_iter()
-            .fold(SimplifyContext::new(execution_props), |context, schema| {
-                context.with_schema(schema.clone())
-            });
+        let schema = if plan.inputs().is_empty() {
+            // When predicates are pushed into a table scan, there needs to be

Review Comment:
   Thank you



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb merged pull request #6077: fix: make simplify_expressions use a single schema for resolution

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb merged PR #6077:
URL: https://github.com/apache/arrow-datafusion/pull/6077


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wolffcm commented on pull request #6077: fix: make simplify_expressions use a single schema for resolution

Posted by "wolffcm (via GitHub)" <gi...@apache.org>.
wolffcm commented on PR #6077:
URL: https://github.com/apache/arrow-datafusion/pull/6077#issuecomment-1516768346

   cc @alamb (for when you get back from vacation)
   
   @jackwener I am also interested in your thoughts on this PR since you worked on #5208.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #6077: fix: make simplify_expressions use a single schema for resolution

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on code in PR #6077:
URL: https://github.com/apache/arrow-datafusion/pull/6077#discussion_r1174101023


##########
datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:
##########
@@ -61,16 +63,14 @@ impl SimplifyExpressions {
         plan: &LogicalPlan,
         execution_props: &ExecutionProps,
     ) -> Result<LogicalPlan> {
-        // Pass down the `children merge schema` and `plan schema` to evaluate expression types.
-        // pass all `child schema` and `plan schema` isn't enough, because like `t1 semi join t2 on
-        // on t1.id = t2.id`, each individual schema can't contain all the columns in it.
-        let children_merge_schema = DFSchemaRef::new(merge_schema(plan.inputs()));
-        let schemas = vec![plan.schema(), &children_merge_schema];
-        let info = schemas
-            .into_iter()
-            .fold(SimplifyContext::new(execution_props), |context, schema| {
-                context.with_schema(schema.clone())
-            });
+        let schema = if plan.inputs().is_empty() {
+            // When predicates are pushed into a table scan, there needs to be

Review Comment:
   I think this schema should be empty (aka have no columns) rather than use the same plan schema. Logically the schema should be the columns that are available to evaluate expressions within the plan node



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] wolffcm commented on a diff in pull request #6077: fix: make simplify_expressions use a single schema for resolution

Posted by "wolffcm (via GitHub)" <gi...@apache.org>.
wolffcm commented on code in PR #6077:
URL: https://github.com/apache/arrow-datafusion/pull/6077#discussion_r1174187518


##########
datafusion/optimizer/src/simplify_expressions/simplify_exprs.rs:
##########
@@ -61,16 +63,14 @@ impl SimplifyExpressions {
         plan: &LogicalPlan,
         execution_props: &ExecutionProps,
     ) -> Result<LogicalPlan> {
-        // Pass down the `children merge schema` and `plan schema` to evaluate expression types.
-        // pass all `child schema` and `plan schema` isn't enough, because like `t1 semi join t2 on
-        // on t1.id = t2.id`, each individual schema can't contain all the columns in it.
-        let children_merge_schema = DFSchemaRef::new(merge_schema(plan.inputs()));
-        let schemas = vec![plan.schema(), &children_merge_schema];
-        let info = schemas
-            .into_iter()
-            .fold(SimplifyContext::new(execution_props), |context, schema| {
-                context.with_schema(schema.clone())
-            });
+        let schema = if plan.inputs().is_empty() {
+            // When predicates are pushed into a table scan, there needs to be

Review Comment:
   If I made this so it used the empty schema whenever the plan node had no inputs, the test `csv_query_group_by_and_having_and_where` would fail, since it attempts to simplify predicates that are inlined into a scan:
   https://github.com/apache/arrow-datafusion/blob/caa60337c7a57572d93d8bd3cbc18006aabe55e6/datafusion/expr/src/logical_plan/plan.rs#L1428-L1429
   
   It seems to be that inlined scan filters are a bit of an exception in that they are evaluated on top of the scan itself.
   
   For other kinds of node (e.g., `Values`) I think you're right though, so I pushed a commit that refines the logic a bit more to use the plan's schema only for table scans.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org