You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by GitBox <gi...@apache.org> on 2022/10/11 11:00:54 UTC

[GitHub] [arrow-datafusion] alamb opened a new issue, #3794: When DataFusion optimizer rewrites an expression, it should not change the expressions "name" / "identity" / "alias"

alamb opened a new issue, #3794:
URL: https://github.com/apache/arrow-datafusion/issues/3794

   **Background**
   
   Every expression in DataFusion has a name, which is used as the *column* name. For example, in this example the output contains a single column with the name `"COUNT(aggregate_test_100.c9)"`:
   
   ```sql
   ❯ select count(c9) from aggregate_test_100;
   +------------------------------+
   | COUNT(aggregate_test_100.c9) |
   +------------------------------+
   | 100                          |
   +------------------------------+
   ```
   
   These names are used to refer to the columns in both subqueries as well as internally from one stage of the `LogicalPlan` to another. For example
   
   ```
   ❯ select "COUNT(aggregate_test_100.c9)" + 1 from (select count(c9) from aggregate_test_100) as sq;
   +--------------------------------------------+
   | sq.COUNT(aggregate_test_100.c9) + Int64(1) |
   +--------------------------------------------+
   | 101                                        |
   +--------------------------------------------+
   1 row in set. Query took 0.005 seconds.
   ```
   
   **Implication***
   
   DataFusion contains an extensive set of OptimizerRules that may rewrite the plan and/or its expressions so they execute more quickly. However, the optimizer rules must so 
   
   Because DataFusion identifies columns using a string name, it means it is critical that the *names* of expressions are not changed by the optimizer when it rewrites expressions. This is typically accomplished by renaming a rewritten expression by adding an alias.
   
   Here is a simple example of such a rewrite. The expression `1 + 2` can be internally simplified to `3` but must still be displayed the same as `1 + 2`:
   
   ```sql
   ❯ select 1 + 2;
   +---------------------+
   | Int64(1) + Int64(2) |
   +---------------------+
   | 3                   |
   +---------------------+
   ```
   
   Looking at the `EXPLAIN` output we can see that the optimizer has effectively rewritten `1 + 2` into effectively `3 as "1 + 2"
   ```sql
   ❯ explain select 1 + 2;
   +---------------+-------------------------------------------------+
   | plan_type     | plan                                            |
   +---------------+-------------------------------------------------+
   | logical_plan  | Projection: Int64(3) AS Int64(1) + Int64(2)     |
   |               |   EmptyRelation                                 |
   | physical_plan | ProjectionExec: expr=[3 as Int64(1) + Int64(2)] |
   |               |   EmptyExec: produce_one_row=true               |
   |               |                                                 |
   +---------------+-------------------------------------------------+
   2 rows in set. Query took 0.001 seconds.
   ``` 
   
   If the expression name is not preserved, bugs such as https://github.com/apache/arrow-datafusion/issues/3704 and https://github.com/apache/arrow-datafusion/issues/3555 occur where the expected columns can not be found. 
   
   **Problems**
   
   There are at least three places that we have rediscovered the issue of names changing when rewriting expressions such as https://github.com/apache/arrow-datafusion/issues/3704, https://github.com/apache/arrow-datafusion/issues/3555 and https://github.com/apache/arrow-datafusion/blob/master/datafusion/optimizer/src/simplify_expressions.rs#L316
   
   I tried to consolidate this learning into https://github.com/apache/arrow-datafusion/pull/3727 but @liu
   
   However, as @liukun4515  has pointed out https://github.com/apache/arrow-datafusion/pull/3727#issuecomment-1269276333 , there are likely several other places that have similar bugs lurking
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #3794: When DataFusion optimizer rewrites an expression, it should not change the expressions "name" / "identity" / "alias"

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #3794:
URL: https://github.com/apache/arrow-datafusion/issues/3794#issuecomment-1276147539

   I have stolen the great content in this PR and included it in https://github.com/apache/arrow-datafusion/pull/3788


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] andygrove commented on issue #3794: When DataFusion optimizer rewrites an expression, it should not change the expressions "name" / "identity" / "alias"

Posted by GitBox <gi...@apache.org>.
andygrove commented on issue #3794:
URL: https://github.com/apache/arrow-datafusion/issues/3794#issuecomment-1274662465

   There are currently two ways to create a "name" for an expression in the logical plan.
   
   ```
       /// Returns the name of this expression as it should appear in a schema. This name
       /// will not include any CAST expressions.
       pub fn name(&self) -> Result<String> {
           create_name(self)
       }
   
       /// Returns a full and complete string representation of this expression.
       pub fn canonical_name(&self) -> String {
           format!("{}", self)
       }
   ```
   
   I have seen that we sometimes use `name` when we should use `canonical_name` instead. 
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org