You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "helgikrs (via GitHub)" <gi...@apache.org> on 2023/11/15 11:20:22 UTC

[I] Unsound common subexpression elimination of a subquery [arrow-datafusion]

helgikrs opened a new issue, #8190:
URL: https://github.com/apache/arrow-datafusion/issues/8190

   ### Describe the bug
   
   A query containing two distinct InSubquery expressions are merged (one eliminated) by common subexpression elimination optimization pass where the expressions are not the same.
   
   ### To Reproduce
   
   In datafusion-cli
   
   ```sql
   create table a (a bigint);
   create table b (b bigint);
   create table c (c bigint);
   ```
   
   ```sql
   explain select * from a where a in (select b from b) AND a > 0 OR a in (select c from c);
   +---------------+-------------------------------------------------------------------------------+
   | plan_type     | plan                                                                          |
   +---------------+-------------------------------------------------------------------------------+
   | logical_plan  | LeftSemi Join: a.a = __correlated_sq_1.b                                      |
   |               |   TableScan: a projection=[a]                                                 |
   |               |   SubqueryAlias: __correlated_sq_1                                            |
   |               |     TableScan: b projection=[b]                                               |
   | physical_plan | CoalesceBatchesExec: target_batch_size=8192                                   |
   |               |   HashJoinExec: mode=Partitioned, join_type=LeftSemi, on=[(a@0, b@0)]         |
   |               |     CoalesceBatchesExec: target_batch_size=8192                               |
   |               |       RepartitionExec: partitioning=Hash([a@0], 24), input_partitions=24      |
   |               |         RepartitionExec: partitioning=RoundRobinBatch(24), input_partitions=1 |
   |               |           MemoryExec: partitions=1, partition_sizes=[0]                       |
   |               |     CoalesceBatchesExec: target_batch_size=8192                               |
   |               |       RepartitionExec: partitioning=Hash([b@0], 24), input_partitions=24      |
   |               |         RepartitionExec: partitioning=RoundRobinBatch(24), input_partitions=1 |
   |               |           MemoryExec: partitions=1, partition_sizes=[0]                       |
   |               |                                                                               |
   +---------------+-------------------------------------------------------------------------------+
   2 rows in set. Query took 0.003 seconds.
   ```
   
   We can see the second table scan removed after common subexpression elimination
   
   ```sql
   ❯ explain verbose select * from a where a in (select b from b) AND a > 0 OR a in (select c from c);
   +------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
   | plan_type                                                  | plan                                                                                            |
   +------------------------------------------------------------+-------------------------------------------------------------------------------------------------+
   | initial_logical_plan                                       | Projection: a.a                                                                                 |
   |                                                            |   Filter: a.a IN (<subquery>) AND a.a > Int64(0) OR a.a IN (<subquery>)                         |
   |                                                            |     Subquery:                                                                                   |
   |                                                            |       Projection: b.b                                                                           |
   |                                                            |         TableScan: b                                                                            |
   |                                                            |     Subquery:                                                                                   |
   |                                                            |       Projection: c.c                                                                           |
   |                                                            |         TableScan: c                                                                            |
   |                                                            |     TableScan: a                                                                                |
   | logical_plan after inline_table_scan                       | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after type_coercion                           | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after count_wildcard_rule                     | SAME TEXT AS ABOVE                                                                              |
   | analyzed_logical_plan                                      | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after unwrap_cast_in_comparison               | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after replace_distinct_aggregate              | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after eliminate_join                          | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after decorrelate_predicate_subquery          | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after scalar_subquery_to_join                 | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after extract_equijoin_predicate              | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after simplify_expressions                    | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after merge_projection                        | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after rewrite_disjunctive_predicate           | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after eliminate_duplicated_expr               | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after eliminate_filter                        | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after eliminate_cross_join                    | SAME TEXT AS ABOVE                                                                              |
   | logical_plan after common_sub_expression_eliminate         | Projection: a.a                                                                                 |
   |                                                            |   Projection: a.a                                                                               |
   |                                                            |     Filter: a.a IN (<subquery>)a.a AS IN AND a.a > Int64(0) OR a.a IN (<subquery>)a.a AS IN     |
   |                                                            |       Projection: a.a IN (<subquery>) AS a.a IN (<subquery>)a.a, a.a                            |
   |                                                            |         Subquery:                                                                               |
   |                                                            |           Projection: b.b                                                                       |
   |                                                            |             TableScan: b                                                                        |
   |                                                            |         TableScan: a                                                                            |
   ... <truncated> ...
   ```
   
   ### Expected behavior
   
   The table scan for `c` should not be removed, the resulting logical plan should show the two subqueries.
   
   ### Additional context
   
   Changing the second InSubquery such that the lhs expression is different from the first InSubquery expression avoids triggering the bug (looks like maybe CSE is just using the expression, and not the subquery plan for evaluating equivalence?)
   
   ```sql
   explain select * from a where a in (select b from b) AND a > 0 OR (a + 1) in (select c from c);
   +--------------+----------------------------------------------------------------------------------+
   | plan_type    | plan                                                                             |
   +--------------+----------------------------------------------------------------------------------+
   | logical_plan | Filter: a.a IN (<subquery>) AND a.a > Int64(0) OR a.a + Int64(1) IN (<subquery>) |
   |              |   Subquery:                                                                      |
   |              |     Projection: b.b                                                              |
   |              |       TableScan: b                                                               |
   |              |   Subquery:                                                                      |
   |              |     Projection: c.c                                                              |
   |              |       TableScan: c                                                               |
   |              |   TableScan: a projection=[a]                                                    |
   +--------------+----------------------------------------------------------------------------------+
   ```
   
   P.S.
   
   I think the `AND a > 0` clause is immaterial, but without it a different bug is triggered, I'll file a separate issue for that.
   
   ```sql
   select * from a where a in (select b from b) OR a in (select c from c);
   Optimizer rule 'simplify_expressions' failed
   caused by
   Error during planning: Attempted to create Filter predicate with expression `a.a IN (<subquery>)` aliased as 'IN'. Filter predicates should not be aliased.
   ```
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Incorrect common subexpression elimination of a subquery [arrow-datafusion]

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb commented on issue #8190:
URL: https://github.com/apache/arrow-datafusion/issues/8190#issuecomment-1815097986

   cc @waynexia 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


Re: [I] Incorrect common subexpression elimination of a subquery [arrow-datafusion]

Posted by "tustvold (via GitHub)" <gi...@apache.org>.
tustvold commented on issue #8190:
URL: https://github.com/apache/arrow-datafusion/issues/8190#issuecomment-1812918463

   I altered the title text as unsound in Rust has a very specific meaning - https://doc.rust-lang.org/nomicon/working-with-unsafe.html


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org