You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "HaoYang670 (via GitHub)" <gi...@apache.org> on 2023/03/31 07:50:22 UTC

[GitHub] [arrow-datafusion] HaoYang670 opened a new issue, #5808: `decorrelate_where_in` reports error when optimizing `limit subquery`

HaoYang670 opened a new issue, #5808:
URL: https://github.com/apache/arrow-datafusion/issues/5808

   ### Describe the bug
   
   `decorrelate_where_in` currently only support `Predicate` as the top level plan in the sub-queries, otherwise it will return an error:
   https://github.com/apache/arrow-datafusion/blob/667f19ebad216b7592af5a91b70a24fb21c3bb64/datafusion/optimizer/src/decorrelate_where_in.rs#L151-L152
   
   However, for limit subquery, the top level plan might be `Limit` which let `decorrelate_where_in` fail.
   
   ### To Reproduce
   
   Set `skip_failed_rules` to `false` and run the test `support_limit_subquery`, you will fail on the test with the error message
   ```
   Error: Context("decorrelate_where_in", Internal("Optimizer rule 'decorrelate_where_in' failed due to unexpected error: a projection is required at datafusion/optimizer/src/decorrelate_where_in.rs:152\ncaused by\nError during planning: Could not coerce into Projection! at datafusion/expr/src/logical_plan/plan.rs:1293"))
   test sql::subqueries::support_limit_subquery ... FAILED
   ```
   
   ### Expected behavior
   
   No error should be generated. At least, we can let `decorrelate_where_in` return `Ok(None)`
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on issue #5808: `decorrelate_where_in` reports error when optimizing `limit subquery`

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio commented on issue #5808:
URL: https://github.com/apache/arrow-datafusion/issues/5808#issuecomment-1494718839

   Taking a look now... how do I:
   
   > Set skip_failed_rules to false
   
   ?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] alamb closed issue #5808: `decorrelate_where_in` reports error when optimizing `limit subquery`

Posted by "alamb (via GitHub)" <gi...@apache.org>.
alamb closed issue #5808: `decorrelate_where_in` reports error when optimizing `limit subquery`
URL: https://github.com/apache/arrow-datafusion/issues/5808


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] avantgardnerio commented on issue #5808: `decorrelate_where_in` reports error when optimizing `limit subquery`

Posted by "avantgardnerio (via GitHub)" <gi...@apache.org>.
avantgardnerio commented on issue #5808:
URL: https://github.com/apache/arrow-datafusion/issues/5808#issuecomment-1492124630

   > take a look if you have time
   
   Unfortunately my availability is low right now. If @mingmwang 's claim is correct (which I have no reason to doubt) that:
   
   ```
   SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 where t1.name = t2.name limit 10)
   ```
   
   > can not be de-correlated
   
   then I think we'll need to have the ability to execute plans even if this rule fails (i.e. nested loop execution). I don't think I ever intended it to decorrelate _all_ subqueries - it was designed to hit the 80% case and get TPC-H working.
   
   At the time, returning an error was considered the proper thing to do. The API changed so now the rule needs to be updated to plumb `Ok(None)` down through all the layers of recursion, which can be verbose and non-trivial.
   
   My recommendation at the time (which I would still assert) is that it would make the life of optimizer rule authors considerably simpler if we add a `DataFusionError::CanNotOptimize` error and simply return that in this case, which would get treated the same as `Ok(None)` so it keeps the code readable and simplifies plumbing.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #5808: `decorrelate_where_in` reports error when optimizing `limit subquery`

Posted by "HaoYang670 (via GitHub)" <gi...@apache.org>.
HaoYang670 commented on issue #5808:
URL: https://github.com/apache/arrow-datafusion/issues/5808#issuecomment-1495125976

   Hi @avantgardnerio, you could just set the default value to false:
   https://github.com/apache/arrow-datafusion/blob/8eaa292d8a1adf777ab7817ccfa4d749de0c7e1c/datafusion/common/src/config.rs#L308
   
   , and rerun `cargo test`


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] HaoYang670 commented on issue #5808: `decorrelate_where_in` reports error when optimizing `limit subquery`

Posted by "HaoYang670 (via GitHub)" <gi...@apache.org>.
HaoYang670 commented on issue #5808:
URL: https://github.com/apache/arrow-datafusion/issues/5808#issuecomment-1491467615

   Hi @avantgardnerio @mingmwang, could you please help to take a look if you have time?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #5808: `decorrelate_where_in` reports error when optimizing `limit subquery`

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #5808:
URL: https://github.com/apache/arrow-datafusion/issues/5808#issuecomment-1491736320

   Sure, I will take a look. It is tricky to support decorate the correlated In/Exist subqueries which contains `Limit`/`OrderBy `clauses. I remember SparkSQL will report error in the case, PostgreSQL will not report error but will not decorate and just keep the nested subquery. Hyper is more advanced to handle this.
   
   In DataFusion, if we want to support this, we need to think and test all the difference cases carefully:
   
   -- Expected behavior: can be de-correlated, limit must be removed
   explain  
   SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT * FROM t2 WHERE t2.id = t1.id limit 1);
   
   -- Expected behavior: can be de-correlated, should keep the inner limit and must remove the outer limit
   explain  
   SELECT t1.id, t1.name FROM t1 WHERE EXISTS (SELECT * FROM (SELECT * FROM t2 limit 10) as t2 WHERE t2.id = t1.id limit 1);
   
   -- Expected behavior: can be de-correlated, must keep the limit
   explain 
   SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 limit 10);
   
   -- Expected behavior: can not be de-correlated, must keep limit
   explain  
   SELECT t1.id, t1.name FROM t1 WHERE t1.id in (SELECT t2.id FROM t2 where t1.name = t2.name limit 10)
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [arrow-datafusion] mingmwang commented on issue #5808: `decorrelate_where_in` reports error when optimizing `limit subquery`

Posted by "mingmwang (via GitHub)" <gi...@apache.org>.
mingmwang commented on issue #5808:
URL: https://github.com/apache/arrow-datafusion/issues/5808#issuecomment-1491755517

   Why it is tricky is because Subquery can be think of as a specific kind of nested loop join, the join condition is very specific and contains limit, the de-correlation process can be consider to push down the joins and covert the nested loop join to a normal join without limit as the join condition, it changes the evaluation ordering of the original operators, removing or keeping the limit in the re-written query will impact the correctness.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org