You are viewing a plain text version of this content. The canonical link for it is here.
Posted to github@arrow.apache.org by "alamb (via GitHub)" <gi...@apache.org> on 2023/04/13 18:15:46 UTC

[GitHub] [arrow-datafusion] alamb commented on a diff in pull request #5907: optimize non-correlated where exists subquery rewrite to scalar subquery

alamb commented on code in PR #5907:
URL: https://github.com/apache/arrow-datafusion/pull/5907#discussion_r1165877527


##########
datafusion/core/tests/sql/subqueries.rs:
##########
@@ -129,12 +129,14 @@ async fn exists_subquery_with_same_table() -> Result<()> {
 
     let expected = vec![
         "Explain [plan_type:Utf8, plan:Utf8]",
-        "  Filter: EXISTS (<subquery>) [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
-        "    Subquery: [t1_int:UInt32;N]",
-        "      Projection: t1.t1_int [t1_int:UInt32;N]",
-        "        Filter: t1.t1_id > t1.t1_int [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
-        "          TableScan: t1 [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
-        "    TableScan: t1 projection=[t1_id, t1_name, t1_int] [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
+        "  Projection: t1.t1_id, t1.t1_name, t1.t1_int [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
+        "    CrossJoin: [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N, COUNT(UInt8(1)):Int64;N]",
+        "      TableScan: t1 projection=[t1_id, t1_name, t1_int] [t1_id:UInt32;N, t1_name:Utf8;N, t1_int:UInt32;N]",
+        "      SubqueryAlias: __scalar_sq_1 [COUNT(UInt8(1)):Int64;N]",
+        "        Filter: COUNT(UInt8(1)) > Int64(0) [COUNT(UInt8(1)):Int64;N]",
+        "          Aggregate: groupBy=[[]], aggr=[[COUNT(UInt8(1))]] [COUNT(UInt8(1)):Int64;N]",
+        "            Filter: t1.t1_id > t1.t1_int [t1_id:UInt32;N, t1_int:UInt32;N]",
+        "              TableScan: t1 projection=[t1_id, t1_int] [t1_id:UInt32;N, t1_int:UInt32;N]",

Review Comment:
   I think the classic way to rewrite a subquery like
   
   ```
   WHERE  EXISTS (SELECT b FROM t2 where a>1 );
   ```
   
   Is to use a SEMI Join
   
   ```
   SemiJoin ()
     Filter(a > 1)
       TableScan(a, b)
   ```
   
   
   I don't think it is necessary to actually do an aggregate to count how many rows -- all that is needed is to check if there is any row
   
   
   Perhaps @jackwener  or @ygf11  have some suggestion on the way to extend subquery rewrites to support non correlated exists 🤔 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org