Posted to reviews@spark.apache.org by "agubichev (via GitHub)" <gi...@apache.org> on 2023/09/25 23:26:04 UTC

[GitHub] [spark] agubichev opened a new pull request, #43111: Support correlated exists subqueries using DecorrelateInnerQuery framework

agubichev opened a new pull request, #43111:
URL: https://github.com/apache/spark/pull/43111

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

### Was this patch authored or co-authored using generative AI tooling?

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1345923815


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -283,6 +305,15 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
         } else {
           a
         }
+
+      case l @ Limit(_, _) if predicateMap.nonEmpty =>

Review Comment:
   No, we don't.
   In fact, CheckAnalysis now allows LIMIT in correlated subqueries, because we support it in lateral, scalar, EXISTS, and IN subqueries (EXISTS and IN support is added in this PR).
   
   This check just makes sure that the legacy path (aka PullupCorrelatedPredicates) does not allow LIMITs.




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1342795235


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")
+      .internal()
+      .doc("Decorrelate EXISTS and IN subqueries.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)

Review Comment:
   Some queries that were not supported before are supported with this flag; that is the only change in results.




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1342801676


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -63,6 +64,18 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
     Join(outerPlan, dedupSubplan, joinType, condition, JoinHint(None, subHint))
   }
 
+  private def removeDomainJoins(

Review Comment:
   `rewriteDomainJoinsIfPresent`




[GitHub] [spark] agubichev commented on a diff in pull request #43111: [SPARK-36112] [SQL] Support correlated exists subqueries using DecorrelateInnerQuery framework

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1341562525


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -1397,6 +1400,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
           failOnInvalidOuterReference(l)
           checkPlan(input, aggregated, canContainOuter)
 
+        case o @ Offset(_, input) =>

Review Comment:
   Some tests with EXISTS had OFFSET, so I decided to include the OFFSET handling.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -1134,8 +1134,11 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
       isLateral: Boolean = false): Unit = {
     // Some query shapes are only supported with the DecorrelateInnerQuery framework.
     // Currently we only use this new framework for scalar and lateral subqueries.

Review Comment:
   done




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1343058747


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")

Review Comment:
   Looks like for IN/EXISTS to be decorrelated with DecorrelateInnerQuery you need both flags enabled, which makes sense.




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1345395781


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -283,6 +305,15 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
         } else {
           a
         }
+
+      case l @ Limit(_, _) if predicateMap.nonEmpty =>

Review Comment:
   Don't we fail earlier in `CheckAnalysis`?




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1347540658


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala:
##########
@@ -461,6 +462,22 @@ object DecorrelateInnerQuery extends PredicateHelper {
       p.mapChildren(rewriteDomainJoins(outerPlan, _, conditions))
   }
 
+  private def isCountBugFree(aggregateExpressions: Seq[NamedExpression]): Boolean = {

Review Comment:
   Since this is only used in the new code path, it's fine to improve it when we consolidate the code later.




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1342986860


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")
+      .internal()
+      .doc("Decorrelate EXISTS and IN subqueries.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)

Review Comment:
   I think there might be some correctness fixes here for COUNT bug in IN/EXISTS too, enabled by moving those to DecorrelateInnerQuery, right? I know not every query is fixed but there are some that are fixed, such as the example query you have added to one of your tests `select * from t1 where exists (select count(*) from t2 where t2.c1 = t1.c1);`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")
+      .internal()
+      .doc("Decorrelate EXISTS and IN subqueries.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)

Review Comment:
   I think there are some correctness fixes here for COUNT bug in IN/EXISTS too, enabled by moving those to DecorrelateInnerQuery. I know not every query is fixed but there are some that are fixed, such as the example query you have added to one of your tests `select * from t1 where exists (select count(*) from t2 where t2.c1 = t1.c1);`




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1342985201


##########
sql/core/src/test/resources/sql-tests/inputs/subquery/exists-subquery/exists-count-bug.sql:
##########
@@ -0,0 +1,21 @@
+create temporary view t1(c1, c2) as values (0, 1), (1, 2);
+create temporary view t2(c1, c2) as values (0, 2), (0, 3);
+create temporary view t3(c1, c2) as values (0, 3), (1, 4), (2, 5);
+
+select * from t1 where exists (select count(*) from t2 where t2.c1 = t1.c1);

Review Comment:
   These queries were runnable before - I just checked on master. They returned wrong results due to the COUNT bug.
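
   To make the wrong result concrete, here is a minimal sketch in plain Scala (no Spark APIs) using the `t1` and `t2` data from the test file. `perRowSemantics` evaluates the subquery once per outer row, which is the reference meaning of a correlated EXISTS. `naiveSemiJoin` is an illustrative stand-in, not the actual legacy rule: it assumes a rewrite that turns the correlated predicate into a semi-join condition and loses the "one row even on empty input" property of a global COUNT(*). Both function names and the object name are hypothetical.

```scala
// Plain-Scala model of the test data; no Spark APIs are used here.
object ExistsCountBugSketch {
  // t1(c1, c2) and t2(c1, c2), as in exists-count-bug.sql.
  val t1: Seq[(Int, Int)] = Seq((0, 1), (1, 2))
  val t2: Seq[(Int, Int)] = Seq((0, 2), (0, 3))

  // Reference semantics: evaluate the subquery once per outer row.
  // A global COUNT(*) always yields exactly one row (possibly holding 0),
  // so EXISTS is true for every outer row.
  def perRowSemantics: Seq[(Int, Int)] =
    t1.filter { case (c1, _) =>
      val countRows = Seq(t2.count(_._1 == c1)) // one row, even when the count is 0
      countRows.nonEmpty
    }

  // Hypothetical naive rewrite: the correlated predicate becomes a
  // semi-join condition. Outer rows with no match in t2 disappear,
  // which is the COUNT bug.
  def naiveSemiJoin: Seq[(Int, Int)] =
    t1.filter { case (c1, _) => t2.exists(_._1 == c1) }
}
```

   Under these assumptions the two functions disagree on the outer row `(1, 2)`, which has no matching row in `t2`: the per-row evaluation keeps it and the naive semi-join drops it.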




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1345926575


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala:
##########
@@ -461,6 +462,23 @@ object DecorrelateInnerQuery extends PredicateHelper {
       p.mapChildren(rewriteDomainJoins(outerPlan, _, conditions))
   }
 
+  private def isCountBugFree(aggregateExpressions: Seq[NamedExpression]): Boolean = {
+    // The COUNT bug only appears if an aggregate expression returns a non-NULL result on an empty
+    // input.
+    // Typical example (hence the name) is COUNT(*) that returns 0 from an empty result.
+    // However, SUM(x) IS NULL is another case that returns 0, and in general any IS/NOT IS and CASE
+    // expressions are suspect (and the combination of those).
+    // For now we conservatively accept only those expressions that are guaranteed to be safe.
+    val exprsRejectEmptyInput = aggregateExpressions.map {

Review Comment:
   For EXISTS and IN we did not detect the COUNT bug before, hence the incorrect results.
   For scalar subqueries, the COUNT bug is detected in a fairly convoluted way, as a post-processing step on the scalar subquery. I will refactor that to use this function in the future, as it seems easier and more straightforward.




[GitHub] [spark] agubichev commented on a diff in pull request #43111: [SPARK-36112] [SQL] Support correlated exists subqueries using DecorrelateInnerQuery framework

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1341561643


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala:
##########
@@ -710,6 +711,12 @@ object DecorrelateInnerQuery extends PredicateHelper {
           case a @ Aggregate(groupingExpressions, aggregateExpressions, child) =>
             val outerReferences = collectOuterReferences(a.expressions)
             val newOuterReferences = parentOuterReferences ++ outerReferences
+            // Find all the aggregate expressions that are subject to the "COUNT bug",
+            // i.e. those that have non-None default result.
+            val countBugSusceptibleAggs = aggregateExpressions.flatMap(_.collect {
+              case a@AggregateExpression(function, _, _, _, _)
+                if function.defaultResult.nonEmpty => a

Review Comment:
   Discussed it offline.
   These are scalar subqueries, so they are outside the scope of this PR. Filed https://issues.apache.org/jira/browse/SPARK-45381 to track the OSS vs DBR difference in one of your examples.




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1343058747


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")

Review Comment:
   Looks like for IN/EXISTS to be decorrelated with DecorrelateInnerQuery you need both flags enabled. Either way makes sense to me.




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #43111: [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework
URL: https://github.com/apache/spark/pull/43111



Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1344491645


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")
+      .internal()
+      .doc("Decorrelate EXISTS and IN subqueries.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)

Review Comment:
   Changed the flag to reflect that there is some legacy behavior, and added tests for that behavior.



##########
sql/core/src/test/resources/sql-tests/inputs/subquery/exists-subquery/exists-count-bug.sql:
##########
@@ -0,0 +1,21 @@
+create temporary view t1(c1, c2) as values (0, 1), (1, 2);
+create temporary view t2(c1, c2) as values (0, 2), (0, 3);
+create temporary view t3(c1, c2) as values (0, 3), (1, 4), (2, 5);
+
+select * from t1 where exists (select count(*) from t2 where t2.c1 = t1.c1);

Review Comment:
   Added tests for the wrong results




[GitHub] [spark] allisonwang-db commented on a diff in pull request #43111: [SPARK-36112] [SQL] Support correlated exists subqueries using DecorrelateInnerQuery framework

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.

allisonwang-db commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1342014577


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -63,6 +64,18 @@ object RewritePredicateSubquery extends Rule[LogicalPlan] with PredicateHelper {
     Join(outerPlan, dedupSubplan, joinType, condition, JoinHint(None, subHint))
   }
 
+  private def removeDomainJoins(

Review Comment:
   nit: `maybeRewriteDomainJoins`



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")

Review Comment:
   Does this flag depend on this `DECORRELATE_INNER_QUERY_ENABLED` flag?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")
+      .internal()
+      .doc("Decorrelate EXISTS and IN subqueries.")
+      .version("4.0.0")
+      .booleanConf
+      .createWithDefault(true)

Review Comment:
   Will this have any query result changes if we enable this by default?



##########
sql/core/src/test/resources/sql-tests/inputs/subquery/exists-subquery/exists-count-bug.sql:
##########
@@ -0,0 +1,21 @@
+create temporary view t1(c1, c2) as values (0, 1), (1, 2);
+create temporary view t2(c1, c2) as values (0, 2), (0, 3);
+create temporary view t3(c1, c2) as values (0, 3), (1, 4), (2, 5);
+
+select * from t1 where exists (select count(*) from t2 where t2.c1 = t1.c1);

Review Comment:
   Are the results before and after this PR the same for these queries?




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1342801406


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")

Review Comment:
   No, those are independent. The DECORRELATE_INNER_QUERY_ENABLED flag is for scalar/lateral subqueries, and the current one is for IN/EXISTS.




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1344491251


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")

Review Comment:
   ack!




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1345392903


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala:
##########
@@ -461,6 +462,23 @@ object DecorrelateInnerQuery extends PredicateHelper {
       p.mapChildren(rewriteDomainJoins(outerPlan, _, conditions))
   }
 
+  private def isCountBugFree(aggregateExpressions: Seq[NamedExpression]): Boolean = {
+    // The COUNT bug only appears if an aggregate expression returns a non-NULL result on an empty
+    // input.
+    // Typical example (hence the name) is COUNT(*) that returns 0 from an empty result.
+    // However, SUM(x) IS NULL is another case that returns 0, and in general any IS/NOT IS and CASE
+    // expressions are suspect (and the combination of those).
+    // For now we conservatively accept only those expressions that are guaranteed to be safe.
+    val exprsRejectEmptyInput = aggregateExpressions.map {
+      case _ : AttributeReference => true
+      case Alias(_: AttributeReference, _) => true
+      case Alias(_: Literal, _) => true
+      case Alias(a: AggregateExpression, _) if a.aggregateFunction.defaultResult == None => true
+      case _ => false
+    }
+    exprsRejectEmptyInput.forall(x => x == true)

Review Comment:
   nit:
   ```
   aggregateExpressions.forall ...
   ```
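
   One way to read this nit, sketched on a simplified stand-in expression model rather than the real Catalyst classes: the intermediate `Seq[Boolean]` and the `x == true` comparison can be folded into a single `forall` with a pattern-matching function. The case classes below only mimic the shapes matched in the diff; `IsNull`, the `fn` field, and the object name are assumptions added for illustration.

```scala
// Simplified stand-ins for Catalyst expression classes; these are NOT the real Spark types.
sealed trait Expr
final case class AttributeReference(name: String) extends Expr
final case class Literal(value: Any) extends Expr
// defaultResult models the value produced on empty input, e.g. Some(0) for COUNT.
final case class AggregateExpression(fn: String, defaultResult: Option[Any]) extends Expr
final case class Alias(child: Expr, name: String) extends Expr
final case class IsNull(child: Expr) extends Expr

object CountBugCheckSketch {
  // Same accept-list as in the diff, expressed directly with forall.
  def isCountBugFree(aggregateExpressions: Seq[Expr]): Boolean =
    aggregateExpressions.forall {
      case _: AttributeReference => true
      case Alias(_: AttributeReference, _) => true
      case Alias(_: Literal, _) => true
      case Alias(a: AggregateExpression, _) if a.defaultResult.isEmpty => true
      case _ => false // anything else is conservatively treated as suspect
    }
}
```

   With this shape, an aliased aggregate that has no default result is accepted, while an aliased COUNT-like aggregate (modeled here with `Some(0)`) or an `IS NULL` wrapper around an aggregate is rejected.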




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1345392237


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala:
##########
@@ -461,6 +462,23 @@ object DecorrelateInnerQuery extends PredicateHelper {
       p.mapChildren(rewriteDomainJoins(outerPlan, _, conditions))
   }
 
+  private def isCountBugFree(aggregateExpressions: Seq[NamedExpression]): Boolean = {
+    // The COUNT bug only appears if an aggregate expression returns a non-NULL result on an empty
+    // input.
+    // Typical example (hence the name) is COUNT(*) that returns 0 from an empty result.
+    // However, SUM(x) IS NULL is another case that returns 0, and in general any IS/NOT IS and CASE
+    // expressions are suspect (and the combination of those).
+    // For now we conservatively accept only those expressions that are guaranteed to be safe.
+    val exprsRejectEmptyInput = aggregateExpressions.map {

Review Comment:
   Is this new code? How did we detect the COUNT bug before?




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "andylam-db (via GitHub)" <gi...@apache.org>.

andylam-db commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1358985912


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -5272,6 +5281,9 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
 
   def decorrelateInnerQueryEnabled: Boolean = getConf(SQLConf.DECORRELATE_INNER_QUERY_ENABLED)
 
+  def decorrelateInnerQueryEnabledForExistsIn: Boolean =
+    !getConf(SQLConf.DECORRELATE_EXISTS_IN_SUBQUERY_LEGACY_INCORRECT_COUNT_HANDLING_ENABLED)

Review Comment:
   Should we check whether `decorrelateInnerQueryEnabled` is true here, too?
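
   The two options being weighed can be sketched on a small stand-in model. This is not the real `SQLConf`: `ConfSketch` and both method names are hypothetical, and the two fields only mirror the settings named in this thread.

```scala
// Hypothetical, simplified model of the two settings discussed in this thread.
final case class ConfSketch(
    decorrelateInnerQueryEnabled: Boolean,
    legacyIncorrectCountHandlingEnabled: Boolean) {

  // As in the diff: depends only on the legacy flag.
  def existsInIndependent: Boolean = !legacyIncorrectCountHandlingEnabled

  // Alternative raised in this comment: also require the general flag.
  def existsInGated: Boolean =
    decorrelateInnerQueryEnabled && !legacyIncorrectCountHandlingEnabled
}
```

   The two variants differ only when the general flag is off and the legacy flag is also off: the independent version still uses the new path for EXISTS/IN, and the gated version does not.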




[GitHub] [spark] agubichev commented on a diff in pull request #43111: [SPARK-36112] [SQL] Support correlated exists subqueries using DecorrelateInnerQuery framework

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1341562101


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")
+      .internal()
+      .doc("Decorrelate EXISTS and IN subqueries.")
+      .version("3.4.0")

Review Comment:
   done




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1343058747


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")

Review Comment:
   Looks like for IN/EXISTS to be decorrelated you need both flags enabled, which makes sense.




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1345929714


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala:
##########
@@ -461,6 +462,23 @@ object DecorrelateInnerQuery extends PredicateHelper {
       p.mapChildren(rewriteDomainJoins(outerPlan, _, conditions))
   }
 
+  private def isCountBugFree(aggregateExpressions: Seq[NamedExpression]): Boolean = {
+    // The COUNT bug only appears if an aggregate expression returns a non-NULL result on an empty
+    // input.
+    // Typical example (hence the name) is COUNT(*) that returns 0 from an empty result.
+    // However, SUM(x) IS NULL is another case that returns 0, and in general any IS/NOT IS and CASE
+    // expressions are suspect (and the combination of those).
+    // For now we conservatively accept only those expressions that are guaranteed to be safe.
+    val exprsRejectEmptyInput = aggregateExpressions.map {
+      case _ : AttributeReference => true
+      case Alias(_: AttributeReference, _) => true
+      case Alias(_: Literal, _) => true
+      case Alias(a: AggregateExpression, _) if a.aggregateFunction.defaultResult == None => true
+      case _ => false
+    }
+    exprsRejectEmptyInput.forall(x => x == true)

Review Comment:
   neat, thank you!




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1347469235


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala:
##########
@@ -461,6 +462,22 @@ object DecorrelateInnerQuery extends PredicateHelper {
       p.mapChildren(rewriteDomainJoins(outerPlan, _, conditions))
   }
 
+  private def isCountBugFree(aggregateExpressions: Seq[NamedExpression]): Boolean = {

Review Comment:
   I think the existing way to detect the count bug is better. It evaluates the `Aggregate` operator with empty input and checks whether the result is null. That is more accurate than a static analysis.
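
The difference between the two approaches can be sketched on a toy expression tree. This is only an illustration: `Expr`, `CountStar`, `Sum`, `IsNull`, `AddLiteral`, `evalOnEmpty`, `safeByEvaluation` and `safeByStaticCheck` are hypothetical names, not Catalyst classes or the helpers used in the PR, and `None` stands in for SQL NULL.

```scala
object EmptyInputCheckSketch {
  // Toy expression tree (hypothetical, not Catalyst).
  sealed trait Expr
  case object CountStar extends Expr
  final case class Sum(column: String) extends Expr
  final case class IsNull(child: Expr) extends Expr
  final case class AddLiteral(child: Expr, n: Long) extends Expr

  // Evaluate the expression as if the aggregate saw zero input rows.
  // None models SQL NULL.
  def evalOnEmpty(e: Expr): Option[Any] = e match {
    case CountStar        => Some(0L)                       // COUNT(*) of nothing is 0
    case Sum(_)           => None                           // SUM of nothing is NULL
    case IsNull(c)        => Some(evalOnEmpty(c).isEmpty)   // IS NULL never returns NULL
    case AddLiteral(c, n) => evalOnEmpty(c).collect { case v: Long => v + n } // NULL + n is NULL
  }

  // Evaluation-based check: safe only if the result on empty input is NULL.
  def safeByEvaluation(e: Expr): Boolean = evalOnEmpty(e).isEmpty

  // Conservative static check: accept only a bare aggregate that defaults to NULL.
  def safeByStaticCheck(e: Expr): Boolean = e match {
    case Sum(_) => true
    case _      => false
  }
}
```

In this sketch both checks reject `CountStar` and `IsNull(Sum("x"))`, but only the evaluation-based check accepts `AddLiteral(Sum("x"), 1)`, which is NULL on empty input and therefore harmless. The static check trades that precision for simplicity.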




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #43111:
URL: https://github.com/apache/spark/pull/43111#issuecomment-1749051836

   thanks, merging to master!



[GitHub] [spark] jchen5 commented on a diff in pull request #43111: [SPARK-36112] [SQL] Support correlated exists subqueries using DecorrelateInnerQuery framework

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1341449925


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala:
##########
@@ -710,6 +711,12 @@ object DecorrelateInnerQuery extends PredicateHelper {
           case a @ Aggregate(groupingExpressions, aggregateExpressions, child) =>
             val outerReferences = collectOuterReferences(a.expressions)
             val newOuterReferences = parentOuterReferences ++ outerReferences
+            // Find all the aggregate expressions that are subject to the "COUNT bug",
+            // i.e. those that have non-None default result.
+            val countBugSusceptibleAggs = aggregateExpressions.flatMap(_.collect {
+              case a@AggregateExpression(function, _, _, _, _)
+                if function.defaultResult.nonEmpty => a

Review Comment:
   Discussed; the logic should work because it adds count bug handling in the lower subqueries.




[GitHub] [spark] jchen5 commented on a diff in pull request #43111: [SPARK-36112] [SQL] Support correlated exists subqueries using DecorrelateInnerQuery framework

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1340545294


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -1397,6 +1400,10 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
           failOnInvalidOuterReference(l)
           checkPlan(input, aggregated, canContainOuter)
 
+        case o @ Offset(_, input) =>

Review Comment:
   How does this change relate, or is it a separate change to enable offset?



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -3416,6 +3416,14 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)
 
+  val DECORRELATE_EXISTS_AND_IN_SUBQUERIES =
+    buildConf("spark.sql.optimizer.decorrelateExistsIn.enabled")
+      .internal()
+      .doc("Decorrelate EXISTS and IN subqueries.")
+      .version("3.4.0")

Review Comment:
   I think we're on 4.0.0 now.



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala:
##########
@@ -1134,8 +1134,11 @@ trait CheckAnalysis extends PredicateHelper with LookupCatalog with QueryErrorsB
       isLateral: Boolean = false): Unit = {
     // Some query shapes are only supported with the DecorrelateInnerQuery framework.
     // Currently we only use this new framework for scalar and lateral subqueries.

Review Comment:
   Delete this line



##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/DecorrelateInnerQuery.scala:
##########
@@ -710,6 +711,12 @@ object DecorrelateInnerQuery extends PredicateHelper {
           case a @ Aggregate(groupingExpressions, aggregateExpressions, child) =>
             val outerReferences = collectOuterReferences(a.expressions)
             val newOuterReferences = parentOuterReferences ++ outerReferences
+            // Find all the aggregate expressions that are subject to the "COUNT bug",
+            // i.e. those that have non-None default result.
+            val countBugSusceptibleAggs = aggregateExpressions.flatMap(_.collect {
+              case a@AggregateExpression(function, _, _, _, _)
+                if function.defaultResult.nonEmpty => a

Review Comment:
   Just checking the function's default result doesn't work for some more complicated cases, such as where there are nested subqueries:
   ```
   select (
     select sum(cnt)
     from (select count(*) cnt from t2 where t1.c1 = t2.c1)
   ) from t1
   ```
   or:
   ```
   select (
      select sum(a) from (
        select a from t2 where t1.c1 = t2.c1 UNION ALL select 1 as a
     )
   ) from t1
   ```
   The subquery is subject to the count bug even though the sum expression at the top defaults to NULL.
   
   We have logic for this at `evalSubqueryOnZeroTups` and `evalAggExprOnZeroTups` used below.
   
   But I'm not sure how that fits into the context here - what case motivated this change?
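
The first nested query above can be simulated with plain Scala collections to show why the outer `sum` does not protect against the count bug. This is a rough sketch under assumptions: the table contents are made up for illustration, `correlated` and `naiveJoin` are hypothetical names, and `None` stands in for SQL NULL. It does not use Spark.

```scala
object ScalarCountBugSketch {
  // Illustrative data only: the c1 columns of t1 and t2.
  val t1: Seq[Int] = Seq(0, 1)
  val t2: Seq[Int] = Seq(0, 0)

  // Correlated semantics: run the inner COUNT(*) once per outer row, then
  // apply SUM to the single row it produces. The inner aggregate always
  // returns one row, so SUM never sees empty input.
  def correlated: Seq[(Int, Option[Long])] =
    t1.map { c1 =>
      val cnt = t2.count(_ == c1).toLong // 0 when no t2 row matches
      (c1, Some(cnt))
    }

  // Naive decorrelation: group t2 by c1 first, then left outer join with t1.
  // Keys absent from t2 have no group, so the join pads them with NULL.
  def naiveJoin: Seq[(Int, Option[Long])] = {
    val grouped: Map[Int, Long] =
      t2.groupBy(identity).map { case (k, rows) => k -> rows.size.toLong }
    t1.map(c1 => (c1, grouped.get(c1)))
  }
}
```

For `c1 = 1` the correlated evaluation yields 0 while the naive join yields NULL, even though `sum` itself defaults to NULL on empty input. That mismatch is the count bug surfacing through the nested `count(*)`.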




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1342795994


##########
sql/core/src/test/resources/sql-tests/inputs/subquery/exists-subquery/exists-count-bug.sql:
##########
@@ -0,0 +1,21 @@
+create temporary view t1(c1, c2) as values (0, 1), (1, 2);
+create temporary view t2(c1, c2) as values (0, 2), (0, 3);
+create temporary view t3(c1, c2) as values (0, 3), (1, 4), (2, 5);
+
+select * from t1 where exists (select count(*) from t2 where t2.c1 = t1.c1);

Review Comment:
   Before this PR, these queries failed because aggregations were not allowed in correlated EXISTS/IN subqueries.
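
As a sketch of what this test query is expected to return, the snippet below replays it over the `t1` and `t2` rows from the test file using plain Scala collections. The names `correlatedExists` and `naiveSemiJoin` are hypothetical, and the semi-join variant is only an assumed example of a rewrite that ignores the count bug, not code from the PR.

```scala
object ExistsCountBugSketch {
  // Rows copied from the temporary views in the test file: (c1, c2).
  val t1: Seq[(Int, Int)] = Seq((0, 1), (1, 2))
  val t2: Seq[(Int, Int)] = Seq((0, 2), (0, 3))

  // Correlated semantics: an aggregate without GROUP BY always produces
  // exactly one row, so EXISTS is true for every outer row.
  def correlatedExists: Seq[(Int, Int)] =
    t1.filter { case (c1, _) =>
      val subqueryRows: Seq[Long] = Seq(t2.count(_._1 == c1).toLong)
      subqueryRows.nonEmpty
    }

  // A rewrite that drops the aggregate and semi-joins on t2.c1 = t1.c1
  // loses the outer rows that have no match in t2.
  def naiveSemiJoin: Seq[(Int, Int)] =
    t1.filter { case (c1, _) => t2.exists(_._1 == c1) }
}
```

Here `correlatedExists` keeps both rows of `t1`, while `naiveSemiJoin` keeps only `(0, 1)`.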




Re: [PR] [SPARK-36112] [SQL] Support correlated EXISTS and IN subqueries using DecorrelateInnerQuery framework [spark]

Posted by "agubichev (via GitHub)" <gi...@apache.org>.

agubichev commented on code in PR #43111:
URL: https://github.com/apache/spark/pull/43111#discussion_r1358990343


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -5272,6 +5281,9 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
 
   def decorrelateInnerQueryEnabled: Boolean = getConf(SQLConf.DECORRELATE_INNER_QUERY_ENABLED)
 
+  def decorrelateInnerQueryEnabledForExistsIn: Boolean =
+    !getConf(SQLConf.DECORRELATE_EXISTS_IN_SUBQUERY_LEGACY_INCORRECT_COUNT_HANDLING_ENABLED)

Review Comment:
   The caller checks it:
   https://github.com/search?q=repo%3Aapache%2Fspark%20decorrelateInnerQueryEnabledForExistsIn&type=code
   
   (It is the first check in the `decorrelate` function, and there is an explicit check in CheckAnalysis.)
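
The combined gating described in this thread might look roughly like the following. It is a sketch only: `Conf` is a hypothetical stand-in for `SQLConf`, `useNewFrameworkForExistsIn` is an invented helper name, and the real call sites in `decorrelate` and CheckAnalysis may be structured differently.

```scala
object DecorrelationGateSketch {
  // Hypothetical stand-in for the two settings discussed here; not the real SQLConf.
  final case class Conf(
      decorrelateInnerQueryEnabled: Boolean,
      legacyIncorrectCountHandling: Boolean) {
    // Mirrors the accessor in the diff: the new path is on unless the legacy flag is set.
    def decorrelateInnerQueryEnabledForExistsIn: Boolean = !legacyIncorrectCountHandling
  }

  // Caller-side check (assumed shape): both conditions must hold before
  // EXISTS/IN subqueries go through the new framework.
  def useNewFrameworkForExistsIn(conf: Conf): Boolean =
    conf.decorrelateInnerQueryEnabled && conf.decorrelateInnerQueryEnabledForExistsIn
}
```

Keeping the accessor independent of `decorrelateInnerQueryEnabled` means the accessor reports only the legacy flag, and the caller is responsible for combining the two.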


