Posted to reviews@spark.apache.org by "jchen5 (via GitHub)" <gi...@apache.org> on 2023/04/16 21:18:32 UTC

[GitHub] [spark] jchen5 opened a new pull request, #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

jchen5 opened a new pull request, #40811:
URL: https://github.com/apache/spark/pull/40811

   ### What changes were proposed in this pull request?
   Fix a correctness bug for scalar subqueries with COUNT and a GROUP BY clause, for example:
   ```
   create view t1(c1, c2) as values (0, 1), (1, 2);
   create view t2(c1, c2) as values (0, 2), (0, 3);
   
   select c1, c2, (select count(*) from t2 where t1.c1 = t2.c1 group by c1) from t1;
   
   -- Correct answer: [(0, 1, 2), (1, 2, null)]
   -- Output before this fix (incorrect: the second row shows 0 instead of null):
   +---+---+------------------+
   |c1 |c2 |scalarsubquery(c1)|
   +---+---+------------------+
   |0  |1  |2                 |
   |1  |2  |0                 |
   +---+---+------------------+
   ```
   
   This is due to a bug in our "COUNT bug" handling for scalar subqueries. For a subquery with a COUNT aggregate but no GROUP BY clause, 0 is the correct output on empty input, and the COUNT bug handling constructs a plan that yields 0 when there are no matching rows. When there is a GROUP BY clause, NULL is the correct output, because the subquery returns no row at all, but we still construct the same plan as in the former case and therefore incorrectly output 0. The fix is to apply the COUNT bug handling only when the scalar subquery has no GROUP BY clause.
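
   As a minimal sketch of the intended semantics (not Spark code), the two cases can be modeled with plain Scala collections, using `None` for SQL NULL. The object and method names below are hypothetical and exist only for illustration; `t2` mirrors the view in the example above.

   ```scala
   // Toy model of scalar-subquery semantics; None stands in for SQL NULL.
   // All names here are illustrative and are not Spark APIs.
   object CountBugSemantics {
     // t2 from the example: rows of (c1, c2).
     val t2: Seq[(Int, Int)] = Seq((0, 2), (0, 3))

     // (select count(*) from t2 where t2.c1 = outerC1)
     // A global aggregate always returns exactly one row, so the result is never NULL.
     def countNoGroupBy(outerC1: Int): Option[Long] =
       Some(t2.count(_._1 == outerC1).toLong)

     // (select count(*) from t2 where t2.c1 = outerC1 group by c1)
     // A grouped aggregate returns one row per group. With no matching rows there
     // are no groups, so the scalar subquery has no row and evaluates to NULL.
     def countWithGroupBy(outerC1: Int): Option[Long] = {
       val groups = t2.filter(_._1 == outerC1).groupBy(_._1)
       groups.values.headOption.map(_.size.toLong)
     }

     def main(args: Array[String]): Unit = {
       val t1 = Seq((0, 1), (1, 2))
       t1.foreach { case (c1, c2) =>
         println(s"c1=$c1 c2=$c2 noGroupBy=${countNoGroupBy(c1)} groupBy=${countWithGroupBy(c1)}")
       }
     }
   }
   ```

   With the example data, the grouped variant yields `None` for the outer row with `c1 = 1`, which corresponds to the NULL in the correct answer.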
   
   To fix this, we need to track whether the scalar subquery has a GROUP BY, i.e. a non-empty groupingExpressions for the Aggregate node. This needs to be checked before DecorrelateInnerQuery, because that adds the correlated outer refs to the group-by list, so after that the group-by is always non-empty. We save the result as a boolean in the ScalarSubquery node until later, when we rewrite the subquery into a join in constructLeftJoins.
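
   A rough illustration of why the flag has to be captured before decorrelation, and why it is stored rather than recomputed, assuming toy stand-ins for the Catalyst nodes. `Aggregate`, `ScalarSubquery`, `decorrelate` and `pullUp` below are simplified hypothetical versions, not the real classes or rule. In particular, the toy model always has an Aggregate node, whereas the real code also has to check that one exists.

   ```scala
   // Hypothetical toy types, not the real org.apache.spark.sql.catalyst classes.
   object TrackGroupBySketch {
     final case class Aggregate(groupingExpressions: Seq[String])
     final case class ScalarSubquery(
         agg: Aggregate,
         mayHaveCountBug: Option[Boolean] = None)

     // Decorrelation appends the correlated outer references to the group-by list,
     // so afterwards groupingExpressions is non-empty for any correlated subquery.
     def decorrelate(agg: Aggregate, outerRefs: Seq[String]): Aggregate =
       agg.copy(groupingExpressions = agg.groupingExpressions ++ outerRefs)

     // One pass of the toy rule: compute the flag from the pre-decorrelation plan
     // only if it has not been saved yet, then decorrelate.
     def pullUp(sub: ScalarSubquery, outerRefs: Seq[String]): ScalarSubquery = {
       val flag = sub.mayHaveCountBug.getOrElse(sub.agg.groupingExpressions.isEmpty)
       ScalarSubquery(decorrelate(sub.agg, outerRefs), Some(flag))
     }
   }
   ```

   Running `pullUp` a second time on its own output keeps the saved flag, even though the group-by list is no longer empty at that point.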
   
   ### Why are the changes needed?
   Fix a correctness bug.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Add SQL tests and unit tests. (Note that there were 2 existing unit tests for queries of this shape, which had the incorrect results as golden results.)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] jchen5 commented on a diff in pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #40811:
URL: https://github.com/apache/spark/pull/40811#discussion_r1169400435


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -325,9 +327,20 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
     }
 
     plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) {
-      case ScalarSubquery(sub, children, exprId, conditions, hint) if children.nonEmpty =>
+      case ScalarSubquery(sub, children, exprId, conditions, hint, mayHaveCountBugOld)
+        if children.nonEmpty =>
         val (newPlan, newCond) = decorrelate(sub, plan)
-        ScalarSubquery(newPlan, children, exprId, getJoinCondition(newCond, conditions), hint)
+        val mayHaveCountBug = if (mayHaveCountBugOld.isDefined) {
+          // For idempotency, we must save this variable the first time this rule is run. Afterward,
+          // the information about whether the subquery has a GROUP BY clause is lost because a
+          // GROUP BY is introduced if one wasn't already present.
+          mayHaveCountBugOld.get
+        } else {
+          val (topPart, havingNode, aggNode) = splitSubquery(sub)
+          (aggNode.isDefined && aggNode.get.groupingExpressions.isEmpty)

Review Comment:
   This is the first main part of the fix.




[GitHub] [spark] jchen5 commented on a diff in pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #40811:
URL: https://github.com/apache/spark/pull/40811#discussion_r1170166990


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -325,9 +327,20 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
     }
 
     plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) {
-      case ScalarSubquery(sub, children, exprId, conditions, hint) if children.nonEmpty =>
+      case ScalarSubquery(sub, children, exprId, conditions, hint, mayHaveCountBugOld)
+        if children.nonEmpty =>
         val (newPlan, newCond) = decorrelate(sub, plan)
-        ScalarSubquery(newPlan, children, exprId, getJoinCondition(newCond, conditions), hint)
+        val mayHaveCountBug = if (mayHaveCountBugOld.isDefined) {
+          // For idempotency, we must save this variable the first time this rule is run. Afterward,
+          // the information about whether the subquery has a GROUP BY clause is lost because a
+          // GROUP BY is introduced if one wasn't already present.

Review Comment:
   @cloud-fan In both code paths, we add GROUP BY on the correlating outer columns - this is needed to decorrelate the subquery.
   
   In both DecorrelateInnerQuery and in pullOutCorrelatedPredicates it's under `case a @ Aggregate`.
   
   @viirya The value of mayHaveCountBug is based on the plan before decorrelation, so it reflects whether it originally had group by or not. If the GROUP BY is introduced by subquery decorrelation (in this rule) and it wasn't there before, the value will be true, and if the GROUP BY was already there, the value will be false.




[GitHub] [spark] jchen5 commented on a diff in pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #40811:
URL: https://github.com/apache/spark/pull/40811#discussion_r1169400548


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -581,7 +594,10 @@ object RewriteCorrelatedScalarSubquery extends Rule[LogicalPlan] with AliasHelpe
           currentChild.output :+ origOutput,
           Join(currentChild, query, LeftOuter, conditions.reduceOption(And), joinHint))
 
-        if (resultWithZeroTups.isEmpty) {
+        if (Utils.isTesting) {
+          assert(mayHaveCountBug.isDefined)
+        }
+        if (resultWithZeroTups.isEmpty || !mayHaveCountBug.getOrElse(true)) {

Review Comment:
   And this is the second main part of the fix.
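
   Read as a predicate, the new condition can be sketched as follows. This is a simplified restatement of the diff above, not the actual code in constructLeftJoins: `needsCountBugHandling` is a hypothetical helper name, and `resultWithZeroTups` is modeled as a plain `Option`.

   ```scala
   object CountBugDecisionSketch {
     // resultWithZeroTups: the subquery's non-null value on empty input, if any.
     // mayHaveCountBug: the flag saved on the ScalarSubquery node (None if never set).
     // The diff takes the simple path when
     //   resultWithZeroTups.isEmpty || !mayHaveCountBug.getOrElse(true)
     // so the special handling applies exactly when that condition is false.
     def needsCountBugHandling(
         resultWithZeroTups: Option[Any],
         mayHaveCountBug: Option[Boolean]): Boolean =
       resultWithZeroTups.nonEmpty && mayHaveCountBug.getOrElse(true)
   }
   ```

   An unset flag falls back to `true` here, mirroring `getOrElse(true)` in the diff, so a subquery whose flag was never recorded still gets the COUNT bug handling.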




[GitHub] [spark] jchen5 commented on pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on PR #40811:
URL: https://github.com/apache/spark/pull/40811#issuecomment-1513296034

   Yes, this is a long-standing bug.



[GitHub] [spark] cloud-fan commented on a diff in pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40811:
URL: https://github.com/apache/spark/pull/40811#discussion_r1170717475


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/subquery.scala:
##########
@@ -254,13 +254,20 @@ object SubExprUtils extends PredicateHelper {
  * scalar subquery during planning.
  *
  * Note: `exprId` is used to have a unique name in explain string output.
+ *
+ * `mayHaveCountBug` is whether it's possible for the subquery to evaluate to non-null on
+ * empty input (zero tuples). It is false if the subquery has a GROUP BY clause, because in that
+ * case the subquery yields no row at all on empty input to the GROUP BY, which evaluates to NULL.
+ * It is set in PullupCorrelatedPredicates to true/false, before it is set its value is None.
+ * See constructLeftJoins in RewriteCorrelatedScalarSubquery for more details.
  */
 case class ScalarSubquery(
     plan: LogicalPlan,
     outerAttrs: Seq[Expression] = Seq.empty,
     exprId: ExprId = NamedExpression.newExprId,
     joinCond: Seq[Expression] = Seq.empty,
-    hint: Option[HintInfo] = None)
+    hint: Option[HintInfo] = None,
+    mayHaveCountBug: Option[Boolean] = None)

Review Comment:
   How about we make the naming easier to understand? `hasGlobalAggregate`?




[GitHub] [spark] jchen5 commented on pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on PR #40811:
URL: https://github.com/apache/spark/pull/40811#issuecomment-1513568704

   FYI this bug affected both the current DecorrelateInnerQuery framework and the old code (with spark.sql.optimizer.decorrelateInnerQuery.enabled = false), and this PR fixes both.



[GitHub] [spark] viirya commented on a diff in pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "viirya (via GitHub)" <gi...@apache.org>.

viirya commented on code in PR #40811:
URL: https://github.com/apache/spark/pull/40811#discussion_r1169632592


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -325,9 +327,20 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
     }
 
     plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) {
-      case ScalarSubquery(sub, children, exprId, conditions, hint) if children.nonEmpty =>
+      case ScalarSubquery(sub, children, exprId, conditions, hint, mayHaveCountBugOld)
+        if children.nonEmpty =>
         val (newPlan, newCond) = decorrelate(sub, plan)
-        ScalarSubquery(newPlan, children, exprId, getJoinCondition(newCond, conditions), hint)
+        val mayHaveCountBug = if (mayHaveCountBugOld.isDefined) {
+          // For idempotency, we must save this variable the first time this rule is run. Afterward,
+          // the information about whether the subquery has a GROUP BY clause is lost because a
+          // GROUP BY is introduced if one wasn't already present.

Review Comment:
   So it sounds like if GROUP BY wasn't present but is introduced later, we still treat it as possibly having count bug?




[GitHub] [spark] jchen5 commented on a diff in pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on code in PR #40811:
URL: https://github.com/apache/spark/pull/40811#discussion_r1171240575


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -325,9 +327,22 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
     }
 
     plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) {
-      case ScalarSubquery(sub, children, exprId, conditions, hint) if children.nonEmpty =>
+      case ScalarSubquery(sub, children, exprId, conditions, hint, mayHaveCountBugOld)
+        if children.nonEmpty =>
         val (newPlan, newCond) = decorrelate(sub, plan)
-        ScalarSubquery(newPlan, children, exprId, getJoinCondition(newCond, conditions), hint)
+        val mayHaveCountBug = if (mayHaveCountBugOld.isEmpty) {
+          // Check whether the pre-rewrite subquery had empty groupingExpressions. If yes, it may
+          // be subject to the COUNT bug. If it has non-empty groupingExpressions, there is
+          // no COUNT bug.
+          val (topPart, havingNode, aggNode) = splitSubquery(sub)

Review Comment:
   Yes, I tried it, but the problem is that it introduces an extra DomainJoin (i.e. an extra left outer join with an extra copy of the outer table), so it has worse performance. In the future I think it's possible to get rid of that extra DomainJoin by making the logic smarter, but it's nontrivial. Also, I wanted to make this more targeted fix first to reduce risk.
   
   I linked this jira to a jira for unifying the count bug handling: https://issues.apache.org/jira/browse/SPARK-36113




[GitHub] [spark] cloud-fan closed pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan closed pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause
URL: https://github.com/apache/spark/pull/40811



[GitHub] [spark] jchen5 commented on pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "jchen5 (via GitHub)" <gi...@apache.org>.

jchen5 commented on PR #40811:
URL: https://github.com/apache/spark/pull/40811#issuecomment-1510490719

   @allisonwang-db @cloud-fan 




[GitHub] [spark] cloud-fan commented on a diff in pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on code in PR #40811:
URL: https://github.com/apache/spark/pull/40811#discussion_r1169413430


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -325,9 +327,20 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
     }
 
     plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) {
-      case ScalarSubquery(sub, children, exprId, conditions, hint) if children.nonEmpty =>
+      case ScalarSubquery(sub, children, exprId, conditions, hint, mayHaveCountBugOld)
+        if children.nonEmpty =>
         val (newPlan, newCond) = decorrelate(sub, plan)
-        ScalarSubquery(newPlan, children, exprId, getJoinCondition(newCond, conditions), hint)
+        val mayHaveCountBug = if (mayHaveCountBugOld.isDefined) {
+          // For idempotency, we must save this variable the first time this rule is run. Afterward,
+          // the information about whether the subquery has a GROUP BY clause is lost because a
+          // GROUP BY is introduced if one wasn't already present.

Review Comment:
   Where do we introduce GROUP BY? There are two code paths here:
   ```
         if (SQLConf.get.decorrelateInnerQueryEnabled) {
           DecorrelateInnerQuery(sub, outer, handleCountBug)
         } else {
           pullOutCorrelatedPredicates(sub, outer)
         }
   ```




[GitHub] [spark] cloud-fan commented on pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #40811:
URL: https://github.com/apache/spark/pull/40811#issuecomment-1512329066

   Is this a long-standing bug? Also cc @viirya



[GitHub] [spark] allisonwang-db commented on a diff in pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "allisonwang-db (via GitHub)" <gi...@apache.org>.

allisonwang-db commented on code in PR #40811:
URL: https://github.com/apache/spark/pull/40811#discussion_r1170824254


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala:
##########
@@ -325,9 +327,22 @@ object PullupCorrelatedPredicates extends Rule[LogicalPlan] with PredicateHelper
     }
 
     plan.transformExpressionsWithPruning(_.containsPattern(PLAN_EXPRESSION)) {
-      case ScalarSubquery(sub, children, exprId, conditions, hint) if children.nonEmpty =>
+      case ScalarSubquery(sub, children, exprId, conditions, hint, mayHaveCountBugOld)
+        if children.nonEmpty =>
         val (newPlan, newCond) = decorrelate(sub, plan)
-        ScalarSubquery(newPlan, children, exprId, getJoinCondition(newCond, conditions), hint)
+        val mayHaveCountBug = if (mayHaveCountBugOld.isEmpty) {
+          // Check whether the pre-rewrite subquery had empty groupingExpressions. If yes, it may
+          // be subject to the COUNT bug. If it has non-empty groupingExpressions, there is
+          // no COUNT bug.
+          val (topPart, havingNode, aggNode) = splitSubquery(sub)

Review Comment:
   The new decorrelation framework also handles the count bug `decorrelate(innerPlan, outerPlan, handleCountBug)` by inserting new joins with the outer query. It's currently being used to handle the count bug for lateral subqueries only since it changes the plan for scalar subqueries. We might want to unify the way we handle the count bug in the future.




[GitHub] [spark] cloud-fan commented on pull request #40811: [SPARK-43098][SQL] Fix correctness COUNT bug when scalar subquery has group by clause

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.

cloud-fan commented on PR #40811:
URL: https://github.com/apache/spark/pull/40811#issuecomment-1514015480

   The streaming failure is unrelated. Thanks, merging to master/3.4!

