Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2021/04/15 04:21:48 UTC

[GitHub] [spark] allisonwang-db commented on pull request #32179: [SPARK-35080][SQL] Only allow a subset of correlated equality predicates when a subquery is aggregated

allisonwang-db commented on pull request #32179:
URL: https://github.com/apache/spark/pull/32179#issuecomment-820083789


   > BTW, is it necessary to be a subquery with aggregation? From the fix, I cannot tell how aggregation affects it.
   
   SPARK-17348 explains why the aggregate causes the issue. Basically, when a correlated predicate is pulled up through an Aggregate, the inner-query attributes that the predicate references are added as GROUP BY columns. When the mapping is not one-to-one, for instance `a + b = outer(c)` in the example above, both `a` and `b` are added as GROUP BY columns, so `count(*)` counts the number of rows for each combination of (a, b) instead of for each value of (a + b).
   https://github.com/apache/spark/blob/7ff9d2e3eec514962e891420dbb3961e85826612/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L878-L901
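To make the grouping problem concrete, here is a minimal sketch in plain Python rather than Spark. The rows and the outer value are made up for illustration and are not taken from the PR; the `Counter` grouping only mimics what the rewritten plan would compute, it is not how Catalyst implements it.

```python
from collections import Counter

# Hypothetical inner-table rows (a, b) and a single outer value c.
inner_rows = [(1, 2), (2, 1), (1, 2), (0, 3)]
outer_c = 3

# Intended semantics of the subquery: one count(*) over all rows with a + b = c.
expected = sum(1 for a, b in inner_rows if a + b == outer_c)

# After the predicate is pulled up, the aggregate is grouped by (a, b),
# and a + b = c is only applied afterwards, against the grouped result.
counts_by_a_b = Counter(inner_rows)
matched_groups = sorted(n for (a, b), n in counts_by_a_b.items() if a + b == outer_c)

# Grouping by the expression a + b would keep the mapping one-to-one.
counts_by_sum = Counter(a + b for a, b in inner_rows)

print(expected)                 # a single count for the outer row
print(matched_groups)           # several partial counts for the same outer row
print(counts_by_sum[outer_c])   # agrees with the intended count
```

With this data the outer row matches three (a, b) groups, each carrying only a partial count, while grouping by `a + b` yields the single count the query asked for.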
   
   Pull up correlated predicates through Aggregate:
   https://github.com/apache/spark/blob/3e218ade9cf6becc5de8b20a4385e345021a509d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/subquery.scala#L258-L264

