Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/12 13:43:43 UTC

[GitHub] [spark] peter-toth opened a new pull request, #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

peter-toth opened a new pull request, #37496:
URL: https://github.com/apache/spark/pull/37496

   ### What changes were proposed in this pull request?
   Keep the output attributes of a `Union` node's first child in the `RemoveRedundantAliases` rule to avoid correctness issues.
   
   ### Why are the changes needed?
   To fix the result of the following query:
   ```
   SELECT a, b AS a FROM (
     SELECT a, a AS b FROM (SELECT a FROM VALUES (1) AS t(a))
     UNION ALL
     SELECT a, b FROM (SELECT a, b FROM VALUES (1, 2) AS t(a, b))
   )
   ```
   Before this PR, the query returns an incorrect result:
   ```
   +---+---+
   |  a|  a|
   +---+---+
   |  1|  1|
   |  2|  2|
   +---+---+
   ```
   After this PR it returns the expected result:
   ```
   +---+---+
   |  a|  a|
   +---+---+
   |  1|  1|
   |  1|  2|
   +---+---+
   ```
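   
   The wrong result can be illustrated with a toy model, which is only a sketch and not Spark code. It assumes that attributes are identified by an id rather than by name, and that some rewrite maps the first `Union` child's output attributes to the values of another child by position. `Attr`, `bindByPosition` and `project` are hypothetical names made up for this illustration.
   
   ```scala
   // Toy model only: none of these names are Catalyst APIs.
   object DuplicateOutputDemo {
     // An attribute is identified by its id, not by its name.
     final case class Attr(name: String, id: Int)
   
     // Map each output attribute id of the first Union child to the value found at
     // the same position in a row of another child. Duplicate ids collapse into a
     // single key, and the later position overwrites the earlier one.
     def bindByPosition(firstChildOutput: Seq[Attr], otherChildRow: Seq[Int]): Map[Int, Int] =
       firstChildOutput.map(_.id).zip(otherChildRow).toMap
   
     // Evaluate a parent projection that refers to attributes by id.
     def project(refs: Seq[Int], binding: Map[Int, Int]): Seq[Int] = refs.map(binding)
   
     // The row produced by the second Union child: VALUES (1, 2).
     val secondChildRow: Seq[Int] = Seq(1, 2)
   
     // Alias kept: the two output columns have distinct ids.
     val withAlias: Seq[Attr] = Seq(Attr("a", 1), Attr("a", 2))
     // Alias removed: both output columns are the very same attribute.
     val withoutAlias: Seq[Attr] = Seq(Attr("a", 1), Attr("a", 1))
   
     val good: Seq[Int] =
       project(withAlias.map(_.id), bindByPosition(withAlias, secondChildRow))
     val bad: Seq[Int] =
       project(withoutAlias.map(_.id), bindByPosition(withoutAlias, secondChildRow))
   }
   ```
   
   With the alias kept, the row from `VALUES (1, 2)` stays `(1, 2)`. With the alias removed, both columns collapse onto the same key and the row becomes `(2, 2)`, mirroring the incorrect output above.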
   
   ### Does this PR introduce _any_ user-facing change?
   Yes, fixes a correctness issue.
   
   ### How was this patch tested?
   Added new unit tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] peter-toth commented on a diff in pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

Posted by GitBox <gi...@apache.org>.
peter-toth commented on code in PR #37496:
URL: https://github.com/apache/spark/pull/37496#discussion_r944510782


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -455,6 +457,22 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
         })
         Join(newLeft, newRight, joinType, newCondition, hint)
 
+      case u: Union =>
+        var first = true
+        plan.mapChildren { child =>
+          if (first) {
+            first = false
+            // `Union` inherits its first child's outputs. We don't remove those aliases from the
+            // first child's tree that prevent aliased attributes to appear multiple times in the
+            // `Union`'s output. A parent projection node on the top of an `Union` with non-unique
+            // output attributes could return incorrect result.
+            removeRedundantAliases(child, excluded ++ child.outputSet)
+          } else {
+            // We don't need to exclude those attributes that `Union` inherits from its first child.

Review Comment:
   I've no idea yet why this didn't come up in newer versions, checking now...





[GitHub] [spark] peter-toth commented on pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

Posted by GitBox <gi...@apache.org>.
peter-toth commented on PR #37496:
URL: https://github.com/apache/spark/pull/37496#issuecomment-1213144688

   @cloud-fan, this one is a bit different from the 3.4, 3.3 and 3.2 PRs.
   As you can see, after the original change (https://github.com/apache/spark/pull/37496/commits/41acae298176640288208a5c4dd383a2afab6432) the plan regeneration (https://github.com/apache/spark/pull/37496/commits/ae87e3686bdff1834afbf829b7c056ef555fef57) didn't modify the expected `sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q14a.sf100/explain.txt` and `sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v1_4/q14a/explain.txt`, but it did modify a lot of golden files under `sql/core/src/test/resources/tpcds-plan-stability/approved-plans-v2_7/`, which it didn't in the other versions.
   
   It turned out that
   - the missing changes under `v1_4` are because in 3.1 we compare only the simplified golden files to detect changes.
   - the new changes under `v2_7` are because in those queries we have a parent `Union` whose first child is also a `Union` node, and the child `Union`'s children have intersecting output sets, so the aliases in the 2nd+ children of the child `Union` are also kept. I've fixed this issue with this change: https://github.com/apache/spark/pull/37496/commits/b508ca23bcd0e96c7c5b664d67bbe626262317ac so that we no longer keep those attributes in the 2nd+ children of a `Union` node that intersect with the first child's output. This fix allowed me to revert the plan regeneration: https://github.com/apache/spark/pull/37496/commits/90c2a8c7d6226cc0e8b6e03c6d6e237b0c72719e
   
   I think we should probably land the fix commit https://github.com/apache/spark/pull/37496/commits/b508ca23bcd0e96c7c5b664d67bbe626262317ac in 3.4, 3.3 and 3.2 too.
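   
   The nested `Union` case described above can be sketched with a small self-contained model (not the Catalyst implementation). `Plan`, `Leaf`, `Union` and `excludedPerLeaf` are hypothetical stand-ins, attributes are plain strings, and the `subtractInherited` flag models the fix under the assumption that the 2nd+ children simply drop the attributes inherited from the first child from the excluded set.
   
   ```scala
   // Toy model only: these types are illustrative, not Catalyst classes.
   object UnionExcludedDemo {
     sealed trait Plan { def output: Set[String] }
     final case class Leaf(name: String, output: Set[String]) extends Plan
     final case class Union(children: Seq[Plan]) extends Plan {
       // A Union exposes its first child's output.
       def output: Set[String] = children.head.output
     }
   
     // For every leaf, collect the set of attributes whose aliases must be kept.
     def excludedPerLeaf(
         plan: Plan,
         excluded: Set[String],
         subtractInherited: Boolean): Map[String, Set[String]] = plan match {
       case Leaf(name, _) => Map(name -> excluded)
       case u: Union =>
         val inherited = u.children.head.output
         // First child: additionally protect the attributes the Union inherits from it.
         val first = excludedPerLeaf(u.children.head, excluded ++ inherited, subtractInherited)
         // Other children: with the (assumed) fix, inherited attributes are not protected.
         val rest = u.children.tail.map { child =>
           val ex = if (subtractInherited) excluded -- inherited else excluded
           excludedPerLeaf(child, ex, subtractInherited)
         }
         rest.foldLeft(first)(_ ++ _)
     }
   
     // A parent Union whose first child is also a Union; the inner children share "a".
     val plan: Plan = Union(Seq(
       Union(Seq(Leaf("inner1", Set("a", "b")), Leaf("inner2", Set("a", "c")))),
       Leaf("outer2", Set("d"))))
   
     val before: Map[String, Set[String]] =
       excludedPerLeaf(plan, Set.empty, subtractInherited = false)
     val after: Map[String, Set[String]] =
       excludedPerLeaf(plan, Set.empty, subtractInherited = true)
   }
   ```
   
   In this model `inner2` is handed `Set(a, b)` as excluded attributes before the fix and an empty set after it, while `inner1` keeps `Set(a, b)` in both cases.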




[GitHub] [spark] peter-toth commented on a diff in pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

Posted by GitBox <gi...@apache.org>.
peter-toth commented on code in PR #37496:
URL: https://github.com/apache/spark/pull/37496#discussion_r944539140


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -455,6 +457,22 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
         })
         Join(newLeft, newRight, joinType, newCondition, hint)
 
+      case u: Union =>
+        var first = true
+        plan.mapChildren { child =>
+          if (first) {
+            first = false
+            // `Union` inherits its first child's outputs. We don't remove those aliases from the
+            // first child's tree that prevent aliased attributes to appear multiple times in the
+            // `Union`'s output. A parent projection node on the top of an `Union` with non-unique
+            // output attributes could return incorrect result.
+            removeRedundantAliases(child, excluded ++ child.outputSet)
+          } else {
+            // We don't need to exclude those attributes that `Union` inherits from its first child.

Review Comment:
   Yes, we did combine `Union`s in newer versions and the issue didn't show up.
   
   Although we didn't see extra aliases kept in 3.4, 3.3 and 3.2, shall I open follow-up PRs with the fix commit?





[GitHub] [spark] cloud-fan commented on pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on PR #37496:
URL: https://github.com/apache/spark/pull/37496#issuecomment-1215029783

   the test failure is unrelated, merging to 3.1, thanks!




[GitHub] [spark] cloud-fan commented on a diff in pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37496:
URL: https://github.com/apache/spark/pull/37496#discussion_r945752379


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -455,6 +457,22 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
         })
         Join(newLeft, newRight, joinType, newCondition, hint)
 
+      case u: Union =>
+        var first = true
+        plan.mapChildren { child =>
+          if (first) {
+            first = false
+            // `Union` inherits its first child's outputs. We don't remove those aliases from the
+            // first child's tree that prevent aliased attributes to appear multiple times in the
+            // `Union`'s output. A parent projection node on the top of an `Union` with non-unique
+            // output attributes could return incorrect result.
+            removeRedundantAliases(child, excluded ++ child.outputSet)
+          } else {
+            // We don't need to exclude those attributes that `Union` inherits from its first child.

Review Comment:
   > shall I open follow-up PRs with the fix commit?
   
   Yes please, to make the code more defensive.





[GitHub] [spark] peter-toth commented on a diff in pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

Posted by GitBox <gi...@apache.org>.
peter-toth commented on code in PR #37496:
URL: https://github.com/apache/spark/pull/37496#discussion_r944539140


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -455,6 +457,22 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
         })
         Join(newLeft, newRight, joinType, newCondition, hint)
 
+      case u: Union =>
+        var first = true
+        plan.mapChildren { child =>
+          if (first) {
+            first = false
+            // `Union` inherits its first child's outputs. We don't remove those aliases from the
+            // first child's tree that prevent aliased attributes to appear multiple times in the
+            // `Union`'s output. A parent projection node on the top of an `Union` with non-unique
+            // output attributes could return incorrect result.
+            removeRedundantAliases(child, excluded ++ child.outputSet)
+          } else {
+            // We don't need to exclude those attributes that `Union` inherits from its first child.

Review Comment:
   Yes, we did combine `Union`s in newer versions and the issue didn't show up.
   
   Although we didn't see extra aliases kept in 3.4, 3.3 and 3.2, shall I open follow-up PRs with the fix commit?
   I mean that in some cases we might not be able to combine `Union`s (maybe they are not direct parent-child nodes, but there is a node in between them), yet the child (descendant) `Union` is still in the parent's first child tree...
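   
   As a rough illustration of this concern, here is a toy model (hypothetical `Plan`, `Leaf`, `Wrapper`, `Union` and `combine`, not Spark's optimizer rule) in which only directly nested `Union`s get flattened. A `Union` hidden behind any other node stays in the first child's tree.
   
   ```scala
   // Toy model only: `Wrapper` stands in for any node sitting between two Unions.
   object CombineUnionsDemo {
     sealed trait Plan
     final case class Leaf(name: String) extends Plan
     final case class Wrapper(child: Plan) extends Plan
     final case class Union(children: Seq[Plan]) extends Plan
   
     // Flatten a Union into its parent only when it is a direct child of that parent.
     def combine(plan: Plan): Plan = plan match {
       case Union(children) =>
         Union(children.map(combine).flatMap {
           case Union(grandChildren) => grandChildren
           case other => Seq(other)
         })
       case Wrapper(child) => Wrapper(combine(child))
       case leaf: Leaf => leaf
     }
   
     // Directly nested Unions: the inner one disappears after combining.
     val adjacent: Plan =
       Union(Seq(Union(Seq(Leaf("a"), Leaf("b"))), Leaf("c")))
     // A node in between: the inner Union survives as a descendant of the first child.
     val separated: Plan =
       Union(Seq(Wrapper(Union(Seq(Leaf("a"), Leaf("b")))), Leaf("c")))
   }
   ```
   
   Here `combine(adjacent)` yields a single flat `Union` of the three leaves, whereas `combine(separated)` returns the plan unchanged, so the nested case would still need to be handled by the alias-removal logic itself.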





[GitHub] [spark] cloud-fan closed pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

Posted by GitBox <gi...@apache.org>.
cloud-fan closed pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique
URL: https://github.com/apache/spark/pull/37496




[GitHub] [spark] cloud-fan commented on a diff in pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

Posted by GitBox <gi...@apache.org>.
cloud-fan commented on code in PR #37496:
URL: https://github.com/apache/spark/pull/37496#discussion_r944509115


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -455,6 +457,22 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
         })
         Join(newLeft, newRight, joinType, newCondition, hint)
 
+      case u: Union =>
+        var first = true
+        plan.mapChildren { child =>
+          if (first) {
+            first = false
+            // `Union` inherits its first child's outputs. We don't remove those aliases from the
+            // first child's tree that prevent aliased attributes to appear multiple times in the
+            // `Union`'s output. A parent projection node on the top of an `Union` with non-unique
+            // output attributes could return incorrect result.
+            removeRedundantAliases(child, excluded ++ child.outputSet)
+          } else {
+            // We don't need to exclude those attributes that `Union` inherits from its first child.

Review Comment:
   Is it not an issue in newer Spark versions because we combine adjacent `Union`s?





[GitHub] [spark] peter-toth commented on a diff in pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

Posted by GitBox <gi...@apache.org>.
peter-toth commented on code in PR #37496:
URL: https://github.com/apache/spark/pull/37496#discussion_r945859683


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -455,6 +457,22 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
         })
         Join(newLeft, newRight, joinType, newCondition, hint)
 
+      case u: Union =>
+        var first = true
+        plan.mapChildren { child =>
+          if (first) {
+            first = false
+            // `Union` inherits its first child's outputs. We don't remove those aliases from the
+            // first child's tree that prevent aliased attributes to appear multiple times in the
+            // `Union`'s output. A parent projection node on the top of an `Union` with non-unique
+            // output attributes could return incorrect result.
+            removeRedundantAliases(child, excluded ++ child.outputSet)
+          } else {
+            // We don't need to exclude those attributes that `Union` inherits from its first child.

Review Comment:
   Sure, I will open the PRs tomorrow and let you know.





[GitHub] [spark] peter-toth commented on a diff in pull request #37496: [SPARK-39887][SQL][3.1] RemoveRedundantAliases should keep aliases that make the output of projection nodes unique

Posted by GitBox <gi...@apache.org>.
peter-toth commented on code in PR #37496:
URL: https://github.com/apache/spark/pull/37496#discussion_r946458325


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala:
##########
@@ -455,6 +457,22 @@ object RemoveRedundantAliases extends Rule[LogicalPlan] {
         })
         Join(newLeft, newRight, joinType, newCondition, hint)
 
+      case u: Union =>
+        var first = true
+        plan.mapChildren { child =>
+          if (first) {
+            first = false
+            // `Union` inherits its first child's outputs. We don't remove those aliases from the
+            // first child's tree that prevent aliased attributes to appear multiple times in the
+            // `Union`'s output. A parent projection node on the top of an `Union` with non-unique
+            // output attributes could return incorrect result.
+            removeRedundantAliases(child, excluded ++ child.outputSet)
+          } else {
+            // We don't need to exclude those attributes that `Union` inherits from its first child.

Review Comment:
   Here is the follow-up PR: https://github.com/apache/spark/pull/37534


