You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/08 05:39:47 UTC

[GitHub] [spark] karenfeng opened a new pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

karenfeng opened a new pull request #35760:
URL: https://github.com/apache/spark/pull/35760


   ### What changes were proposed in this pull request?
   
   Fixes a correctness bug in `Union` in the case that there are duplicate output columns. Previously, duplicate columns on one side of the union would result in a duplicate column being output on the other side of the union.
   
   To do so, we go through the union’s child’s output and find the duplicates. For each duplicate set, there is a first duplicate: this one is left alone. All following duplicates are aliased and given a tag; this tag is used to remove ambiguity during resolution.
   
   As the first duplicate is left alone, the user can still select it, avoiding a breaking change. As the later duplicates are given new expression IDs, this fixes the correctness bug.
   
   ### Why are the changes needed?
   
   Output of union with duplicate columns in the children was incorrect
   
   ### Does this PR introduce _any_ user-facing change?
   
   Example query:
   ```
   SELECT a, a FROM VALUES (1, 1), (1, 2) AS t1(a, b)
   UNION ALL SELECT c, d FROM VALUES (2, 2), (2, 3) AS t2(c, d)
   ```
   
   Result before:
   ```
   a | a
   _ | _
   1 | 1
   1 | 1
   2 | 2
   2 | 2
   ```
   
   Result after:
   ```
   a | a
   _ | _
   1 | 1
   1 | 2
   2 | 2
   2 | 3
   ```
   
   ### How was this patch tested?
   
   Unit tests


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #35760:
URL: https://github.com/apache/spark/pull/35760#issuecomment-1061524464


   > Result After should be ...
   
   Seems like a typo. Looking at the test, the result is corrected. @karenfeng can you double-check?
   
   > And why we should support select a from (SELECT a, a ...
   
   Yea it's ambiguous but it's already supported before this PR. We can't find a way to fix this without a breaking change so claiming that it selects the first `a` column looks acceptable. Better ideas are welcome!
   
   Note: we can fix this problem more completely at the master branch with a bit of breaking changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] viirya commented on a change in pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

viirya commented on a change in pull request #35760:
URL: https://github.com/apache/spark/pull/35760#discussion_r821395163



##########
File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
##########
@@ -3127,6 +3127,69 @@ class DataFrameSuite extends QueryTest
       checkAnswer(df, Row(Seq("string2")) :: Row(Seq("string5")) :: Nil)
     }
   }
+
+  test("SPARK-37865: Do not deduplicate grouping expressions in union") {

Review comment:
       Is there grouping expressions? I guess you mean duplicate output columns?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] singhpk234 commented on a change in pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

singhpk234 commented on a change in pull request #35760:
URL: https://github.com/apache/spark/pull/35760#discussion_r821494135



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -1380,6 +1380,31 @@ class Analyzer(override val catalogManager: CatalogManager)
         throw QueryCompilationErrors.invalidStarUsageError("explode/json_tuple/UDTF",
           extractStar(g.generator.children))
 
+      case u @ Union(children, _, _)
+        // if there are duplicate output columns, give them unique expr ids
+          if children.exists(c => c.output.map(_.exprId).distinct.length < c.output.length) =>
+        val newChildren = children.map { c =>
+          if (c.output.map(_.exprId).distinct.length < c.output.length) {
+            val existingExprIds = mutable.Set[ExprId]()
+            val projectList = c.output.map { attr =>
+              if (existingExprIds.contains(attr.exprId)) {
+                // replace non-first duplicates with aliases and tag them
+                val newMetadata = new MetadataBuilder().withMetadata(attr.metadata)
+                  .putNull("__is_duplicate").build()
+                Alias(attr, attr.name)(explicitMetadata = Some(newMetadata))
+              } else {
+                // leave first duplicate alone
+                existingExprIds.add(attr.exprId)
+                attr
+              }
+            }
+            Project(projectList, c)
+          } else {
+            c
+          }
+        }
+        u.copy(children = newChildren)

Review comment:
       [minor] can we use `withNewChildren` instead ? 

##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -1380,6 +1380,31 @@ class Analyzer(override val catalogManager: CatalogManager)
         throw QueryCompilationErrors.invalidStarUsageError("explode/json_tuple/UDTF",
           extractStar(g.generator.children))
 
+      case u @ Union(children, _, _)
+        // if there are duplicate output columns, give them unique expr ids
+          if children.exists(c => c.output.map(_.exprId).distinct.length < c.output.length) =>
+        val newChildren = children.map { c =>
+          if (c.output.map(_.exprId).distinct.length < c.output.length) {
+            val existingExprIds = mutable.Set[ExprId]()

Review comment:
       [nit] can we use mutable.HashSet instead ? 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] chasingegg commented on pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

chasingegg commented on pull request #35760:
URL: https://github.com/apache/spark/pull/35760#issuecomment-1062571313


   > > And why we should support select a from (SELECT a, a ...
   > 
   > What are the behaviors of other engines such as PostgreSQL, MySQL?
   
   They will fail such queries. Currently this fix will not fail such queries to avoid breaking changes as discussed above.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan closed pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

cloud-fan closed pull request #35760:
URL: https://github.com/apache/spark/pull/35760


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] dongjoon-hyun commented on a change in pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

dongjoon-hyun commented on a change in pull request #35760:
URL: https://github.com/apache/spark/pull/35760#discussion_r821389146



##########
File path: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala
##########
@@ -1380,6 +1380,31 @@ class Analyzer(override val catalogManager: CatalogManager)
         throw QueryCompilationErrors.invalidStarUsageError("explode/json_tuple/UDTF",
           extractStar(g.generator.children))
 
+      case u @ Union(children, _, _)
+        // if there are duplicate output columns, give them unique expr ids
+        if children.exists(c => c.output.map(_.exprId).distinct.length < c.output.length) =>

Review comment:
       nit. two more spaces since this is a continuation of the above `case` statement?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karenfeng commented on pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

karenfeng commented on pull request #35760:
URL: https://github.com/apache/spark/pull/35760#issuecomment-1061425872


   FYI @cloud-fan, @LantaoJin, and @chasingegg; this PR builds off https://github.com/apache/spark/pull/35168.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] wangyum commented on pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

wangyum commented on pull request #35760:
URL: https://github.com/apache/spark/pull/35760#issuecomment-1066242186


   @karenfeng Could you fix the compilation issue on branch-3.0: https://github.com/apache/spark/runs/5473941519?check_suite_focus=true
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] LantaoJin commented on pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

LantaoJin commented on pull request #35760:
URL: https://github.com/apache/spark/pull/35760#issuecomment-1062476004


    > And why we should support select a from (SELECT a, a ...
   
   What are the behaviors of other engines such as PostgreSQL, MySQL?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] chasingegg commented on pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

chasingegg commented on pull request #35760:
URL: https://github.com/apache/spark/pull/35760#issuecomment-1061512536


   > ### What changes were proposed in this pull request?
   > Fixes a correctness bug in `Union` in the case that there are duplicate output columns. Previously, duplicate columns on one side of the union would result in a duplicate column being output on the other side of the union.
   > 
   > To do so, we go through the union’s child’s output and find the duplicates. For each duplicate set, there is a first duplicate: this one is left alone. All following duplicates are aliased and given a tag; this tag is used to remove ambiguity during resolution.
   > 
   > As the first duplicate is left alone, the user can still select it, avoiding a breaking change. As the later duplicates are given new expression IDs, this fixes the correctness bug.
   > 
   > ### Why are the changes needed?
   > Output of union with duplicate columns in the children was incorrect
   > 
   > ### Does this PR introduce _any_ user-facing change?
   > Example query:
   > 
   > ```
   > SELECT a, a FROM VALUES (1, 1), (1, 2) AS t1(a, b)
   > UNION ALL SELECT c, d FROM VALUES (2, 2), (2, 3) AS t2(c, d)
   > ```
   > 
   > Result before:
   > 
   > ```
   > a | a
   > _ | _
   > 1 | 1
   > 1 | 1
   > 2 | 2
   > 2 | 2
   > ```
   > 
   > Result after:
   > 
   > ```
   > a | a
   > _ | _
   > 1 | 1
   > 1 | 2
   > 2 | 2
   > 2 | 3
   > ```
   > 
   > ### How was this patch tested?
   > Unit tests
   
   Result After should be
   ```
    a | a
    _ | _
    1 | 1
    1 | 1
    2 | 2
    2 | 3
   ```
   ?
   And why we should support `select a from (SELECT a, a FROM VALUES (1, 1), (1, 2) AS t1(a, b)
   UNION ALL SELECT c, d FROM VALUES (2, 2), (2, 3) AS t2(c, d))`? The result after this fix is 
    a
    _ 
    1 
    1 
    2 
    2 
   I think it should be broken because it is ambiguous, instead of choosing the first column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] karenfeng commented on pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

karenfeng commented on pull request #35760:
URL: https://github.com/apache/spark/pull/35760#issuecomment-1061549045


   @chasingegg Good catch, thanks! I fixed the PR description.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] cloud-fan commented on pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

Posted by GitBox <gi...@apache.org>.

cloud-fan commented on pull request #35760:
URL: https://github.com/apache/spark/pull/35760#issuecomment-1062466666


   thanks, merging to master/3.2/3.1/3.0!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org