You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/03/08 08:10:15 UTC

[GitHub] [spark] chasingegg commented on pull request #35760: [SPARK-37865][SQL] Fix union deduplication correctness bug

chasingegg commented on pull request #35760:
URL: https://github.com/apache/spark/pull/35760#issuecomment-1061512536


   > ### What changes were proposed in this pull request?
   > Fixes a correctness bug in `Union` in the case that there are duplicate output columns. Previously, duplicate columns on one side of the union would result in a duplicate column being output on the other side of the union.
   > 
   > To do so, we go through the union’s child’s output and find the duplicates. For each duplicate set, there is a first duplicate: this one is left alone. All following duplicates are aliased and given a tag; this tag is used to remove ambiguity during resolution.
   > 
   > As the first duplicate is left alone, the user can still select it, avoiding a breaking change. As the later duplicates are given new expression IDs, this fixes the correctness bug.
   > 
   > ### Why are the changes needed?
   > Output of union with duplicate columns in the children was incorrect
   > 
   > ### Does this PR introduce _any_ user-facing change?
   > Example query:
   > 
   > ```
   > SELECT a, a FROM VALUES (1, 1), (1, 2) AS t1(a, b)
   > UNION ALL SELECT c, d FROM VALUES (2, 2), (2, 3) AS t2(c, d)
   > ```
   > 
   > Result before:
   > 
   > ```
   > a | a
   > _ | _
   > 1 | 1
   > 1 | 1
   > 2 | 2
   > 2 | 2
   > ```
   > 
   > Result after:
   > 
   > ```
   > a | a
   > _ | _
   > 1 | 1
   > 1 | 2
   > 2 | 2
   > 2 | 3
   > ```
   > 
   > ### How was this patch tested?
   > Unit tests
   
   Result After should be
   ```
    a | a
    _ | _
    1 | 1
    1 | 1
    2 | 2
    2 | 3
   ```
   ?
   And why we should support `select a from (SELECT a, a FROM VALUES (1, 1), (1, 2) AS t1(a, b)
   UNION ALL SELECT c, d FROM VALUES (2, 2), (2, 3) AS t2(c, d))`? The result after this fix is 
    a
    _ 
    1 
    1 
    2 
    2 
   I think it should be broken because it is ambiguous, instead of choosing the first column.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org