Posted to issues@spark.apache.org by "Josh Rosen (Jira)" <ji...@apache.org> on 2022/07/06 20:44:00 UTC

[jira] [Updated] (SPARK-37865) Spark should not dedup the groupingExpressions when the first child of Union has duplicate columns

     [ https://issues.apache.org/jira/browse/SPARK-37865?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Josh Rosen updated SPARK-37865:
-------------------------------
    Labels: correctness  (was: )

> Spark should not dedup the groupingExpressions when the first child of Union has duplicate columns
> --------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-37865
>                 URL: https://issues.apache.org/jira/browse/SPARK-37865
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.3.0
>            Reporter: Chao Gao
>            Assignee: Karen Feng
>            Priority: Major
>              Labels: correctness
>             Fix For: 3.1.3, 3.0.4, 3.3.0, 3.2.2
>
>
> When the first child of a Union has duplicate columns, as in select a, a from t1 union select a, b from t2, Spark uses only the first column to aggregate the results, which makes the results incorrect. This behavior is inconsistent with other engines such as PostgreSQL and MySQL. We could alias the attributes of the first child of the Union to resolve this, or one could argue that this is a feature of Spark SQL.
> Sample query:
> SELECT
>   a,
>   a
> FROM VALUES (1, 1), (1, 2) AS t1(a, b)
> UNION
> SELECT
>   a,
>   b
> FROM VALUES (1, 1), (1, 2) AS t2(a, b)
> Result in Spark:
> (1,1)
> Result from PostgreSQL and MySQL:
> (1,1)
> (1,2)
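
The difference between the two results can be reproduced with a small simulation in plain Python. This is only a sketch under assumptions: it does not run Spark or reflect its actual planner code, and the helper names (distinct_by_position, distinct_by_deduped_name) are hypothetical. It models UNION as a distinct over the combined rows and contrasts grouping on every column position with grouping on keys that were deduplicated by column name, which is roughly what the issue title describes.

```python
# Rows produced by each side of the UNION in the sample query.
t1 = [(1, 1), (1, 2)]
t2 = [(1, 1), (1, 2)]
left = [(a, a) for (a, b) in t1]   # select a, a from t1
right = [(a, b) for (a, b) in t2]  # select a, b from t2
rows = left + right                # UNION ALL of both children

# Output column names are taken from the first child: both are "a".
names = ["a", "a"]


def distinct_by_position(rows):
    """Expected semantics: every column position is a grouping key."""
    return sorted(set(rows))


def distinct_by_deduped_name(rows, names):
    """Simplified model of the reported bug (an assumption, not Spark code):
    grouping keys are deduplicated by name, so only the first position of
    each distinct name takes part in grouping."""
    first_pos = {}
    for i, name in enumerate(names):
        first_pos.setdefault(name, i)
    order = list(first_pos)
    keys = sorted({tuple(r[i] for i in first_pos.values()) for r in rows})
    # Every output column is read back from the key of its name.
    return [tuple(k[order.index(n)] for n in names) for k in keys]


expected = distinct_by_position(rows)
buggy = distinct_by_deduped_name(rows, names)
print("expected:", expected)
print("buggy:   ", buggy)
```

In this model, the position-based version keeps both (1,1) and (1,2), while the name-deduplicated version collapses everything to (1,1), matching the two results quoted in the issue.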



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
