Posted to issues@spark.apache.org by "Apache Spark (Jira)" <ji...@apache.org> on 2020/12/15 18:34:01 UTC
[jira] [Assigned] (SPARK-33726) Duplicate field names causes wrong
answers during aggregation
[ https://issues.apache.org/jira/browse/SPARK-33726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Apache Spark reassigned SPARK-33726:
------------------------------------
Assignee: (was: Apache Spark)
> Duplicate field names causes wrong answers during aggregation
> -------------------------------------------------------------
>
> Key: SPARK-33726
> URL: https://issues.apache.org/jira/browse/SPARK-33726
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4, 3.0.1
> Reporter: Yian Liou
> Priority: Major
> Labels: correctness
>
> We saw this bug at Workday.
> Duplicate field names for different fields can cause org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch#allocate to return a fixed-length batch when it should return a variable-length batch, which leads to wrong results.
> This example produces wrong results in the spark shell:
> scala> sql("with T as (select id as a, -id as x from range(3)), U as (select id as b, cast(id as string) as x from range(3)) select T.x, U.x, min(a) as ma, min(b) as mb from T join U on a=b group by U.x, T.x").show
>
> | x  | x | ma   | mb   |
> | -2 | 2 | 0    | null |
> | -1 | 1 | null | 1    |
> | 0  | 0 | 0    | 0    |
> instead of the correct output:
> | x  | x | ma | mb |
> | 0  | 0 | 0  | 0  |
> | -2 | 2 | 2  | 2  |
> | -1 | 1 | 1  | 1  |
> The issue can be solved by iterating over the fields themselves instead of field names.
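To illustrate the fix suggested in the description, here is a minimal, self-contained sketch of how a name-driven check can misclassify a schema that contains duplicate field names. It is not Spark's code: the `Field` class and both `allFixedLength...` helpers are hypothetical stand-ins, and the sketch assumes that name lookup goes through a name-keyed map in which a later duplicate overwrites an earlier one. The two fields mirror the grouping keys of the query above (U.x is a string, T.x is a bigint).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DuplicateFieldNames {

    // Hypothetical stand-in for a schema field: a name plus a flag saying
    // whether its data type is fixed-length. Not a Spark class.
    static final class Field {
        final String name;
        final boolean fixedLength;

        Field(String name, boolean fixedLength) {
            this.name = name;
            this.fixedLength = fixedLength;
        }
    }

    // Name-driven check: every field is resolved again through a name-keyed map.
    // Assumption of this sketch: with duplicate names, the later field overwrites
    // the earlier one in the map, so the earlier field is never inspected.
    static boolean allFixedLengthByName(List<Field> schema) {
        Map<String, Field> byName = new HashMap<>();
        for (Field f : schema) {
            byName.put(f.name, f);
        }
        boolean allFixed = true;
        for (Field f : schema) {
            allFixed = allFixed && byName.get(f.name).fixedLength;
        }
        return allFixed;
    }

    // Field-driven check: inspect each field object directly, so duplicate
    // names cannot hide a variable-length field.
    static boolean allFixedLengthByField(List<Field> schema) {
        boolean allFixed = true;
        for (Field f : schema) {
            allFixed = allFixed && f.fixedLength;
        }
        return allFixed;
    }

    public static void main(String[] args) {
        List<Field> groupingKeys = Arrays.asList(
            new Field("x", false),  // U.x: string, variable-length
            new Field("x", true));  // T.x: bigint, fixed-length
        System.out.println(allFixedLengthByName(groupingKeys));  // prints true (wrong)
        System.out.println(allFixedLengthByField(groupingKeys)); // prints false (correct)
    }
}
```

In this sketch the name-driven check reports that all grouping keys are fixed-length even though one of them is a string, which is the kind of misclassification that would select a fixed-length batch where a variable-length batch is needed.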
--
This message was sent by Atlassian Jira
(v8.3.4#803005)