Posted to issues@spark.apache.org by "Bruce Robbins (Jira)" <ji...@apache.org> on 2022/02/07 22:01:00 UTC

[jira] [Created] (SPARK-38133) Grouping by timestamp_ntz will sometimes corrupt the results

Bruce Robbins created SPARK-38133:
-------------------------------------

             Summary: Grouping by timestamp_ntz will sometimes corrupt the results
                 Key: SPARK-38133
                 URL: https://issues.apache.org/jira/browse/SPARK-38133
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.0
            Reporter: Bruce Robbins


Assume this data:
{noformat}
create or replace temp view v1 as
select * from values
  (1, timestamp_ntz'2012-01-01 00:00:00', 10000),
  (2, timestamp_ntz'2012-01-01 00:00:00', 20000),
  (1, timestamp_ntz'2012-01-01 00:00:00', 5000),
  (1, timestamp_ntz'2013-01-01 00:00:00', 48000),
  (2, timestamp_ntz'2013-01-01 00:00:00', 30000)
  as data(a, b, c);
{noformat}
Run the following query:
{noformat}
select *
from v1
pivot (
  sum(c)
  for a in (1, 2)
);
{noformat}
The group-by column {{b}} comes back with a corrupted time portion (every input timestamp has time 00:00:00):
{noformat}
2012-01-01 19:05:19.476736	15000	20000
2013-01-01 19:05:19.476736	48000	30000
Time taken: 2.65 seconds, Fetched 2 row(s)
{noformat}
In fact, _whenever_ the TungstenAggregationIterator is used to group by a timestamp_ntz column, you get incorrect results. For example, with whole-stage codegen disabled:
{noformat}
set spark.sql.codegen.wholeStage=false;
select a, b, sum(c) from v1 group by a, b;
{noformat}
This query also produces corrupted values for {{b}}:
{noformat}
2	2012-01-01 09:32:39.738368	20000
1	2013-01-01 09:32:39.738368	48000
2	2013-01-01 09:32:39.738368	30000
Time taken: 1.927 seconds, Fetched 4 row(s)
{noformat}
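For comparison, the expected aggregates can be derived from the five input rows without Spark. The following is a minimal plain-Python sketch, not Spark code; the names {{rows}}, {{grouped}}, and {{pivoted}} are illustrative only. It shows what both queries should return if the timestamp_ntz values were preserved: every group key keeps the time 00:00:00.

```python
from collections import defaultdict
from datetime import datetime

# The five input rows of view v1, as (a, b, c).
rows = [
    (1, datetime(2012, 1, 1), 10000),
    (2, datetime(2012, 1, 1), 20000),
    (1, datetime(2012, 1, 1), 5000),
    (1, datetime(2013, 1, 1), 48000),
    (2, datetime(2013, 1, 1), 30000),
]

# Equivalent of: select a, b, sum(c) from v1 group by a, b
grouped = defaultdict(int)
for a, b, c in rows:
    grouped[(a, b)] += c

# Equivalent of the pivot query: one row per b, one column per a in (1, 2).
pivoted = {}
for (a, b), total in grouped.items():
    pivoted.setdefault(b, {1: None, 2: None})[a] = total

# Print the group-by result, then the pivot result.
for (a, b), total in sorted(grouped.items()):
    print(a, b, total)
for b, cols in sorted(pivoted.items()):
    print(b, cols[1], cols[2])
```

The sums match the aggregate columns in the outputs above, so only the timestamp_ntz grouping key is affected.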