Posted to issues@spark.apache.org by "Mark Sirek (JIRA)" <ji...@apache.org> on 2019/06/16 19:39:00 UTC
[jira] [Created] (SPARK-28067) Incorrect results in decimal aggregation with whole-stage code gen enabled
Mark Sirek created SPARK-28067:
----------------------------------
Summary: Incorrect results in decimal aggregation with whole-stage code gen enabled
Key: SPARK-28067
URL: https://issues.apache.org/jira/browse/SPARK-28067
Project: Spark
Issue Type: Bug
Components: Spark Core
Affects Versions: 2.4.0, 2.3.0
Environment: Ubuntu LTS 16.04
Oracle Java 1.8.0_201
spark-2.4.3-bin-without-hadoop
spark-shell
Reporter: Mark Sirek
The following test case involving a join followed by a sum aggregation returns the wrong answer for the sum:
val df = Seq(
  (BigDecimal("10000000000000000000"), 1),
  (BigDecimal("10000000000000000000"), 1),
  (BigDecimal("10000000000000000000"), 2),
  (BigDecimal("10000000000000000000"), 2),
  (BigDecimal("10000000000000000000"), 2),
  (BigDecimal("10000000000000000000"), 2),
  (BigDecimal("10000000000000000000"), 2),
  (BigDecimal("10000000000000000000"), 2),
  (BigDecimal("10000000000000000000"), 2),
  (BigDecimal("10000000000000000000"), 2),
  (BigDecimal("10000000000000000000"), 2),
  (BigDecimal("10000000000000000000"), 2)
).toDF("decNum", "intNum")
val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
scala> df2.show(40,false)
+---------------------------------------+
|sum(decNum) |
+---------------------------------------+
|40000000000000000000.000000000000000000|
+---------------------------------------+
The result should be 1040000000000000000000.0000000000000000.
It appears a partial sum is computed per join key: the value returned is exactly the correct sum over just the joined rows matching intNum === 1.
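That claim can be sanity-checked with plain JVM arithmetic, no Spark needed (an illustrative sketch; the object and value names here are made up): the self-join on intNum yields 2*2 = 4 rows for key 1 and 10*10 = 100 rows for key 2, so the key-1 partial sum matches the value Spark returned, while the full sum should be 104 times 10^19.

```scala
// Illustrative check without Spark: count the joined rows per key and
// multiply by the single decNum value, 10^19.
import java.math.BigDecimal

object JoinSumCheck {
  def main(args: Array[String]): Unit = {
    val dec = new BigDecimal("10000000000000000000")          // 10^19
    val key1Partial = dec.multiply(new BigDecimal(2 * 2))     // 4 joined rows
    val fullSum     = dec.multiply(new BigDecimal(2 * 2 + 10 * 10)) // 104 rows
    println(key1Partial)  // 40000000000000000000   (the value Spark returned)
    println(fullSum)      // 1040000000000000000000 (the correct total)
  }
}
```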
If only the rows with intNum === 2 are included, the answer given is null:
scala> val df3 = df.filter($"intNum" === lit(2))
df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: decimal(38,18), intNum: int]
scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, "intNum").agg(sum("decNum"))
df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
scala> df4.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|null |
+-----------+
The correct answer, 1000000000000000000000.0000000000000000, doesn't fit in the DataType picked for the result, decimal(38,18), so the overflow is converted to null.
The first example, which doesn't filter out the intNum === 1 rows, should also return null to indicate overflow, but it doesn't. This may mislead the user into thinking a valid sum was computed.
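The overflow itself can also be demonstrated without Spark (a standalone sketch using java.math.BigDecimal; the object name is made up): the filtered case's true sum, 10^21, stored at the result type's scale of 18, requires 22 integer digits plus 18 fraction digits = 40 significant digits, which exceeds the 38 allowed by decimal(38,18).

```scala
// Standalone check: 10 * 10 joined rows * 10^19 = 10^21. At scale 18
// (the scale of decimal(38,18)) this value needs 40 significant digits.
import java.math.BigDecimal

object OverflowCheck {
  def main(args: Array[String]): Unit = {
    val trueSum = new BigDecimal("1000000000000000000000").setScale(18)
    println(trueSum.precision)  // 40, above the max precision 38, so overflow
  }
}
```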
If whole-stage code gen is turned off:
spark.conf.set("spark.sql.codegen.wholeStage", false)
... the incorrect result is not returned; instead the overflow is caught and surfaced as an exception:
java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org