Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/01/22 20:38:00 UTC
[jira] [Comment Edited] (SPARK-28067) Incorrect results in decimal
aggregation with whole-stage code gen enabled
[ https://issues.apache.org/jira/browse/SPARK-28067?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17021471#comment-17021471 ]
Dongjoon Hyun edited comment on SPARK-28067 at 1/22/20 8:37 PM:
----------------------------------------------------------------
I reproduced this issue on 2.1.3, 2.2.3, 2.3.4, and 2.4.4. As the description says, it gives different results depending on the `spark.sql.codegen.wholeStage` value.
{code}
scala> :paste
// Entering paste mode (ctrl-D to finish)
val df = Seq(
(BigDecimal("10000000000000000000"), 1),
(BigDecimal("10000000000000000000"), 1),
(BigDecimal("10000000000000000000"), 2),
(BigDecimal("10000000000000000000"), 2),
(BigDecimal("10000000000000000000"), 2),
(BigDecimal("10000000000000000000"), 2),
(BigDecimal("10000000000000000000"), 2),
(BigDecimal("10000000000000000000"), 2),
(BigDecimal("10000000000000000000"), 2),
(BigDecimal("10000000000000000000"), 2),
(BigDecimal("10000000000000000000"), 2),
(BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2.show(40,false)
// Exiting paste mode, now interpreting.
+---------------------------------------+
|sum(decNum) |
+---------------------------------------+
|40000000000000000000.000000000000000000|
+---------------------------------------+
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
scala> spark.conf.set("spark.sql.codegen.wholeStage", false)
scala> df2.show(40,false)
20/01/22 20:18:10 ERROR Executor: Exception in task 1.0 in stage 4.0 (TID 20)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38
{code}
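For reference, the magnitude of the correct result can be checked with plain BigDecimal arithmetic, no Spark needed (this is an illustrative sketch, not part of the reproduction above): the self-join matches 2*2 + 10*10 = 104 row pairs, so the true sum needs 22 integer digits, which cannot fit in decimal(38,18).

```scala
// Sketch (plain Scala, no Spark): why the true sum must overflow decimal(38,18).
val row = BigDecimal("10000000000000000000")  // one input value, 20 integer digits

// Self-join on intNum: key 1 matches 2*2 = 4 row pairs, key 2 matches 10*10 = 100.
val matchedRows = 2 * 2 + 10 * 10             // 104

val trueSum = row * matchedRows
println(trueSum)                              // 1040000000000000000000

// decimal(38,18) leaves 38 - 18 = 20 digits for the integer part,
// but the true sum has 22 significant digits (scale 0 here), hence the
// "Decimal precision 39 exceeds max precision 38" failure when checked.
println(trueSum.precision)                    // 22
```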
was (Author: dongjoon):
I reproduced this issue on 2.2.3, 2.3.4, and 2.4.4. As the description says, it gives different results depending on the `spark.sql.codegen.wholeStage` value.
> Incorrect results in decimal aggregation with whole-stage code gen enabled
> --------------------------------------------------------------------------
>
> Key: SPARK-28067
> URL: https://issues.apache.org/jira/browse/SPARK-28067
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.2.3, 2.3.4, 2.4.4
> Reporter: Mark Sirek
> Priority: Minor
> Labels: correctness
>
> The following test case involving a join followed by a sum aggregation returns the wrong answer for the sum:
>
> {code:java}
> val df = Seq(
> (BigDecimal("10000000000000000000"), 1),
> (BigDecimal("10000000000000000000"), 1),
> (BigDecimal("10000000000000000000"), 2),
> (BigDecimal("10000000000000000000"), 2),
> (BigDecimal("10000000000000000000"), 2),
> (BigDecimal("10000000000000000000"), 2),
> (BigDecimal("10000000000000000000"), 2),
> (BigDecimal("10000000000000000000"), 2),
> (BigDecimal("10000000000000000000"), 2),
> (BigDecimal("10000000000000000000"), 2),
> (BigDecimal("10000000000000000000"), 2),
> (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
> scala> df2.show(40,false)
> +---------------------------------------+
> |sum(decNum)                            |
> +---------------------------------------+
> |40000000000000000000.000000000000000000|
> +---------------------------------------+
>
> {code}
>
> The result should be 1040000000000000000000.0000000000000000.
> It appears a partial sum is computed for each join key, as the result returned would be the answer for all rows matching intNum === 1.
> If only the rows with intNum === 2 are included, the answer given is null:
>
> {code:java}
> scala> val df3 = df.filter($"intNum" === lit(2))
> df3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [decNum: decimal(38,18), intNum: int]
> scala> val df4 = df3.withColumnRenamed("decNum", "decNum2").join(df3, "intNum").agg(sum("decNum"))
> df4: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
> scala> df4.show(40,false)
> +-----------+
> |sum(decNum)|
> +-----------+
> |null       |
> +-----------+
>
> {code}
>
> The correct answer, 1000000000000000000000.0000000000000000, doesn't fit in the DataType picked for the result, decimal(38,18), so an overflow occurs, which Spark then converts to null.
> The first example, which doesn't filter out the intNum === 1 values, should also return null to indicate overflow, but it doesn't. This may mislead the user into thinking a valid sum was computed.
> If whole-stage code gen is turned off:
> spark.conf.set("spark.sql.codegen.wholeStage", false)
> ... incorrect results are not returned because the overflow is caught as an exception:
> java.lang.IllegalArgumentException: requirement failed: Decimal precision 39 exceeds max precision 38
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org