Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/18 21:55:40 UTC

[GitHub] [spark] skambha opened a new pull request #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

skambha opened a new pull request #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629
 
 
   ### What changes were proposed in this pull request?
   
   JIRA SPARK-28067:  Wrong results are returned for aggregate sum with decimals with whole stage codegen enabled
   
   **Repro:** 
   WholeStage codegen enabled -> wrong results
   WholeStage codegen disabled -> throws exception: Decimal precision 39 exceeds max precision 38
   
   **Issues:** 
   1. Wrong results are returned.
   2. Behavior is inconsistent between whole-stage codegen enabled and disabled.
   
   **Cause:**
   Sum does not account for the possibility of overflow in the intermediate steps, i.e. in the updateExpressions and mergeExpressions.
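   The intermediate overflow can be sketched outside Spark with plain java.math.BigDecimal (an illustrative standalone model, not Spark's actual Sum or Decimal code; `DecimalSumSketch`, `MaxPrecision`, and `checkedAdd` are hypothetical names):
   
   ```scala
   import java.math.{BigDecimal => JBigDecimal}
   
   // Standalone sketch of the proposed checked update step. In Spark the
   // buffer is a Decimal(38, 18); here we only model the precision check.
   object DecimalSumSketch {
     val MaxPrecision = 38 // Spark's maximum decimal precision
   
     // Checked "updateExpression": add one value to the buffer and throw
     // if the running sum no longer fits in the declared precision.
     def checkedAdd(buffer: JBigDecimal, value: JBigDecimal): JBigDecimal = {
       val sum = buffer.add(value)
       if (sum.precision > MaxPrecision) {
         throw new ArithmeticException(
           s"Decimal precision ${sum.precision} exceeds max precision $MaxPrecision")
       }
       sum
     }
   
     def main(args: Array[String]): Unit = {
       // decimal(38,18) value from the repro: 20 integer digits + 18 fraction digits
       val v = new JBigDecimal("10000000000000000000.000000000000000000")
       // Two values still fit (precision 38)...
       val two = checkedAdd(checkedAdd(JBigDecimal.ZERO, v), v)
       println(two.precision) // 38
       // ...but ten values reach precision 39, so the checked update throws.
       try {
         (1 to 10).foldLeft(JBigDecimal.ZERO)((b, _) => checkedAdd(b, v))
       } catch {
         case e: ArithmeticException => println(e.getMessage)
       }
     }
   }
   ```
   
   With an unchecked fixed-width buffer, the value at precision 39 no longer fits in 38 digits, which is how the silent wrong result arises.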
   
   **This PR makes the following changes:**
   - Throw an exception if there is a decimal overflow when computing the sum.
   - This will be consistent with how Spark behaves when whole stage codegen is disabled.
   
   **Pros:** 
   - No wrong results
   - Consistent behavior between wholestage enabled and disabled
   - Other databases behave similarly, so there is precedent
   
   **Before Fix:** - WRONG RESULTS
   ```
   scala> val df = Seq(
        |  (BigDecimal("10000000000000000000"), 1),
        |  (BigDecimal("10000000000000000000"), 1),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
   df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]
   
   scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
   df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
   
   scala> df2.show(40,false)
   +---------------------------------------+                                       
   |sum(decNum)                            |
   +---------------------------------------+
   |20000000000000000000.000000000000000000|
   +---------------------------------------+
   ```
   
   **After fix:**
   ```
   scala> val df = Seq(
        |  (BigDecimal("10000000000000000000"), 1),
        |  (BigDecimal("10000000000000000000"), 1),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2),
        |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
   df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]
   
   scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
   df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]
   
   scala> df2.show(40,false)
   20/02/18 13:36:19 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 9)    
   java.lang.ArithmeticException: Decimal(expanded,100000000000000000000.000000000000000000,39,18) cannot be represented as Decimal(38, 18).
   ```
   
   ### Why are the changes needed?
   The changes are needed in order to fix the wrong results that are returned for decimal aggregate sum.
   
   ### Does this PR introduce any user-facing change?
   Prior to this fix, users would see wrong results for an aggregate sum involving decimal overflow; now they will see an exception. This behavior is also consistent with how Spark behaves when whole-stage codegen is disabled.
   
   ### How was this patch tested?
   A new test has been added, and the existing sql, catalyst, and hive test suites all pass.
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] [spark] mgaido91 commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

Posted by GitBox <gi...@apache.org>.
mgaido91 commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#issuecomment-588495360
 
 
   This PR would introduce regressions. Checking every sum means that temporary overflows would cause an exception. E.g., if you sum MAX_INT, 10, -100, then MAX_INT + 10 would cause the exception. In the current code, this sum is handled properly and returns the correct result, because the temporary overflow is fixed by summing -100. So we would raise exceptions even when not needed. IIRC, other DBs handle this properly, so temporary overflows don't cause exceptions.
   
   The proper fix for this would be to use as the buffer a larger data type than the returned one. I remember I had a PR for that (https://github.com/apache/spark/pull/25347). You can check its comments and history.
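   The MAX_INT example above can be made concrete with a standalone sketch (illustrative only; `Math.addExact` stands in for a per-step overflow check and is not what Spark's Sum uses):
   
   ```scala
   // Standalone sketch of the "temporary overflow" concern described above.
   object TemporaryOverflowDemo {
     def main(args: Array[String]): Unit = {
       val values = Seq(Int.MaxValue, 10, -100)
   
       // Unchecked Int arithmetic: Int.MaxValue + 10 wraps around, but
       // adding -100 wraps back, so the final (in-range) result is exact.
       val unchecked = values.foldLeft(0)(_ + _)
       println(unchecked == Int.MaxValue - 90) // true
   
       // A check on every partial sum would throw on the intermediate
       // Int.MaxValue + 10 step, even though the final result fits.
       val threw =
         try { values.foldLeft(0)(Math.addExact(_, _)); false }
         catch { case _: ArithmeticException => true }
       println(threw) // true
   
       // A wider buffer (here Long for Int inputs) sidesteps the problem:
       // no partial sum overflows, and the final value is in Int range.
       val widened = values.foldLeft(0L)(_ + _)
       println(widened == (Int.MaxValue - 90).toLong) // true
     }
   }
   ```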



[GitHub] [spark] skambha commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

Posted by GitBox <gi...@apache.org>.
skambha commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#issuecomment-589484157
 
 
   @mgaido91, since you worked on a lot of the overflow issues, I'd appreciate it if you could review the two approaches listed in [SPARK-28067](https://issues.apache.org/jira/browse/SPARK-28067?focusedCommentId=17039403&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17039403) and add your thoughts. Thanks.



[GitHub] [spark] skambha commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

Posted by GitBox <gi...@apache.org>.
skambha commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#issuecomment-587924941
 
 
   Please see [my notes](https://issues.apache.org/jira/browse/SPARK-28067?focusedCommentId=17039403&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17039403) in the JIRA for the two approaches to fixing this issue. This PR implements the approach 1 fix, which is simpler and more straightforward than the approach 2 PR.
   
   I have another [PR, 27627](https://github.com/apache/spark/pull/27627), that takes approach 2 to fix this issue. Both fix the incorrect results; each has its pros and cons, as listed in my comment on the JIRA.



[GitHub] [spark] skambha commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

Posted by GitBox <gi...@apache.org>.
skambha commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#issuecomment-589479357
 
 
   > This PR would introduce regressions. Checking every sum means that temporary overflows would cause an exception. E.g., if you sum MAX_INT, 10, -100, then MAX_INT + 10 would cause the exception. In the current code, this sum is handled properly and returns the correct result, because the temporary overflow is fixed by summing -100. So we would raise exceptions even when not needed. IIRC, other DBs handle this properly, so temporary overflows don't cause exceptions.
   > 
   
   I see what you are saying, but this PR targets only the aggregate sum of the decimal type (where the result type is decimal), not int or long. Sum of ints is handled the same way as before, so the use case mentioned above does not regress. [1]
   
   This PR is trying to handle the use case regarding aggregate Sum for decimal:
   - Sum of decimal type overflows and returns wrong results.
   - Note: in the current code (even without this PR), the same sum on a decimal type throws an exception when whole-stage codegen is disabled.
   
   (Furthermore, even though spark.sql.ansi.enabled is supposed to control whether overflows return null, sum does not return null here regardless of that setting.)
   
   Here, we are dealing with a correctness issue. This PR's approach is to match the result returned by the codepath with whole-stage codegen disabled.
   
   This issue is actually mentioned in the PR for SPARK-23179 [3] as a special case, and SPARK-28224 partially addressed it.
   
   fwiw, I checked this on MS SQL Server and it throws an error as well. [2]
   
   > The proper fix for this would be to use as buffer a larger data type than the returned one. I remember I had a PR for that (#25347). You can check the comments and history of it.
   
   Sure. I checked #25347; it deals with widening the data type for the aggregate sum of longs to decimal to avoid temporary overflow. The decision was not to make that change because a) it is not a correctness issue, b) of the performance hit, and c) a workaround exists: a user who sees an exception because of temporary overflow can cast the column to a decimal. [4]
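   The workaround in (c) can be sketched in plain Scala (illustrative only; `WidenedSum`, `sumAsLong`, and `sumExact` are hypothetical names, and BigInt stands in for the decimal a user query would cast to):
   
   ```scala
   // Standalone sketch of the workaround: widen before summing so that
   // temporary overflows of the long buffer cannot occur.
   object WidenedSum {
     // Summing as Long: a checked partial sum could hit a temporary
     // overflow even when the final result fits in a Long.
     def sumAsLong(values: Seq[Long]): Long = values.foldLeft(0L)(_ + _)
   
     // Widening each value first (as a CAST to decimal would in SQL)
     // keeps every partial sum exact.
     def sumExact(values: Seq[Long]): BigInt =
       values.foldLeft(BigInt(0))(_ + _)
   
     def main(args: Array[String]): Unit = {
       val values = Seq(Long.MaxValue, 10L, -100L)
       println(sumExact(values)) // exact: Long.MaxValue - 90
     }
   }
   ```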
   
   [1] [SPARK-26218](https://issues.apache.org/jira/browse/SPARK-26218) Overflow on arithmetic operations returns incorrect result
   [2] http://sqlfiddle.com/#!18/e7ecc/1
   [3] [SPARK-23179](https://issues.apache.org/jira/browse/SPARK-23179) Support option to throw exception if overflow occurs during Decimal arithmetic
   [4] https://github.com/apache/spark/pull/25347#issuecomment-525763443
   
   Thanks for your comments. 



[GitHub] [spark] AmplabJenkins commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#issuecomment-587922122
 
 
   Can one of the admins verify this patch?



[GitHub] [spark] AmplabJenkins commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#issuecomment-587923314
 
 
   Can one of the admins verify this patch?



[GitHub] [spark] mgaido91 commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

Posted by GitBox <gi...@apache.org>.
mgaido91 commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#issuecomment-589947439
 
 
   Well, in this PR you are changing the logical plan. It is weird that the two execution modes return different results and that we have to fix the plan because of it.





[GitHub] [spark] AmplabJenkins removed a comment on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

Posted by GitBox <gi...@apache.org>.
AmplabJenkins removed a comment on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#issuecomment-587922122
 
 
   Can one of the admins verify this patch?



[GitHub] [spark] skambha commented on a change in pull request #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

Posted by GitBox <gi...@apache.org>.
skambha commented on a change in pull request #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#discussion_r380968792
 
 

 ##########
 File path: sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
 ##########
 @@ -169,18 +169,14 @@ class DataFrameSuite extends QueryTest
       DecimalData(BigDecimal("1"* 20 + ".123"), BigDecimal("1"* 20 + ".123")) ::
         DecimalData(BigDecimal("9"* 20 + ".123"), BigDecimal("9"* 20 + ".123")) :: Nil).toDF()
 
-    Seq(true, false).foreach { ansiEnabled =>
-      withSQLConf((SQLConf.ANSI_ENABLED.key, ansiEnabled.toString)) {
+    Seq("true", "false").foreach { codegenEnabled =>
+      withSQLConf((SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, codegenEnabled)) {
         val structDf = largeDecimals.select("a").agg(sum("a"))
-        if (!ansiEnabled) {
 
 Review comment:
   SPARK-28224 took care of decimal overflow for sum only partially, for 2 values. In the test case that was added as part of SPARK-28224, if you add another row to the dataset, you get incorrect results rather than null on overflow.
   
   In this PR we address decimal overflow in aggregate sum by throwing an exception, hence this test has been modified.



[GitHub] [spark] HyukjinKwon commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

Posted by GitBox <gi...@apache.org>.
HyukjinKwon commented on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception
URL: https://github.com/apache/spark/pull/27629#issuecomment-588096053
 
 
   cc @mgaido91 
