Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/02/21 03:21:43 UTC

[GitHub] [spark] skambha edited a comment on issue #27629: [SPARK-28067][SQL]Fix incorrect results during aggregate sum for decimal overflow by throwing exception

URL: https://github.com/apache/spark/pull/27629#issuecomment-589479357
 
 
   > This PR would introduce regressions. Checking every sum means that temporary overflows would cause an exception. E.g., if you sum MAX_INT, 10, -100, then MAX_INT + 10 would cause the exception. In the current code, this sum is handled properly and returns the correct result, because the temporary overflow is fixed by summing -100. So we would raise exceptions even when not needed. IIRC, other DBs treat this properly, so temporary overflows don't cause exceptions.
   > 
   
   I see what you are saying, but this PR targets only the aggregate sum of the decimal type (i.e., the result type is a decimal), not int or long. The sum of ints is handled the same way as before, so the above-mentioned use case does not regress (see the sketch below). [1]
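   
   To make that concrete, here is a small sketch of the reviewer's example (plain two's-complement Int arithmetic, nothing specific to this PR; the values come from the comment above):
   
   ```scala
   // The reviewer's example: Int.MaxValue, 10, -100.
   // The intermediate sum wraps around, but adding -100 wraps it back,
   // so the final Int result is still correct (Int.MaxValue - 90).
   val values = Seq(Int.MaxValue, 10, -100)
   
   val runningSums = values.scanLeft(0)(_ + _).tail
   // List(2147483647, -2147483639, 2147483557)
   
   val total = values.sum
   // 2147483557 == Int.MaxValue - 90
   ```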
   
   This PR is trying to handle the following use case for the aggregate sum of decimals:
   - The sum of a decimal type overflows and returns wrong results (a sketch reproducing this is below).
   - Note: in the current code (even without this PR), the same sum on a decimal type already throws an exception when whole-stage codegen is disabled.
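   
   A minimal sketch of the overflow, assuming a spark-shell session (the literal value and row count are illustrative, not taken from the JIRA):
   
   ```scala
   import org.apache.spark.sql.functions._
   
   // decimal(38, 0) is the widest decimal Spark supports, so summing a dozen
   // values near its maximum overflows the sum's result type.
   val df = spark.range(0, 12, 1, 1).selectExpr(
     "cast('99999999999999999999999999999999999999' as decimal(38, 0)) as d")
   
   // With whole-stage codegen enabled the overflowed sum may silently come back
   // wrong (e.g. null); with codegen disabled the same query throws. This PR
   // follows the behaviour of the codegen-disabled path.
   df.select(sum("d")).show(false)
   ```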
   
   (Furthermore, even if spark.sql.ansi.enabled is set to true, we do not return null. That conf property is meant to ensure that any overflow returns null.)
   
   Here we are dealing with a correctness issue. This PR's approach is to follow the result returned by the codepath with whole-stage codegen disabled.
   
   Actually, this issue is mentioned in the PR for SPARK-23179 [3] as a special case. SPARK-28224 partially addressed it.
   
   fwiw, I checked this on MS SQL Server and it throws an error as well. [2]
   
   > The proper fix for this would be to use a larger data type for the buffer than the returned one. I remember I had a PR for that (#25347). You can check the comments and history of it.
   
   Sure. I checked #25347; it deals with widening the data type for the aggregate sum of longs to decimal to avoid temporary overflow. The decision was not to make that change because a) it is not a correctness issue, b) of the performance hit, and c) a workaround exists: if the user hits an exception because of a temporary overflow, they can cast the column to a decimal (a sketch of this is below). [4]
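   
   A sketch of that workaround, assuming a spark-shell session (the DataFrame and column name are hypothetical):
   
   ```scala
   import org.apache.spark.sql.functions._
   import spark.implicits._
   
   // Hypothetical DataFrame with a LongType column whose sum would temporarily
   // overflow a long.
   val df = Seq(Long.MaxValue, 10L, -100L).toDF("amount")
   
   // Widening the column to a decimal before aggregating gives the sum a wider
   // buffer, so the temporary long overflow no longer surfaces as an error.
   val total = df.select(sum(col("amount").cast("decimal(38, 0)")).as("total"))
   total.show(false)
   ```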
   
   [1] [SPARK-26218](https://issues.apache.org/jira/browse/SPARK-26218) Overflow on arithmetic operations returns incorrect result
   [2] http://sqlfiddle.com/#!18/e7ecc/1
   [3] [SPARK-23179](https://issues.apache.org/jira/browse/SPARK-23179) Support option to throw exception if overflow occurs during Decimal arithmetic
   [4] https://github.com/apache/spark/pull/25347#issuecomment-525763443
   
   Thanks for your comments. 
