Posted to reviews@spark.apache.org by "cloud-fan (via GitHub)" <gi...@apache.org> on 2023/11/13 08:31:54 UTC

[PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

cloud-fan opened a new pull request, #43781:
URL: https://github.com/apache/spark/pull/43781

   
   ### What changes were proposed in this pull request?
   This is a follow-up to https://github.com/apache/spark/pull/20023.
   
   Simply capping the decimal precision at 38 when a wider decimal type exceeds the maximum precision is wrong: it drops integral digits and makes the decimal value very likely to overflow.
   
   In https://github.com/apache/spark/pull/20023, we fixed this issue for arithmetic operations, but many other operations suffer from the same issue: Union, binary comparison, IN subquery, create_array, coalesce, etc.
   
   This PR fixes all the remaining operators, without the minimum-scale limitation, which according to the SQL Server doc should apply only to division and multiplication: https://learn.microsoft.com/en-us/sql/t-sql/data-types/precision-scale-and-length-transact-sql?view=sql-server-ver15
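   
   As a rough standalone sketch of the two truncation strategies (plain Scala with precision/scale pairs modeled as tuples; the real logic lives in `DecimalType.bounded` and the new `DecimalType.boundedPreferIntegralDigits`, and the new-strategy formula here is inferred from this PR's docs example):
   
   ```scala
   val MaxPrecision = 38
   
   // Old strategy: cap precision at 38 but keep the scale, which silently
   // drops integral digits once precision - scale falls below what is needed.
   def bounded(precision: Int, scale: Int): (Int, Int) =
     (math.min(precision, MaxPrecision), math.min(scale, MaxPrecision))
   
   // New strategy: keep all integral digits and shrink the scale instead.
   def boundedPreferIntegralDigits(precision: Int, scale: Int): (Int, Int) =
     if (precision <= MaxPrecision) (precision, scale)
     else {
       val integralDigits = precision - scale
       (MaxPrecision, math.max(MaxPrecision - integralDigits, 0))
     }
   
   // decimal(48, 20) needs 28 integral digits:
   println(bounded(48, 20))                     // (38,20): only 18 integral digits left
   println(boundedPreferIntegralDigits(48, 20)) // (38,10): all 28 integral digits kept
   ```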
   
   ### Why are the changes needed?
   To produce a reasonable wider decimal type.
   
   ### Does this PR introduce _any_ user-facing change?
   Yes. The result data type of these operators changes when it is a decimal type whose precision exceeds the maximum and whose scale is not 0.
   
   ### How was this patch tested?
   Updated existing tests.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No




Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43781:
URL: https://github.com/apache/spark/pull/43781#issuecomment-1811881698

   > Is there a reproducer that can be added as a unit test to show the issue in an e2e example?
   
   I think the updated tests show the problem.
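   
   For illustration only, a hypothetical end-to-end shape of the problem (not one of this PR's actual tests):
   
   ```scala
   import org.apache.spark.sql.SparkSession
   
   val spark = SparkSession.builder().master("local[1]").getOrCreate()
   
   // The common type of decimal(38, 0) and decimal(38, 38) used to be
   // decimal(38, 38), which keeps zero integral digits, so the first branch
   // overflowed (NULL, or an error under ANSI mode). Retaining integral
   // digits first yields a type that can hold the value.
   spark.sql(
     """SELECT cast(99999999999999999999 AS decimal(38, 0)) AS v
       |UNION ALL
       |SELECT cast(0.5 AS decimal(38, 38))""".stripMargin).printSchema()
   ```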




Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1390832585


##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -240,6 +240,25 @@ The least common type resolution is used to:
 - Derive the result type for expressions such as the case expression.
 - Derive the element, key, or value types for array and map constructors.
 Special rules are applied if the least common type resolves to FLOAT. With float type values, if any of the types is INT, BIGINT, or DECIMAL the least common type is pushed to DOUBLE to avoid potential loss of digits.
+
+Decimal type is a bit more complicated here, as it's not a simple type but has parameters: precision and scale.
+A `decimal(precision, scale)` means the value can has at most `precision - scale` digits in the integral part and `scale` digits in the fractional part.
+A least common type between decimal types should have enough digits in both integral and fractional parts to represent all values.
+More precisely, a least common type between `decimal(p1, s1)` and `decimal(p2, s2)` has the scale of `max(s1, s2)` and precision of `max(s1, s2) + max(p1 - s1, p2 - s2)`.
+However, decimal types in Spark has a maximum precision: 38. If the final decimal type needs more precision, we must do truncation.
+Since the digits in the integral part are more significant, Spark truncates the digits in the fractional part first. For example, `decimal(48, 20)` will be reduced to `decimal(38, 10)`.
+
+Note, arithmetic operations have special rules to calculate the least common type for decimal inputs:
+
+| Operation  | Result precision                         | Result scale        |
+|------------|------------------------------------------|---------------------|
+| e1 + e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1  | max(s1, s2)         |
+| e1 - e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1	| max(s1, s2)         |
+| e1 * e2    | p1 + p2 + 1	                         | s1 + s2             |

Review Comment:
   ```suggestion
   | e1 * e2    | p1 + p2 + 1	                        | s1 + s2             |
   ```





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393379712


##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -240,6 +240,25 @@ The least common type resolution is used to:
 - Derive the result type for expressions such as the case expression.
 - Derive the element, key, or value types for array and map constructors.
 Special rules are applied if the least common type resolves to FLOAT. With float type values, if any of the types is INT, BIGINT, or DECIMAL the least common type is pushed to DOUBLE to avoid potential loss of digits.
+
+Decimal type is a bit more complicated here, as it's not a simple type but has parameters: precision and scale.
+A `decimal(precision, scale)` means the value can has at most `precision - scale` digits in the integral part and `scale` digits in the fractional part.

Review Comment:
   ```suggestion
   A `decimal(precision, scale)` means the value can have at most `precision - scale` digits in the integral part and `scale` digits in the fractional part.
   ```





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393380697


##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -240,6 +240,25 @@ The least common type resolution is used to:
 - Derive the result type for expressions such as the case expression.
 - Derive the element, key, or value types for array and map constructors.
 Special rules are applied if the least common type resolves to FLOAT. With float type values, if any of the types is INT, BIGINT, or DECIMAL the least common type is pushed to DOUBLE to avoid potential loss of digits.
+
+Decimal type is a bit more complicated here, as it's not a simple type but has parameters: precision and scale.
+A `decimal(precision, scale)` means the value can has at most `precision - scale` digits in the integral part and `scale` digits in the fractional part.
+A least common type between decimal types should have enough digits in both integral and fractional parts to represent all values.
+More precisely, a least common type between `decimal(p1, s1)` and `decimal(p2, s2)` has the scale of `max(s1, s2)` and precision of `max(s1, s2) + max(p1 - s1, p2 - s2)`.
+However, decimal types in Spark has a maximum precision: 38. If the final decimal type needs more precision, we must do truncation.

Review Comment:
   ```suggestion
   However, decimal types in Spark have a maximum precision: 38. If the final decimal type needs more precision, we must do truncation.
   ```





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43781:
URL: https://github.com/apache/spark/pull/43781#issuecomment-1809470848

   cc @gengliangwang @viirya @wangyum 




Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "ryan-johnson-databricks (via GitHub)" <gi...@apache.org>.
ryan-johnson-databricks commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1391756323


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -5425,7 +5434,7 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
   }
 
   def legacyRaiseErrorWithoutErrorClass: Boolean =
-      getConf(SQLConf.LEGACY_RAISE_ERROR_WITHOUT_ERROR_CLASS)
+    getConf(SQLConf.LEGACY_RAISE_ERROR_WITHOUT_ERROR_CLASS)

Review Comment:
   noise? (whitespace changes best made in a non-bugfix PR?)





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1390832381


##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -240,6 +240,25 @@ The least common type resolution is used to:
 - Derive the result type for expressions such as the case expression.
 - Derive the element, key, or value types for array and map constructors.
 Special rules are applied if the least common type resolves to FLOAT. With float type values, if any of the types is INT, BIGINT, or DECIMAL the least common type is pushed to DOUBLE to avoid potential loss of digits.
+
+Decimal type is a bit more complicated here, as it's not a simple type but has parameters: precision and scale.
+A `decimal(precision, scale)` means the value can has at most `precision - scale` digits in the integral part and `scale` digits in the fractional part.
+A least common type between decimal types should have enough digits in both integral and fractional parts to represent all values.
+More precisely, a least common type between `decimal(p1, s1)` and `decimal(p2, s2)` has the scale of `max(s1, s2)` and precision of `max(s1, s2) + max(p1 - s1, p2 - s2)`.
+However, decimal types in Spark has a maximum precision: 38. If the final decimal type needs more precision, we must do truncation.
+Since the digits in the integral part are more significant, Spark truncates the digits in the fractional part first. For example, `decimal(48, 20)` will be reduced to `decimal(38, 10)`.
+
+Note, arithmetic operations have special rules to calculate the least common type for decimal inputs:
+
+| Operation  | Result precision                         | Result scale        |
+|------------|------------------------------------------|---------------------|
+| e1 + e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1  | max(s1, s2)         |
+| e1 - e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1	| max(s1, s2)         |
+| e1 * e2    | p1 + p2 + 1	                            | s1 + s2             |

Review Comment:
   ```suggestion
   | e1 * e2    | p1 + p2 + 1	                         | s1 + s2             |
   ```





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393331222


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala:
##########
@@ -64,7 +65,11 @@ object DecimalPrecision extends TypeCoercionRule {
   def widerDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
     val scale = max(s1, s2)
     val range = max(p1 - s1, p2 - s2)
-    DecimalType.bounded(range + scale, scale)
+    if (conf.getConf(SQLConf.LEGACY_RETAIN_FRACTION_DIGITS_FIRST)) {
+      DecimalType.bounded(range + scale, scale)

Review Comment:
   There are many usages of `DecimalType.bounded`.
   Why do we only change the behavior here?








Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393576370


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala:
##########
@@ -64,7 +65,11 @@ object DecimalPrecision extends TypeCoercionRule {
   def widerDecimalType(p1: Int, s1: Int, p2: Int, s2: Int): DecimalType = {
     val scale = max(s1, s2)
     val range = max(p1 - s1, p2 - s2)
-    DecimalType.bounded(range + scale, scale)
+    if (conf.getConf(SQLConf.LEGACY_RETAIN_FRACTION_DIGITS_FIRST)) {
+      DecimalType.bounded(range + scale, scale)

Review Comment:
   To limit the scope to type coercion only. Some arithmetic operations also call it to determine the result decimal type and I don't want to change that part.





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1391930638


##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -240,6 +240,25 @@ The least common type resolution is used to:
 - Derive the result type for expressions such as the case expression.
 - Derive the element, key, or value types for array and map constructors.
 Special rules are applied if the least common type resolves to FLOAT. With float type values, if any of the types is INT, BIGINT, or DECIMAL the least common type is pushed to DOUBLE to avoid potential loss of digits.
+
+Decimal type is a bit more complicated here, as it's not a simple type but has parameters: precision and scale.
+A `decimal(precision, scale)` means the value can has at most `precision - scale` digits in the integral part and `scale` digits in the fractional part.
+A least common type between decimal types should have enough digits in both integral and fractional parts to represent all values.
+More precisely, a least common type between `decimal(p1, s1)` and `decimal(p2, s2)` has the scale of `max(s1, s2)` and precision of `max(s1, s2) + max(p1 - s1, p2 - s2)`.
+However, decimal types in Spark has a maximum precision: 38. If the final decimal type needs more precision, we must do truncation.
+Since the digits in the integral part are more significant, Spark truncates the digits in the fractional part first. For example, `decimal(48, 20)` will be reduced to `decimal(38, 10)`.
+
+Note, arithmetic operations have special rules to calculate the least common type for decimal inputs:
+
+| Operation  | Result precision                         | Result scale        |
+|------------|------------------------------------------|---------------------|
+| e1 + e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1  | max(s1, s2)         |
+| e1 - e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1	| max(s1, s2)         |
+| e1 * e2    | p1 + p2 + 1	                        | s1 + s2             |
+| e1 / e2    | p1 - s1 + s2 + max(6, s1 + p2 + 1)       | max(6, s1 + p2 + 1) |
+| e1 % e2    | min(p1 - s1, p2 - s2) + max(s1, s2)      | max(s1, s2)         |

Review Comment:
   Which one does not follow? The final decimal type can differ because there is one more truncation step.





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393703751


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4541,6 +4541,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_RETAIN_FRACTION_DIGITS_FIRST =
+    buildConf("spark.sql.legacy.decimal.retainFractionDigitsOnTruncate")
+      .internal()
+      .doc("When set to true, we will try to retain the fraction digits first rather than " +
+        "integral digits as prior Spark 4.0, when getting a least common type between decimal " +
+        "types, and the result decimal precision exceeds the max precision.")

Review Comment:
   Please add the version, @cloud-fan. IIUC, this targets only Apache Spark 4.0.0?





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1391930958


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -5425,7 +5434,7 @@ class SQLConf extends Serializable with Logging with SqlApiConf {
   }
 
   def legacyRaiseErrorWithoutErrorClass: Boolean =
-      getConf(SQLConf.LEGACY_RAISE_ERROR_WITHOUT_ERROR_CLASS)
+    getConf(SQLConf.LEGACY_RAISE_ERROR_WITHOUT_ERROR_CLASS)

Review Comment:
   Since I touched this file, I just fixed the wrong indentation.





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393383832


##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -240,6 +240,25 @@ The least common type resolution is used to:
 - Derive the result type for expressions such as the case expression.
 - Derive the element, key, or value types for array and map constructors.
 Special rules are applied if the least common type resolves to FLOAT. With float type values, if any of the types is INT, BIGINT, or DECIMAL the least common type is pushed to DOUBLE to avoid potential loss of digits.
+
+Decimal type is a bit more complicated here, as it's not a simple type but has parameters: precision and scale.
+A `decimal(precision, scale)` means the value can has at most `precision - scale` digits in the integral part and `scale` digits in the fractional part.
+A least common type between decimal types should have enough digits in both integral and fractional parts to represent all values.
+More precisely, a least common type between `decimal(p1, s1)` and `decimal(p2, s2)` has the scale of `max(s1, s2)` and precision of `max(s1, s2) + max(p1 - s1, p2 - s2)`.
+However, decimal types in Spark has a maximum precision: 38. If the final decimal type needs more precision, we must do truncation.
+Since the digits in the integral part are more significant, Spark truncates the digits in the fractional part first. For example, `decimal(48, 20)` will be reduced to `decimal(38, 10)`.
+
+Note, arithmetic operations have special rules to calculate the least common type for decimal inputs:
+
+| Operation  | Result precision                         | Result scale        |
+|------------|------------------------------------------|---------------------|
+| e1 + e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1  | max(s1, s2)         |
+| e1 - e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1	| max(s1, s2)         |
+| e1 * e2    | p1 + p2 + 1	                        | s1 + s2             |
+| e1 / e2    | p1 - s1 + s2 + max(6, s1 + p2 + 1)       | max(6, s1 + p2 + 1) |
+| e1 % e2    | min(p1 - s1, p2 - s2) + max(s1, s2)      | max(s1, s2)         |
+
+The truncation rule is also different for arithmetic operations: they retain at least 6 digits in the fractional part, which means we can only reduce `scale` to 6.

Review Comment:
   Should we mention what happens if we cannot truncate the fractional part enough to fit into the maximum precision? Overflow?





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "viirya (via GitHub)" <gi...@apache.org>.
viirya commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393385818


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4541,6 +4541,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_RETAIN_FRACTION_DIGITS_FIRST =
+    buildConf("spark.sql.legacy.decimalLeastCommonType.retainFractionDigitsFirst")
+      .internal()
+      .doc("When set to true, we will try to retain the fraction digits first rather than " +
+        "integral digits, when getting a least common type between decimal types, and the " +
+        "result decimal precision exceeds the max precision.")

Review Comment:
   ```suggestion
         .doc("When set to true, we will try to retain the fraction digits first rather than " +
           "integral digits as prior Spark 4.0, when getting a least common type between decimal types, and the " +
           "result decimal precision exceeds the max precision.")
   ```





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1392065605


##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -240,6 +240,25 @@ The least common type resolution is used to:
 - Derive the result type for expressions such as the case expression.
 - Derive the element, key, or value types for array and map constructors.
 Special rules are applied if the least common type resolves to FLOAT. With float type values, if any of the types is INT, BIGINT, or DECIMAL the least common type is pushed to DOUBLE to avoid potential loss of digits.
+
+Decimal type is a bit more complicated here, as it's not a simple type but has parameters: precision and scale.
+A `decimal(precision, scale)` means the value can has at most `precision - scale` digits in the integral part and `scale` digits in the fractional part.
+A least common type between decimal types should have enough digits in both integral and fractional parts to represent all values.
+More precisely, a least common type between `decimal(p1, s1)` and `decimal(p2, s2)` has the scale of `max(s1, s2)` and precision of `max(s1, s2) + max(p1 - s1, p2 - s2)`.
+However, decimal types in Spark has a maximum precision: 38. If the final decimal type needs more precision, we must do truncation.
+Since the digits in the integral part are more significant, Spark truncates the digits in the fractional part first. For example, `decimal(48, 20)` will be reduced to `decimal(38, 10)`.
+
+Note, arithmetic operations have special rules to calculate the least common type for decimal inputs:
+
+| Operation  | Result precision                         | Result scale        |
+|------------|------------------------------------------|---------------------|
+| e1 + e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1  | max(s1, s2)         |
+| e1 - e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1	| max(s1, s2)         |
+| e1 * e2    | p1 + p2 + 1	                        | s1 + s2             |
+| e1 / e2    | p1 - s1 + s2 + max(6, s1 + p2 + 1)       | max(6, s1 + p2 + 1) |
+| e1 % e2    | min(p1 - s1, p2 - s2) + max(s1, s2)      | max(s1, s2)         |

Review Comment:
   This is not the Spark SQL multiplication rule. Please take a look at `Multiply#resultDecimalType`.





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393704651


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4541,6 +4541,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_RETAIN_FRACTION_DIGITS_FIRST =
+    buildConf("spark.sql.legacy.decimal.retainFractionDigitsOnTruncate")
+      .internal()
+      .doc("When set to true, we will try to retain the fraction digits first rather than " +
+        "integral digits as prior Spark 4.0, when getting a least common type between decimal " +
+        "types, and the result decimal precision exceeds the max precision.")

Review Comment:
   Oh sorry, I forgot it. Yes, it targets 4.0.0.
   





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan closed pull request #43781: [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first
URL: https://github.com/apache/spark/pull/43781






Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393704857


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4541,6 +4541,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_RETAIN_FRACTION_DIGITS_FIRST =
+    buildConf("spark.sql.legacy.decimal.retainFractionDigitsOnTruncate")
+      .internal()
+      .doc("When set to true, we will try to retain the fraction digits first rather than " +
+        "integral digits as prior Spark 4.0, when getting a least common type between decimal " +
+        "types, and the result decimal precision exceeds the max precision.")
+      .booleanConf

Review Comment:
   ```suggestion
         .version("4.0.0")
         .booleanConf
   ```





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "dongjoon-hyun (via GitHub)" <gi...@apache.org>.
dongjoon-hyun commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393707711


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4541,6 +4541,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_RETAIN_FRACTION_DIGITS_FIRST =
+    buildConf("spark.sql.legacy.decimal.retainFractionDigitsOnTruncate")
+      .internal()
+      .doc("When set to true, we will try to retain the fraction digits first rather than " +
+        "integral digits as prior Spark 4.0, when getting a least common type between decimal " +
+        "types, and the result decimal precision exceeds the max precision.")

Review Comment:
   Thank you!





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "gengliangwang (via GitHub)" <gi...@apache.org>.
gengliangwang commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1393339460


##########
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##########
@@ -4541,6 +4541,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_RETAIN_FRACTION_DIGITS_FIRST =
+    buildConf("spark.sql.legacy.decimalLeastCommonType.retainFractionDigitsFirst")

Review Comment:
   TBH the conf is a bit long.
   How about `spark.sql.legacy.decimal.retainFractionDigitsOnTruncate`? But the naming really depends on the scope of the behavior change, as I asked in https://github.com/apache/spark/pull/43781/files#r1393331222
   







Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1392060678


##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -240,6 +240,25 @@ The least common type resolution is used to:
 - Derive the result type for expressions such as the case expression.
 - Derive the element, key, or value types for array and map constructors.
 Special rules are applied if the least common type resolves to FLOAT. With float type values, if any of the types is INT, BIGINT, or DECIMAL the least common type is pushed to DOUBLE to avoid potential loss of digits.
+
+Decimal type is a bit more complicated here, as it's not a simple type but has parameters: precision and scale.
+A `decimal(precision, scale)` means the value can has at most `precision - scale` digits in the integral part and `scale` digits in the fractional part.
+A least common type between decimal types should have enough digits in both integral and fractional parts to represent all values.
+More precisely, a least common type between `decimal(p1, s1)` and `decimal(p2, s2)` has the scale of `max(s1, s2)` and precision of `max(s1, s2) + max(p1 - s1, p2 - s2)`.
+However, decimal types in Spark has a maximum precision: 38. If the final decimal type needs more precision, we must do truncation.
+Since the digits in the integral part are more significant, Spark truncates the digits in the fractional part first. For example, `decimal(48, 20)` will be reduced to `decimal(38, 10)`.
+
+Note, arithmetic operations have special rules to calculate the least common type for decimal inputs:
+
+| Operation  | Result precision                         | Result scale        |
+|------------|------------------------------------------|---------------------|
+| e1 + e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1  | max(s1, s2)         |
+| e1 - e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1	| max(s1, s2)         |
+| e1 * e2    | p1 + p2 + 1	                        | s1 + s2             |
+| e1 / e2    | p1 - s1 + s2 + max(6, s1 + p2 + 1)       | max(6, s1 + p2 + 1) |
+| e1 % e2    | min(p1 - s1, p2 - s2) + max(s1, s2)      | max(s1, s2)         |

Review Comment:
   `*` AND `/`.
   For example:
   ```
   val a = Decimal(100) // p: 10, s: 0
   val b = Decimal(-100) // p: 10, s: 0
   val c = a * b  // Decimal(-10000) p: 5, s: 0
   ```





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "beliefer (via GitHub)" <gi...@apache.org>.
beliefer commented on code in PR #43781:
URL: https://github.com/apache/spark/pull/43781#discussion_r1391880151


##########
docs/sql-ref-ansi-compliance.md:
##########
@@ -240,6 +240,25 @@ The least common type resolution is used to:
 - Derive the result type for expressions such as the case expression.
 - Derive the element, key, or value types for array and map constructors.
 Special rules are applied if the least common type resolves to FLOAT. With float type values, if any of the types is INT, BIGINT, or DECIMAL the least common type is pushed to DOUBLE to avoid potential loss of digits.
+
+Decimal type is a bit more complicated here, as it's not a simple type but has parameters: precision and scale.
+A `decimal(precision, scale)` means the value can has at most `precision - scale` digits in the integral part and `scale` digits in the fractional part.
+A least common type between decimal types should have enough digits in both integral and fractional parts to represent all values.
+More precisely, a least common type between `decimal(p1, s1)` and `decimal(p2, s2)` has the scale of `max(s1, s2)` and precision of `max(s1, s2) + max(p1 - s1, p2 - s2)`.
+However, decimal types in Spark has a maximum precision: 38. If the final decimal type needs more precision, we must do truncation.
+Since the digits in the integral part are more significant, Spark truncates the digits in the fractional part first. For example, `decimal(48, 20)` will be reduced to `decimal(38, 10)`.
+
+Note, arithmetic operations have special rules to calculate the least common type for decimal inputs:
+
+| Operation  | Result precision                         | Result scale        |
+|------------|------------------------------------------|---------------------|
+| e1 + e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1  | max(s1, s2)         |
+| e1 - e2    | max(s1, s2) + max(p1 - s1, p2 - s2) + 1	| max(s1, s2)         |
+| e1 * e2    | p1 + p2 + 1	                        | s1 + s2             |
+| e1 / e2    | p1 - s1 + s2 + max(6, s1 + p2 + 1)       | max(6, s1 + p2 + 1) |
+| e1 % e2    | min(p1 - s1, p2 - s2) + max(s1, s2)      | max(s1, s2)         |

Review Comment:
   AFAIK, the arithmetic operations do not strictly follow this rule.



##########
sql/api/src/main/scala/org/apache/spark/sql/types/DecimalType.scala:
##########
@@ -146,6 +146,18 @@ object DecimalType extends AbstractDataType {
     DecimalType(min(precision, MAX_PRECISION), min(scale, MAX_SCALE))
   }
 
+  private[sql] def boundedPreferIntegralDigits(precision: Int, scale: Int): DecimalType = {
+    if (precision <= MAX_PRECISION) {
+      DecimalType(precision, scale)
+    } else {
+      // If we have to reduce the precision, we should retain the digits in the integral part first,
+      // as they are more significant to the value. Here we reduce the scale as well to drop the
+      // digits in the fractional part.

Review Comment:
   Looks good.





Re: [PR] [SPARK-45905][SQL] Least common type between decimal types should retain integral digits first [spark]

Posted by "cloud-fan (via GitHub)" <gi...@apache.org>.
cloud-fan commented on PR #43781:
URL: https://github.com/apache/spark/pull/43781#issuecomment-1812465509

   The failure is unrelated:
   ```
   Extension error:
   Could not import extension sphinx_copybutton (exception: No module named 'sphinx_copybutton')
   make: *** [Makefile:35: html] Error 2
   ```
   
   I'm merging it to master, thanks!

