Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/30 16:35:37 UTC
[GitHub] [spark] SelfImpr001 opened a new pull request, #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
SelfImpr001 opened a new pull request, #37732:
URL: https://github.com/apache/spark/pull/37732
### What changes were proposed in this pull request?
Fixes a problem where, with Spark 2.4.3 relying on Hive 1.2.1, writing the specific value 0.00 causes subsequent queries on the table to fail.
### Why are the changes needed?
Writing 0.00 loses the precision of the value, so the stored data is inaccurate, and querying it throws an exception that blocks the program.
### Does this PR introduce _any_ user-facing change?
The following statements reproduce the problem:
create table testgg as select 0.00 as gg;
select * from testgg;
The query fails with:
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 limit: 0
How the abnormal data appears on HDFS
Inspecting the ORC file on HDFS shows that, in the stripe data of the abnormally written file, the DATA and SECONDARY streams of column 1 have length 0:
```
Rows: 1
Compression: SNAPPY
Compression size: 262144
Type: struct<gg:decimal(2,2)>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 0 hasNull: false
    Column 1: count: 0 hasNull: true

File Statistics:
  Column 0: count: 0 hasNull: false
  Column 1: count: 0 hasNull: true

Stripes:
  Stripe: offset: 3 data: 5 rows: 1 tail: 64 index: 35
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 24
    Stream: column 1 section PRESENT start: 38 length 5
    Stream: column 1 section DATA start: 43 length 0
    Stream: column 1 section SECONDARY start: 43 length 0
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 213 bytes
Padding length: 0 bytes
Padding ratio: 0%
```
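The Type line in the dump above reports decimal(2,2) for the literal 0.00. As a rough illustration of where such a shape can come from, the sketch below prints what `java.math.BigDecimal` (the class the diff in this PR refers to as `JavaBigDecimal`) reports for a few literals, together with the `Math.max(d.precision, d.scale)` expression that appears in the diff. The class `DecimalShape` and its method names are hypothetical and exist only for this example. The sketch shows JDK behavior only; how Spark 2.4 and Hive 1.2.1 actually derive and write the column type is not shown in this thread.

```java
import java.math.BigDecimal;

// Hypothetical helper class, for illustration only; it is not part of Spark.
final class DecimalShape {
    // Number of digits in the unscaled value; the JDK defines the precision of zero as 1.
    static int precisionOf(String literal) {
        return new BigDecimal(literal).precision();
    }

    // Number of digits to the right of the decimal point.
    static int scaleOf(String literal) {
        return new BigDecimal(literal).scale();
    }

    // Mirrors the expression Math.max(d.precision, d.scale) from the diff.
    static int widenedPrecision(String literal) {
        BigDecimal d = new BigDecimal(literal);
        return Math.max(d.precision(), d.scale());
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0.00", "0.01", "1.23"}) {
            System.out.println(s
                + " precision=" + precisionOf(s)
                + " scale=" + scaleOf(s)
                + " max(precision, scale)=" + widenedPrecision(s));
        }
    }
}
```

For 0.00 the JDK reports precision 1 and scale 2, so the precision is smaller than the scale, and taking the maximum gives 2, which matches the (2,2) shape seen in the dump.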
### How was this patch tested?
```
spark-sql (default)>
>
> create table testgg as select 0.00 as gg;select * from testgg;
2022-08-30 17:13:04,014 INFO (main) [Logging.scala:logInfo(54)] - Parsing command: create table testgg as select 0.00 as gg
2022-08-30 17:13:08,662 INFO (main) [Logging.scala:logInfo(54)] - Parsing command: `default`.`testgg`
22/08/30 17:13:08 INFO SparkSqlParser: Parsing command: `default`.`testgg`
22/08/30 17:13:08 INFO CatalystSqlParser: Parsing command: decimal(3,2)
22/08/30 17:13:09 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool
22/08/30 17:13:09 INFO DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:376) finished in 0.408 s
22/08/30 17:13:09 INFO DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 0.413011 s
gg
0
Time taken: 0.767 seconds, Fetched 1 row(s)
22/08/30 17:13:09 INFO SparkSQLCLIDriver: Time taken: 0.767 seconds, Fetched 1 row(s)
```
After the fix, the query returns the correct result. The ORC file on HDFS now contains the following:
```
Rows: 1
Compression: SNAPPY
Compression size: 262144
Type: struct<gg:decimal(3,2)>

Stripe Statistics:
  Stripe 1:
    Column 0: count: 0 hasNull: false
    Column 1: count: 1 hasNull: false min: 1 max: 1 sum: 1

File Statistics:
  Column 0: count: 0 hasNull: false
  Column 1: count: 1 hasNull: false min: 1 max: 1 sum: 1

Stripes:
  Stripe: offset: 3 data: 10 rows: 1 tail: 58 index: 40
    Stream: column 0 section ROW_INDEX start: 3 length 11
    Stream: column 1 section ROW_INDEX start: 14 length 29
    Stream: column 1 section DATA start: 43 length 4
    Stream: column 1 section SECONDARY start: 47 length 6
    Encoding column 0: DIRECT
    Encoding column 1: DIRECT_V2

File length: 229 bytes
Padding length: 0 bytes
Padding ratio: 0%
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] [spark] wangyum commented on a diff in pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
Posted by GitBox <gi...@apache.org>.
wangyum commented on code in PR #37732:
URL: https://github.com/apache/spark/pull/37732#discussion_r962452438
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -76,8 +76,19 @@ object Literal {
val decimal = Decimal(d)
Literal(decimal, DecimalType.fromDecimal(decimal))
case d: JavaBigDecimal =>
- val decimal = Decimal(d)
- Literal(decimal, DecimalType.fromDecimal(decimal))
+ Literal(Decimal(d), DecimalType(Math.max(d.precision, d.scale), d.scale()))
+ if (d.abs().compareTo(new JavaBigDecimal("0")) == 0) {
Review Comment:
@SelfImpr001 Please upgrade your Spark version to the latest Spark version. 2.4.8 is the last release and no more 2.4.x releases should be expected even for bug fixes. Please see: https://spark.apache.org/versioning-policy.html
[GitHub] [spark] SelfImpr001 commented on a diff in pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
Posted by GitBox <gi...@apache.org>.
SelfImpr001 commented on code in PR #37732:
URL: https://github.com/apache/spark/pull/37732#discussion_r962252317
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -76,8 +76,19 @@ object Literal {
val decimal = Decimal(d)
Literal(decimal, DecimalType.fromDecimal(decimal))
case d: JavaBigDecimal =>
- val decimal = Decimal(d)
- Literal(decimal, DecimalType.fromDecimal(decimal))
+ Literal(Decimal(d), DecimalType(Math.max(d.precision, d.scale), d.scale()))
+ if (d.abs().compareTo(new JavaBigDecimal("0")) == 0) {
Review Comment:
Yes, the problem only occurs with Spark 2.4 + Hive 1.x, but many people are still using 2.4, so I recommend providing a patch.
[GitHub] [spark] AmplabJenkins commented on pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
Posted by GitBox <gi...@apache.org>.
AmplabJenkins commented on PR #37732:
URL: https://github.com/apache/spark/pull/37732#issuecomment-1233284400
Can one of the admins verify this patch?
[GitHub] [spark] srowen commented on a diff in pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #37732:
URL: https://github.com/apache/spark/pull/37732#discussion_r959815532
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -76,8 +76,19 @@ object Literal {
val decimal = Decimal(d)
Literal(decimal, DecimalType.fromDecimal(decimal))
case d: JavaBigDecimal =>
- val decimal = Decimal(d)
- Literal(decimal, DecimalType.fromDecimal(decimal))
+ Literal(Decimal(d), DecimalType(Math.max(d.precision, d.scale), d.scale()))
+ if (d.abs().compareTo(new JavaBigDecimal("0")) == 0) {
Review Comment:
Use JavaBigDecimal.ZERO here.
But wait, is this only a problem for Spark 2.4 + Hive 1.x? That's EOL now.
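As a minimal sketch of the `JavaBigDecimal.ZERO` suggestion, the plain-Java example below compares the zero check as written in the diff with one that uses the shared `BigDecimal.ZERO` constant. The class `ZeroCheck` and its method names are hypothetical and exist only for this example. It also shows why `compareTo` rather than `equals` is the appropriate call: `equals` takes the scale into account, so 0.00 is not equal to the constant, whose scale is 0.

```java
import java.math.BigDecimal;

// Hypothetical helper class, for illustration only; it is not part of Spark.
final class ZeroCheck {
    // The check as written in the diff: allocates a new BigDecimal and calls abs().
    static boolean isZeroAsInDiff(String literal) {
        BigDecimal d = new BigDecimal(literal);
        return d.abs().compareTo(new BigDecimal("0")) == 0;
    }

    // The same numeric check with the shared constant; abs() is not needed to test for zero.
    static boolean isZeroWithConstant(String literal) {
        return new BigDecimal(literal).compareTo(BigDecimal.ZERO) == 0;
    }

    // equals() also compares scale, so it distinguishes 0.00 (scale 2) from ZERO (scale 0).
    static boolean equalsZeroConstant(String literal) {
        return new BigDecimal(literal).equals(BigDecimal.ZERO);
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0.00", "0", "0.01"}) {
            System.out.println(s
                + " asInDiff=" + isZeroAsInDiff(s)
                + " withConstant=" + isZeroWithConstant(s)
                + " equalsZERO=" + equalsZeroConstant(s));
        }
    }
}
```

Both `compareTo` variants agree on every input, so switching to the constant should not change behavior; it only avoids an extra allocation and reads more clearly.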
[GitHub] [spark] srowen commented on a diff in pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #37732:
URL: https://github.com/apache/spark/pull/37732#discussion_r962314494
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -76,8 +76,19 @@ object Literal {
val decimal = Decimal(d)
Literal(decimal, DecimalType.fromDecimal(decimal))
case d: JavaBigDecimal =>
- val decimal = Decimal(d)
- Literal(decimal, DecimalType.fromDecimal(decimal))
+ Literal(Decimal(d), DecimalType(Math.max(d.precision, d.scale), d.scale()))
+ if (d.abs().compareTo(new JavaBigDecimal("0")) == 0) {
Review Comment:
OK, this is targeted at master though, and there will be no more 2.4 releases
[GitHub] [spark] srowen commented on a diff in pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
Posted by GitBox <gi...@apache.org>.
srowen commented on code in PR #37732:
URL: https://github.com/apache/spark/pull/37732#discussion_r967060213
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -76,8 +76,19 @@ object Literal {
val decimal = Decimal(d)
Literal(decimal, DecimalType.fromDecimal(decimal))
case d: JavaBigDecimal =>
- val decimal = Decimal(d)
- Literal(decimal, DecimalType.fromDecimal(decimal))
+ Literal(Decimal(d), DecimalType(Math.max(d.precision, d.scale), d.scale()))
+ if (d.abs().compareTo(new JavaBigDecimal("0")) == 0) {
Review Comment:
We're saying the project no longer makes 2.4.x releases at all, so it wouldn't help anyone on 2.4
[GitHub] [spark] SelfImpr001 commented on a diff in pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
Posted by GitBox <gi...@apache.org>.
SelfImpr001 commented on code in PR #37732:
URL: https://github.com/apache/spark/pull/37732#discussion_r967056734
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -76,8 +76,19 @@ object Literal {
val decimal = Decimal(d)
Literal(decimal, DecimalType.fromDecimal(decimal))
case d: JavaBigDecimal =>
- val decimal = Decimal(d)
- Literal(decimal, DecimalType.fromDecimal(decimal))
+ Literal(Decimal(d), DecimalType(Math.max(d.precision, d.scale), d.scale()))
+ if (d.abs().compareTo(new JavaBigDecimal("0")) == 0) {
Review Comment:
@srowen @wangyum
Thank you for your answers. Many developers still run 2.4.x, and upgrading to the latest version requires extensive verification, so it is difficult to do in a short time. To spare those developers the effects of this bug in the old version, I suggest merging the PR into the 2.4.x branch, which would not affect the master branch. If it cannot be merged because of the version management policy, please close this PR directly.
[GitHub] [spark] wangyum closed pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
Posted by GitBox <gi...@apache.org>.
wangyum closed pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
URL: https://github.com/apache/spark/pull/37732
[GitHub] [spark] SelfImpr001 commented on a diff in pull request #37732: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
Posted by GitBox <gi...@apache.org>.
SelfImpr001 commented on code in PR #37732:
URL: https://github.com/apache/spark/pull/37732#discussion_r969072089
##########
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/literals.scala:
##########
@@ -76,8 +76,19 @@ object Literal {
val decimal = Decimal(d)
Literal(decimal, DecimalType.fromDecimal(decimal))
case d: JavaBigDecimal =>
- val decimal = Decimal(d)
- Literal(decimal, DecimalType.fromDecimal(decimal))
+ Literal(Decimal(d), DecimalType(Math.max(d.precision, d.scale), d.scale()))
+ if (d.abs().compareTo(new JavaBigDecimal("0")) == 0) {
Review Comment:
Ok, thank you for your answers, I will close the PR now