
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/08/30 09:27:52 UTC

[GitHub] [spark] SelfImpr001 opened a new pull request, #37726: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…

SelfImpr001 opened a new pull request, #37726:
URL: https://github.com/apache/spark/pull/37726

   ### What changes were proposed in this pull request?
       Fixes a problem where, with Spark 2.4.3 running against Hive 1.2.1, writing the specific value 0.00 causes later queries on the table to fail.
   
   ### Why are the changes needed?
       Without this fix, the value 0.00 loses its precision when written, so the stored data is inaccurate, and querying it throws an exception that blocks the program.
   
   ### Does this PR introduce _any_ user-facing change?
   ```
     create table testgg as select 0.00 as gg;
     select * from testgg;
   ```
   
   Before this change, the `select` above fails with:

     Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 1 kind SECONDARY position: 0 length: 0 range: 0 offset: 0 limit: 0
   
   How the abnormal data appears on HDFS: the ORC file dump shows that, in the stripe of the abnormally written file, the DATA and SECONDARY streams of column 1 have length 0:
   ```
     Rows: 1
     Compression: SNAPPY
     Compression size: 262144
     Type: struct<gg:decimal(2,2)>
     Stripe Statistics:
       Stripe 1:
         Column 0: count: 0 hasNull: false
         Column 1: count: 0 hasNull: true
     File Statistics:
       Column 0: count: 0 hasNull: false
       Column 1: count: 0 hasNull: true
     Stripes:
       Stripe: offset: 3 data: 5 rows: 1 tail: 64 index: 35
         Stream: column 0 section ROW_INDEX start: 3 length 11
         Stream: column 1 section ROW_INDEX start: 14 length 24
         Stream: column 1 section PRESENT start: 38 length 5
         Stream: column 1 section DATA start: 43 length 0
         Stream: column 1 section SECONDARY start: 43 length 0
         Encoding column 0: DIRECT
         Encoding column 1: DIRECT_V2
     File length: 213 bytes
     Padding length: 0 bytes
     Padding ratio: 0%
   ```
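   The diff itself is not part of this message, so the following is only a sketch of why the literal `0.00` can end up typed as `decimal(2,2)`, as in the dump above. It relies on documented `java.math.BigDecimal` behavior: a zero value has precision 1, whatever its scale. The helper `inferredType` and the rule `max(precision, scale)` are assumptions made for illustration, not code taken from Spark or from this PR.

   ```java
   import java.math.BigDecimal;

   public class DecimalTypeSketch {
       // Hypothetical helper (not Spark's API): derive a decimal(p,s) type name
       // from a plain literal, widening the precision to the scale when the
       // BigDecimal precision is smaller than its scale.
       public static String inferredType(String literal) {
           BigDecimal d = new BigDecimal(literal);
           int precision = Math.max(d.precision(), d.scale());
           return "decimal(" + precision + "," + d.scale() + ")";
       }

       public static void main(String[] args) {
           BigDecimal zero = new BigDecimal("0.00");
           System.out.println(zero.precision());      // prints 1
           System.out.println(zero.scale());          // prints 2
           System.out.println(inferredType("0.00"));  // prints decimal(2,2)
           System.out.println(inferredType("1.25"));  // prints decimal(3,2)
       }
   }
   ```

   Under this assumed rule, `0.00` gets a type with no integer digits, which matches the `decimal(2,2)` column type in the abnormal file.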
   ### How was this patch tested?
   ```
   spark-sql (default)> 
                      > 
                      > create table testgg as select 0.00 as gg;select * from testgg;
   2022-08-30 17:13:04,014 INFO  (main) [Logging.scala:logInfo(54)] - Parsing command: create table testgg as select 0.00 as gg
   2022-08-30 17:13:08,662 INFO  (main) [Logging.scala:logInfo(54)] - Parsing command: `default`.`testgg`
   22/08/30 17:13:08 INFO SparkSqlParser: Parsing command: `default`.`testgg`
   22/08/30 17:13:08 INFO CatalystSqlParser: Parsing command: decimal(3,2)
   22/08/30 17:13:09 INFO YarnScheduler: Removed TaskSet 1.0, whose tasks have all completed, from pool 
   22/08/30 17:13:09 INFO DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:376) finished in 0.408 s
   22/08/30 17:13:09 INFO DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 0.413011 s
   gg
   0
   Time taken: 0.767 seconds, Fetched 1 row(s)
   22/08/30 17:13:09 INFO SparkSQLCLIDriver: Time taken: 0.767 seconds, Fetched 1 row(s)
   ```
   After the fix, the query returns the correct result. The ORC file on HDFS now contains the following:
   ```
   Rows: 1
   Compression: SNAPPY
   Compression size: 262144
   Type: struct<gg:decimal(3,2)>
   Stripe Statistics:
     Stripe 1:
       Column 0: count: 0 hasNull: false
       Column 1: count: 1 hasNull: false min: 1 max: 1 sum: 1
   File Statistics:
     Column 0: count: 0 hasNull: false
     Column 1: count: 1 hasNull: false min: 1 max: 1 sum: 1
   Stripes:
     Stripe: offset: 3 data: 10 rows: 1 tail: 58 index: 40
       Stream: column 0 section ROW_INDEX start: 3 length 11
       Stream: column 1 section ROW_INDEX start: 14 length 29
       Stream: column 1 section DATA start: 43 length 4
       Stream: column 1 section SECONDARY start: 47 length 6
       Encoding column 0: DIRECT
       Encoding column 1: DIRECT_V2
   File length: 229 bytes
   Padding length: 0 bytes
   Padding ratio: 0%
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

[GitHub] [spark] SelfImpr001 closed pull request #37726: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…

Posted by GitBox <gi...@apache.org>.

SelfImpr001 closed pull request #37726: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…
URL: https://github.com/apache/spark/pull/37726



[GitHub] [spark] SelfImpr001 commented on pull request #37726: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…

Posted by GitBox <gi...@apache.org>.

SelfImpr001 commented on PR #37726:
URL: https://github.com/apache/spark/pull/37726#issuecomment-1231720163

   Changing the PR to target the master branch.




[GitHub] [spark] wangyum commented on pull request #37726: [SPARK-40253] [SQL] Fixed loss of precision for writing 0.00 specific…

Posted by GitBox <gi...@apache.org>.

wangyum commented on PR #37726:
URL: https://github.com/apache/spark/pull/37726#issuecomment-1231519361

   @SelfImpr001 Please create the PR against the master branch. You can change it by:
   <img width="1246" alt="image" src="https://user-images.githubusercontent.com/5399861/187421861-2163f484-8a1a-49fc-bc32-2e36b7910289.png">

