Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/10/10 22:46:17 UTC

[GitHub] [spark] dbtsai opened a new pull request #26085: [SPARK-29434] [Core] Improve the MapStatuses Serialization Performance

URL: https://github.com/apache/spark/pull/26085
 
 
   ### What changes were proposed in this pull request?
   This PR replaces GZIP with ZStd for compressing the serialized `MapStatuses`, because ZStd provides a better compression ratio and faster compression.
   
   The original approach serializes and writes the data directly into a `GZIPOutputStream` in one step; however, compression is faster when the codec processes a bigger chunk of the data at once. In this PR, the serialized data is therefore first written into an uncompressed byte array, and that array is then compressed. For smaller `MapStatuses`, we find that this gives a 2x performance gain.
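
The difference between the one-step and two-step approaches can be sketched as follows. This is only an illustration, not the code in this PR: it uses the JDK's `GZIPOutputStream` as a stand-in codec (ZStd is not in the standard library) and plain Java serialization, and the class and method names (`TwoStepCompression`, `oneStep`, `twoStep`, `decode`) are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper; GZIP stands in for the ZStd codec used in the PR.
final class TwoStepCompression {

    // One step: the serializer writes straight into the compressing stream,
    // so the codec receives the data in many small writes.
    static byte[] oneStep(Object value) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(new GZIPOutputStream(out))) {
            oos.writeObject(value);
        }
        return out.toByteArray();
    }

    // Two steps: serialize into an uncompressed byte array first,
    // then hand the whole array to the codec in a single bulk write.
    static byte[] twoStep(Object value) throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(raw)) {
            oos.writeObject(value);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            raw.writeTo(gz);
        }
        return out.toByteArray();
    }

    // Both variants produce a GZIP stream that is decoded the same way.
    static Object decode(byte[] compressed) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressed)))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        long[] blockSizes = new long[500]; // made-up payload for illustration
        for (int i = 0; i < blockSizes.length; i++) {
            blockSizes[i] = i % 7;
        }
        System.out.println("one step: " + oneStep(blockSizes).length + " bytes");
        System.out.println("two steps: " + twoStep(blockSizes).length + " bytes");
    }
}
```

Either variant decodes the same way, so the change only affects how the writer feeds the codec. The cost of the two-step variant is a temporary buffer holding the uncompressed bytes.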
   
   Here are the benchmark results.
   
   #### 20k map outputs, and each has 500 blocks
   1. ZStd two steps in this PR: 0.402 ops/ms, 89066 bytes
   2. ZStd one step as the original approach: 0.370 ops/ms, 89069 bytes
   3. GZip: 0.092 ops/ms, 217345 bytes
   
   #### 20k map outputs, and each has 5 blocks
   1. ZStd two steps in this PR: 0.9 ops/ms, 75449 bytes
   2. ZStd one step as the original approach: 0.38 ops/ms, 75452 bytes
   3. GZip: 0.21 ops/ms, 160094 bytes
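
To make the relative gains easier to read, the ratios can be derived directly from the figures listed above. The snippet below is only arithmetic on those reported numbers; the class and method names (`BenchmarkRatios`, `ratio`) are hypothetical.

```java
// Hypothetical helper: derives ratios from the benchmark figures quoted above.
final class BenchmarkRatios {

    // How many times larger `a` is than `b`.
    static double ratio(double a, double b) {
        return a / b;
    }

    public static void main(String[] args) {
        // 20k map outputs, 500 blocks each
        System.out.printf("500 blocks: two-step ZStd vs GZip throughput: %.2fx%n", ratio(0.402, 0.092));
        System.out.printf("500 blocks: two-step vs one-step ZStd throughput: %.2fx%n", ratio(0.402, 0.370));
        System.out.printf("500 blocks: GZip size vs two-step ZStd size: %.2fx%n", ratio(217345, 89066));

        // 20k map outputs, 5 blocks each
        System.out.printf("5 blocks: two-step ZStd vs GZip throughput: %.2fx%n", ratio(0.9, 0.21));
        System.out.printf("5 blocks: two-step vs one-step ZStd throughput: %.2fx%n", ratio(0.9, 0.38));
        System.out.printf("5 blocks: GZip size vs two-step ZStd size: %.2fx%n", ratio(160094, 75449));
    }
}
```

In both cases the two-step ZStd path is a bit over 4x faster than GZip and produces output a bit over 2x smaller. The gain of two steps over one step is largest in the 5-block case, at roughly 2.4x.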
   
   ### Why are the changes needed?
   To decrease the time spent serializing `MapStatuses` in large-scale jobs.
   
   ### Does this PR introduce any user-facing change?
   No.
   
   ### How was this patch tested?
   Existing tests.
