You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/02/27 14:27:27 UTC

[GitHub] [spark] pan3793 commented on pull request #32385: [WIP][SPARK-35275][CORE] Add checksum for shuffle blocks and diagnose corruption

pan3793 commented on pull request #32385:
URL: https://github.com/apache/spark/pull/32385#issuecomment-1053571929


   Hi @Ngone51, thanks for providing the checksum feature for shuffle, have a thought about the following case.
   
   > if the corruption happens inside BufferReleasingInputStream, the reducer will throw the fetch failure immediately no matter what the cause is since the data has been partially consumed by downstream RDDs.
   
   I think the checksum could handle this case. Pick the idea from #28525, we can consume the input stream at the beginning and verify the checksum immediately, then
   
   > If it's a disk issue or unknown, it will throw fetch failure directly. If it's a network issue, it will re-fetch the block later. 
   
   To achieve this, we may need to update the shuffle network protocol to support passing checksums when fetching shuffle blocks.
   
   Compare to the existing `Utils.copyStreamUpTo(input, maxBytesInFlight / 3)`, this approach use less memory and can verify any size of blocks, but it introduce another overhead because it need to read the data 2 times.
   
   We encounter this issues in our production everyday, for some jobs, the performance overhead is acceptable comparing to stability.
   
   @Ngone51 WDYT?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org