Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2022/11/03 17:56:04 UTC

[GitHub] [spark] otterc commented on pull request #38333: [SPARK-40872] Fallback to original shuffle block when a push-merged shuffle chunk is zero-size

otterc commented on PR #38333:
URL: https://github.com/apache/spark/pull/38333#issuecomment-1302475841

   > If there are hardware issues that are causing failures, it is better to move the nodes to the deny list and prevent them from being used: we will keep seeing more failures, including for vanilla shuffle.
   > 
   > On the other hand, I can also look at this as a data corruption issue - @otterc, what was the plan for how we support shuffle corruption diagnosis for push-based shuffle ([SPARK-36206](https://issues.apache.org/jira/browse/SPARK-36206), etc.)? Is the expectation that we fall back? Or that we diagnose and fail?
   > 
   I think it is more efficient to fall back and fetch the original map outputs instead of failing the stage and regenerating the data of the partition. When the corrupt blocks are merged shuffle blocks or chunks, we don't retry fetching them anyway; we fall back immediately.
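
   To make that policy concrete, here is a minimal, self-contained Scala sketch of the decision described above. All names here (FallbackPolicy, ShuffleMergedChunkId, etc.) are hypothetical stand-ins, not Spark's actual internals: a corrupt push-merged chunk is never re-fetched and we immediately fall back to the original per-map blocks, while a corrupt original block may be retried before the stage has to fail.

       // Hypothetical sketch of the fallback policy; not Spark's real classes.
       sealed trait BlockId
       case class ShuffleBlockId(shuffleId: Int, mapId: Long, reduceId: Int) extends BlockId
       case class ShuffleMergedChunkId(shuffleId: Int, reduceId: Int, chunkId: Int) extends BlockId

       sealed trait FetchAction
       case object Retry extends FetchAction
       case class FallbackToOriginalBlocks(shuffleId: Int, reduceId: Int) extends FetchAction
       case object FailStage extends FetchAction

       object FallbackPolicy {
         // Decide what to do when a fetched block turns out to be corrupt or zero-size.
         def onCorruptBlock(blockId: BlockId, retriesLeft: Int): FetchAction = blockId match {
           // Push-merged chunks are an optimization; the original map outputs
           // still exist, so fall back right away instead of retrying.
           case ShuffleMergedChunkId(shuffleId, reduceId, _) =>
             FallbackToOriginalBlocks(shuffleId, reduceId)
           // Original shuffle blocks have no cheaper substitute: retry while
           // possible, otherwise fail the stage and regenerate the map outputs.
           case ShuffleBlockId(_, _, _) =>
             if (retriesLeft > 0) Retry else FailStage
         }
       }

       object Demo extends App {
         println(FallbackPolicy.onCorruptBlock(ShuffleMergedChunkId(0, 5, 2), retriesLeft = 3))
         // FallbackToOriginalBlocks(0,5)
         println(FallbackPolicy.onCorruptBlock(ShuffleBlockId(0, 7L, 5), retriesLeft = 0))
         // FailStage
       }

   The point of the asymmetry: falling back to original blocks only re-reads data that already exists, whereas failing the stage forces the upstream map tasks to be re-executed, which is far more expensive.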


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

