Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/10/23 02:21:22 UTC

[GitHub] [spark] weixiuli commented on issue #26206: [SPARK-29551][CORE] Fix a bug about fetch failed when an executor is …

weixiuli commented on issue #26206: [SPARK-29551][CORE] Fix a bug about fetch failed when an executor is …
URL: https://github.com/apache/spark/pull/26206#issuecomment-545234149
 
 
   When an executor is lost for some reason (e.g. the external shuffle service or the executor's host goes down) and a reduce stage happens to hit a `fetch failed` from that executor at about the same time, which is a really bad combination, the existing code only calls `mapOutputTracker.unregisterMapOutput(shuffleId, mapIndex, bmAddress)`, marking a single map output as broken in the map stage. The other map outputs on that executor are unavailable as well, but they can only be resubmitted through a nested retry stage, which is the regression.
   
   So we should distinguish the `failedEpoch` of 'executor lost' from the `fetchFailedEpoch` of 'fetch failed' to solve the above problem.
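
To make the proposal concrete, here is a minimal toy sketch of keeping the two epochs apart. It is not Spark's `DAGScheduler`; `ToyEpochs`, `shouldHandle`, `onExecutorLost` and `onFetchFailed` are hypothetical names, and it assumes cleanup is guarded by a per-executor "last handled epoch" check, so that with one shared map a `fetch failed` arriving at the same epoch as an 'executor lost' would be skipped.

```scala
import scala.collection.mutable

// Toy model only; every name here is hypothetical, not Spark's API.
final class ToyEpochs {
  // Assumption: one "last handled epoch" per executor, tracked per event kind.
  private val failedEpoch = mutable.Map.empty[String, Long]
  private val fetchFailedEpoch = mutable.Map.empty[String, Long]

  // Returns true (and records the epoch) only if this epoch is newer than
  // the last one recorded for the executor in the given map.
  private def shouldHandle(
      seen: mutable.Map[String, Long],
      execId: String,
      epoch: Long): Boolean = {
    val isNew = seen.get(execId).forall(_ < epoch)
    if (isNew) seen(execId) = epoch
    isNew
  }

  // 'executor lost' only consults and updates failedEpoch.
  def onExecutorLost(execId: String, epoch: Long): Boolean =
    shouldHandle(failedEpoch, execId, epoch)

  // 'fetch failed' only consults and updates fetchFailedEpoch, so an earlier
  // 'executor lost' at the same epoch does not suppress it.
  def onFetchFailed(execId: String, epoch: Long): Boolean =
    shouldHandle(fetchFailedEpoch, execId, epoch)
}
```

In this sketch a `true` result stands for "go ahead and clean up the map outputs"; what the cleanup actually does is outside the toy model.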
   
   As we know, the existing code calls `mapOutputTracker.removeOutputsOnHost(host)` or
   `mapOutputTracker.removeOutputsOnExecutor(execId)` when a reduce stage fetch fails and the executor is still active, but it does NOT do so in the case described above.
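
The difference between the two cleanup paths can be illustrated with a small toy model. This is not Spark's `MapOutputTracker`; `ToyTracker` and `MapOutput` are hypothetical, and the method signatures are simplified (an executor id stands in for `bmAddress`).

```scala
// Toy model only; names and signatures are simplified for illustration.
final case class MapOutput(shuffleId: Int, mapIndex: Int, execId: String)

final class ToyTracker(initial: Seq[MapOutput]) {
  private var outputs: Set[MapOutput] = initial.toSet

  // Drops exactly one map output, as in the single-map cleanup path.
  def unregisterMapOutput(shuffleId: Int, mapIndex: Int, execId: String): Unit =
    outputs = outputs.filterNot { o =>
      o.shuffleId == shuffleId && o.mapIndex == mapIndex && o.execId == execId
    }

  // Drops every map output that lived on the given executor.
  def removeOutputsOnExecutor(execId: String): Unit =
    outputs = outputs.filterNot(_.execId == execId)

  // Map outputs the tracker still believes are available.
  def registered: Set[MapOutput] = outputs
}
```

With two outputs registered on a lost executor, `unregisterMapOutput` leaves one stale entry behind, which a reducer would only discover through another fetch failure; `removeOutputsOnExecutor` clears both in one step.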
   
   @cloud-fan 
