You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2020/06/24 18:04:27 UTC

[GitHub] [spark] squito edited a comment on pull request #28848: [SPARK-32003][CORE] When external shuffle service is used, unregister outputs for executor on fetch failure after executor is lost

squito edited a comment on pull request #28848:
URL: https://github.com/apache/spark/pull/28848#issuecomment-648856169


   I've been thinking about this more, and in particular why there are two different epochs and what the meaning of both is.   I tried out another change -- I just moved the `failedEpoch(execId) = currentEpoch` in [`removeExecutorAndUnregisterOutputs`](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1943) inside the `if(fileLost)`.  With that change, all tests in `org.apache.spark.scheduler` pass, including your added test (and I confirmed that your test does indeed fail with the current code in master).
   
   The idea is this:
   the purpose of the epochs is to know what to do when you get a fetch failure.  Epochs tell you if you've already dealt with an "equivalent" failure or not, and whether or not to consider more shuffle output as being lost.  Without the shuffle service, you remove shuffle output when there is an executor failure OR there is a fetch failure.  With the shuffle service, you only do it when there is a fetch failure.  The code already has those two branches, controlled by `fileLost`.
   
   But, I think there is actually a problem there.  
   
   I think I'm mostly repeating stuff @wypoon and @attilapiros said before -- I hope that restating it might help others out.  but it also pointed out to me a lack of test coverage we have.  Particularly that `blockManagerMaster.removeExecutor()` is only called once for multiple fetch failures (or a mix of executorLost + fetch failures).
   
   I'm thinking we should rename the epochs to `blockManagerFailureEpoch` and `externalShuffleFailureEpoch`


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org