Posted to reviews@spark.apache.org by GitBox <gi...@apache.org> on 2019/04/17 07:26:09 UTC

[GitHub] [spark] jiangxb1987 commented on issue #24375: [SPARK-27474][CORE] avoid retrying a task failed with CommitDeniedException many times

jiangxb1987 commented on issue #24375: [SPARK-27474][CORE] avoid retrying a task failed with CommitDeniedException many times
URL: https://github.com/apache/spark/pull/24375#issuecomment-483969940
 
 
   So currently there are three places that track partition status:
   1. The MapStatusTracker keeps track of all shuffle partitions. It is updated by DAGScheduler.handleTaskCompletion(), and on FetchFailed it removes the corresponding shuffle partitions;
   2. The ShuffleMapStatus keeps track of `pendingPartitions` itself. This is effectively a fork of the MapStatusTracker state, and it is cleared when the stage is submitted;
   3. The TaskSetManager keeps track of pending and running tasks. When the number of successful tasks reaches the target number, it marks the TSM (i.e. a stage attempt) as successful.
   
   Ideally we would keep track of the pending partitions of each stage in a single data structure and update it synchronously. The major problem here is that the DAGScheduler relies on MapStatusTracker to read the shuffle partition statuses, and MapStatusTracker is updated asynchronously.
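
   To make the "single data structure, updated synchronously" idea concrete, here is a minimal language-agnostic sketch written in Python. `PendingPartitionTracker` and its methods are hypothetical names chosen for illustration; they are not Spark classes, and the real scheduler would need more than this (stage attempts, map output locations, and so on).

```python
import threading


class PendingPartitionTracker:
    """Hypothetical single source of truth for one stage's pending partitions."""

    def __init__(self, num_partitions):
        self._lock = threading.Lock()
        self._pending = set(range(num_partitions))

    def mark_completed(self, partition):
        """Record a successful task; return True if no partition is pending."""
        with self._lock:
            self._pending.discard(partition)
            return not self._pending

    def mark_lost(self, partition):
        """Put a partition back, e.g. after a fetch failure invalidates it."""
        with self._lock:
            self._pending.add(partition)

    def pending(self):
        """Return an immutable snapshot of the pending partitions."""
        with self._lock:
            return frozenset(self._pending)


# Every update goes through the same lock, so a reader never sees stale state.
tracker = PendingPartitionTracker(num_partitions=3)
tracker.mark_completed(0)
tracker.mark_completed(1)
stage_done = tracker.mark_completed(2)  # last partition finished
tracker.mark_lost(1)                    # a later fetch failure reopens it
```

   The point of the sketch is only that completion and loss are applied to one structure under one lock, so "is the stage done?" has a single answer at any moment.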
   
   If we don't want to make major changes to the current infrastructure, the best approach I can think of is to let DAGScheduler make the final decision on whether a stage has completed, and have all TSMs update their own status accordingly. The only shortcoming is that there is a time window in which some TSM(s) have finished all their tasks but the MapStatusTracker is not yet updated; in this case we will see unnecessary tasks still running. To further avoid this case we can implement another approach that Wenchen suggested: keep a status cache for the TSMs, so that when a task from a zombie TSM completes, the active TSM is notified immediately.
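
   The notification idea can be sketched roughly as follows, again in Python and with hypothetical names (`TaskSetManagerSketch`, `SchedulerSketch`); none of these are Spark APIs. The sketch assumes, for simplicity, that registering a new attempt turns all earlier attempts of the same stage into zombies, and that the stage-level decision is made in one place by looking at the active attempt.

```python
class TaskSetManagerSketch:
    """Hypothetical stand-in for one stage attempt."""

    def __init__(self, stage_id, attempt, partitions):
        self.stage_id = stage_id
        self.attempt = attempt
        self.pending = set(partitions)
        self.is_zombie = False

    def mark_partition_completed(self, partition):
        # Drop the partition so this attempt does not run it again.
        self.pending.discard(partition)


class SchedulerSketch:
    """Hypothetical coordinator that owns the per-stage decision."""

    def __init__(self):
        self.attempts = {}  # stage_id -> attempts, oldest first

    def register(self, tsm):
        # Assumption: a new attempt makes all earlier attempts zombies.
        for older in self.attempts.setdefault(tsm.stage_id, []):
            older.is_zombie = True
        self.attempts[tsm.stage_id].append(tsm)

    def on_task_success(self, tsm, partition):
        # Notify every attempt of the stage, including the active one,
        # even when the finished task belonged to a zombie attempt.
        for other in self.attempts[tsm.stage_id]:
            other.mark_partition_completed(partition)

    def stage_completed(self, stage_id):
        # The decision is made here, not inside an individual attempt.
        return not self.attempts[stage_id][-1].pending


scheduler = SchedulerSketch()
first = TaskSetManagerSketch(stage_id=0, attempt=0, partitions=[0, 1, 2])
scheduler.register(first)
scheduler.on_task_success(first, 0)

retry = TaskSetManagerSketch(stage_id=0, attempt=1, partitions=[1, 2])
scheduler.register(retry)
scheduler.on_task_success(first, 1)  # a zombie task finishes partition 1
```

   After the last call, the retry attempt only has partition 2 left, so it does not launch a redundant task for partition 1.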
