Posted to issues@spark.apache.org by "Thomas Graves (JIRA)" <ji...@apache.org> on 2017/03/31 13:48:41 UTC

[jira] [Commented] (SPARK-20178) Improve Scheduler fetch failures

    [ https://issues.apache.org/jira/browse/SPARK-20178?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15950917#comment-15950917 ] 

Thomas Graves commented on SPARK-20178:
---------------------------------------

Overall what I would like to accomplish is not throwing away work and making the failure case very performant.  More and more people are running Spark on larger clusters, which means failures are going to occur more often, and we need those failures to be handled as quickly as possible.  We need to be careful here and make sure we handle the case where the node is totally down, the case where the nodemanager is totally down, and the case where the node or nodemanager is just having intermittent issues.  Generally I see the last one, where the issue is just intermittent, but some people have recently hit more of the nodemanager-totally-down case, in which case you want to fail all maps on that node quickly.  The decision on what to rerun is hard: it can be very costly to rerun more than necessary, but at the same time it can be very costly to not rerun everything immediately, because you can burn through all 4 stage attempts.  Which way that trade-off goes really depends on how long the maps and reduces run.  There is a lot of discussion related to that on https://github.com/apache/spark/pull/17088.
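
To make the three situations concrete, here is a rough, purely hypothetical sketch (in Scala, since that is what the scheduler is written in) of the failure modes and the rerun policy each one suggests.  None of these types exist in the code today; they are only meant to frame the discussion:

    sealed trait FetchFailureMode
    case object NodeDown          extends FetchFailureMode  // host unreachable, all outputs lost
    case object NodeManagerDown   extends FetchFailureMode  // shuffle service gone, host still up
    case object IntermittentIssue extends FetchFailureMode  // occasional failed fetches only

    def rerunPolicy(mode: FetchFailureMode): String = mode match {
      // The host (or its shuffle service) is gone: every map output there is
      // unreachable, so fail all of them quickly instead of discovering it
      // one fetch failure at a time.
      case NodeDown | NodeManagerDown => "rerun all map outputs from that host"
      // Transient problem: rerun only the output that failed and let the
      // reducers keep going.
      case IntermittentIssue          => "rerun only the failed map output"
    }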

- We should not kill the reduce tasks on fetch failure.  Leave the reduce tasks running, since they could already have done useful work, like fetching X number of map outputs.  A reducer can simply report that one map output as failed, which would cause only that map to be rerun and only that specific map output to be refetched.  This does require checking that there are enough resources to rerun the map and, if not, possibly killing a reducer or getting more resources if dynamic allocation is enabled.
- Improve the logic around deciding which node is actually bad when you get a fetch failure.  Was it really the node the reduce was on, or the node the map was on?  You could do something like tracking the % of reducers that failed to fetch from a given map output node (see the sketch after this list).
- We should only rerun the maps that failed (or have better logic around how to make this decision); other map outputs could already have been fetched (with bullet one), so there is no need to rerun them if all reducers already have them.  Since the reduce tasks keep running, other fetch failures can happen in parallel, and those would just cause other maps to be rerun.  At some point, based on bullet 2 above, we can decide the entire node is bad.
- Improve the blacklisting based on the above improvements
- Make sure to think about how this plays into the maximum number of stage attempt failures (4, now settable)
- Try not to waste resources: right now we can have 2 copies of the same reduce task running, which uses twice the resources, and there are a bunch of different conditions that determine whether that duplicate work is actually useful.
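
A rough sketch of the bookkeeping the first three bullets imply (hypothetical code, not an existing Spark class): reducers report individual failed map outputs instead of being killed, only those outputs get rerun, and a map host is only declared bad once a meaningful fraction of distinct reducers has failed against it.  The threshold and all names are made up for illustration:

    import scala.collection.mutable

    class FetchFailureTracker(totalReducers: Int, badHostFraction: Double = 0.5) {
      // map host -> ids of reducers that reported a fetch failure against it
      private val failuresByHost = mutable.Map.empty[String, mutable.Set[Int]]
      // individual map outputs (host, mapId) that need to be recomputed
      private val outputsToRerun = mutable.Set.empty[(String, Int)]

      // Bullet 1: the reducer stays alive and just reports the failed output.
      def reportFailure(reducerId: Int, mapHost: String, mapId: Int): Unit = {
        failuresByHost.getOrElseUpdate(mapHost, mutable.Set.empty) += reducerId
        outputsToRerun += ((mapHost, mapId))
      }

      // Bullet 2: only treat the host as bad when a real fraction of the
      // reducers could not fetch from it, not on the first failed fetch.
      def isHostBad(mapHost: String): Boolean =
        failuresByHost.get(mapHost)
          .exists(_.size.toDouble / totalReducers >= badHostFraction)

      // Bullet 3: rerun only the outputs that actually failed, unless the
      // host itself has been declared bad.
      def mapsToRerun(allMapsOnHost: String => Seq[Int]): Seq[(String, Int)] =
        failuresByHost.keys.toSeq.flatMap { host =>
          if (isHostBad(host)) allMapsOnHost(host).map(host -> _)
          else outputsToRerun.filter(_._1 == host).toSeq
        }
    }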

Questions:
- Should we consider having it fetch all map output from a host at once (rather than per executor)?  This could improve fetch times (we would have to test that) as well as fetch failure handling.  It could also cause more maps to be failed at once, which is somewhat contradictory to bullet 3 above; need to think about this more (see the sketch below).
- Do we need a pluggable interface, or how do we avoid destabilizing the current scheduler?
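
On the first question, the change amounts to grouping outstanding fetches by host rather than by (host, executor).  A hypothetical sketch of the difference, with BlockLocation and its fields invented purely for illustration:

    case class BlockLocation(host: String, executorId: String, blockId: String)

    // Proposed: one fetch request per host, covering every executor on it.
    def groupFetchesByHost(blocks: Seq[BlockLocation]): Map[String, Seq[String]] =
      blocks.groupBy(_.host).map { case (host, bs) => host -> bs.map(_.blockId) }

    // Roughly today's behaviour: one request per executor, so a host running
    // several executors gets several smaller requests.
    def groupFetchesByExecutor(blocks: Seq[BlockLocation]): Map[(String, String), Seq[String]] =
      blocks.groupBy(b => (b.host, b.executorId))
        .map { case (key, bs) => key -> bs.map(_.blockId) }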

Bonus or future:
- The decision on when and how many maps to rerun should be a cost-based estimate.  If the maps only take a few seconds to run, we could rerun all maps on the host immediately (see the sketch after this list).
- Option to pre-start reduce tasks so that they can start fetching while the last few maps are finishing (if you have long-tail maps)
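
For the cost-based idea, a minimal sketch (all names and the threshold are invented): if the maps on the failing host are cheap relative to the cost of another fetch failure, rerun all of them immediately; otherwise rerun only the ones that actually failed:

    def selectMapsToRerun(
        failedMapIds: Seq[Int],
        allMapIdsOnHost: Seq[Int],
        avgMapRuntimeMs: Long,
        cheapMapThresholdMs: Long = 5000L): Seq[Int] =
      if (avgMapRuntimeMs <= cheapMapThresholdMs) allMapIdsOnHost
      else failedMapIds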

> Improve Scheduler fetch failures
> --------------------------------
>
>                 Key: SPARK-20178
>                 URL: https://issues.apache.org/jira/browse/SPARK-20178
>             Project: Spark
>          Issue Type: Epic
>          Components: Scheduler
>    Affects Versions: 2.1.0
>            Reporter: Thomas Graves
>
> We have been having a lot of discussions around improving the handling of fetch failures.  There are 4 JIRAs currently related to this.
> We should try to get a list of things we want to improve and come up with one cohesive design.
> SPARK-20163, SPARK-20091, SPARK-14649, and SPARK-19753
> I will put my initial thoughts in a follow on comment.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org