Posted to reviews@spark.apache.org by squito <gi...@git.apache.org> on 2017/03/15 21:28:11 UTC

[GitHub] spark issue #11254: [SPARK-13369] Make number of consecutive fetch failures ...

Github user squito commented on the issue:

    https://github.com/apache/spark/pull/11254
  
    sorry this has sat around so long. I agree this is useful, following up on the discussion here: https://github.com/apache/spark/pull/17088
    
    I'd reword the description to something more like this:
    
    The previously hardcoded maximum of 4 retries per stage is not suitable for all cluster configurations.  Since Spark retries a stage as soon as it sees the *first* fetch failure, you can easily end up needing many stage retries to discover all of the failures.  In particular, two scenarios where this value should change are (1) if there are more than 4 executors per node; in that case it may take more than 4 retries to discover the problem with each executor on the node, and (2) during cluster maintenance on large clusters, where multiple machines are serviced at once but you also cannot afford total cluster downtime.  By making this value configurable, cluster managers can tune it to something more appropriate for their cluster configuration.
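
    For illustration, a minimal sketch of how an operator might raise the limit once it is configurable.  The config key spark.stage.maxConsecutiveAttempts is only an assumed name for this example; the actual key is whatever this PR settles on.

        import org.apache.spark.{SparkConf, SparkContext}

        // Sketch: on a cluster with 8 executors per node, allow more consecutive
        // stage attempts so every bad executor on a failing node can be discovered
        // before the job is aborted.  The config key below is an assumption for
        // illustration, not necessarily the name this PR adopts.
        val conf = new SparkConf()
          .setAppName("fetch-failure-limit-example")
          .setIfMissing("spark.master", "local[2]") // normally supplied by spark-submit
          .set("spark.stage.maxConsecutiveAttempts", "8")

        val sc = new SparkContext(conf)

    In practice the same setting would more likely be passed via --conf on spark-submit rather than hardcoded in the application.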


