Posted to issues@tez.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2014/03/21 00:35:46 UTC
[jira] [Created] (TEZ-965) Tez needs a "circuit-breaker" to avoid
mistaking network blips to task/node failures
Gopal V created TEZ-965:
---------------------------
Summary: Tez needs a "circuit-breaker" to avoid mistaking network blips to task/node failures
Key: TEZ-965
URL: https://issues.apache.org/jira/browse/TEZ-965
Project: Apache Tez
Issue Type: Bug
Environment: Flaky DNS cluster
Reporter: Gopal V
If DNS resolution fails for 5-10 seconds, Tez restarts tasks and contra-flows back through the query, triggering recovery of nearly everything it has already run.
Nodes are getting marked as bad because they can't shuffle (DNS resolution failed for all NMs), which results in log lines like:
{code}
attempt_1394928384313_0234_1_25_000654_0 blamed for read error from attempt_1394928384313_0234_1_24_000366_0
{code}
And the tasks restart from an earlier vertex.
When a large number of such failures happen, the tasks shouldn't restart previous vertices, but should instead trip a circuit breaker and back off until the network blip disappears.
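A minimal sketch of the idea follows, written in Python for brevity rather than as Tez code. Everything in it is hypothetical: the class name FetchFailureBreaker, the thresholds, and the injectable clock are assumptions for illustration, not existing Tez APIs or settings. It counts read errors in a sliding time window; once too many arrive within that window, it stops blaming source attempts and asks callers to back off for a fixed period.

```python
import time
from collections import deque


class FetchFailureBreaker:
    """Hypothetical circuit breaker for shuffle read errors (not a Tez API)."""

    def __init__(self, max_failures=50, window_secs=10.0,
                 backoff_secs=15.0, clock=time.monotonic):
        self.max_failures = max_failures  # errors in the window that trip the breaker
        self.window_secs = window_secs    # sliding window length, in seconds
        self.backoff_secs = backoff_secs  # how long the breaker stays open
        self.clock = clock                # injectable so tests can fake time
        self.failures = deque()           # timestamps of recent read errors
        self.open_until = 0.0

    def is_open(self):
        """True while callers should back off instead of reporting failures."""
        return self.clock() < self.open_until

    def record_failure(self):
        """Record one read error.

        Returns True if the source attempt should be blamed as usual,
        False if the error looks like part of a cluster-wide blip.
        """
        now = self.clock()
        if now < self.open_until:
            return False
        self.failures.append(now)
        # Drop errors that have aged out of the sliding window.
        while self.failures and now - self.failures[0] > self.window_secs:
            self.failures.popleft()
        if len(self.failures) >= self.max_failures:
            # Too many errors at once: open the breaker and start backing off.
            self.open_until = now + self.backoff_secs
            self.failures.clear()
            return False
        return True
```

In this sketch, errors seen before the threshold is reached are still blamed normally, so a real implementation would probably also need to delay or batch the blame decision. The thresholds would need tuning against cluster size.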
--
This message was sent by Atlassian JIRA
(v6.2#6252)