Posted to issues@tez.apache.org by "Gopal V (JIRA)" <ji...@apache.org> on 2014/03/21 00:35:46 UTC

[jira] [Created] (TEZ-965) Tez needs a "circuit-breaker" to avoid mistaking network blips to task/node failures

Gopal V created TEZ-965:
---------------------------

             Summary: Tez needs a "circuit-breaker" to avoid mistaking network blips to task/node failures
                 Key: TEZ-965
                 URL: https://issues.apache.org/jira/browse/TEZ-965
             Project: Apache Tez
          Issue Type: Bug
         Environment: Flaky DNS cluster
            Reporter: Gopal V


If DNS resolution fails for a period of 5-10 seconds, Tez restarts tasks and the restarts flow backwards through the query, triggering recovery of nearly everything it has already run.

Nodes are getting marked as bad because they can't shuffle (DNS resolution failed for all NMs), which results in log lines like:

{code}
attempt_1394928384313_0234_1_25_000654_0 blamed for read error from attempt_1394928384313_0234_1_24_000366_0 
{code}

And the tasks restart from an earlier vertex.

When a large number of such failures happen at once, Tez shouldn't restart tasks in previous vertexes; it should instead trip a circuit breaker and back off until the network blip disappears.
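
To make the proposal concrete, here is a minimal sketch of what such a breaker could look like. It is not existing Tez code: the class name FetchFailureCircuitBreaker, its methods, and the threshold/window/back-off parameters are all hypothetical and chosen only for illustration. The idea is to count fetch failures in a sliding window and, once they exceed a threshold, stop blaming source attempts for a back-off period instead of re-running them.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical sketch of a fetch-failure circuit breaker; not part of the Tez API.
 * Callers pass the current time explicitly so the behaviour is easy to test.
 */
final class FetchFailureCircuitBreaker {
    private final int failureThreshold; // failures inside the window that trip the breaker
    private final long windowMillis;    // sliding window used to count failures
    private final long backoffMillis;   // how long the breaker stays open once tripped

    private final Deque<Long> failureTimes = new ArrayDeque<>();
    private long openUntilMillis = Long.MIN_VALUE; // breaker starts closed

    FetchFailureCircuitBreaker(int failureThreshold, long windowMillis, long backoffMillis) {
        this.failureThreshold = failureThreshold;
        this.windowMillis = windowMillis;
        this.backoffMillis = backoffMillis;
    }

    /** Records one fetch failure and returns true if the breaker is open afterwards. */
    synchronized boolean recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
        // Drop failures that have fallen out of the sliding window.
        while (!failureTimes.isEmpty() && nowMillis - failureTimes.peekFirst() > windowMillis) {
            failureTimes.removeFirst();
        }
        if (failureTimes.size() >= failureThreshold) {
            // Too many failures at once: assume a network blip and back off.
            openUntilMillis = nowMillis + backoffMillis;
        }
        return isOpen(nowMillis);
    }

    /** True while the breaker is open, i.e. fetchers should wait and retry. */
    synchronized boolean isOpen(long nowMillis) {
        return nowMillis < openUntilMillis;
    }

    /** True only when a read error should still be blamed on the source attempt. */
    synchronized boolean shouldBlameSource(long nowMillis) {
        return !isOpen(nowMillis);
    }
}
```

With, say, new FetchFailureCircuitBreaker(3, 10000, 30000), three failures within ten seconds open the breaker for thirty seconds; during that time shouldBlameSource returns false, so a fetcher would retry rather than report the upstream attempt as bad. Isolated failures spread further apart than the window never trip it, so genuine single-node problems would still be reported as they are today.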


