Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2015/02/21 00:34:11 UTC

[jira] [Updated] (YARN-3238) Connection timeouts to nodemanagers are retried at multiple levels

     [ https://issues.apache.org/jira/browse/YARN-3238?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-3238:
-----------------------------
    Attachment: YARN-3238.001.patch

Since the IPC layer already retries connection timeouts, it doesn't make sense to retry them again at the YARN layer.  Attaching a patch that removes socket connection timeouts from the list of errors we retry at the YARN layer.  An alternative approach would be to keep retrying at the YARN layer but explicitly tell the IPC layer _not_ to retry socket timeouts when creating the proxy.  This change seemed simpler, and it matches what we were doing before YARN-2613.
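
To illustrate the idea (this is not the attached patch), here is a minimal, self-contained sketch of an outer-layer retry check that leaves socket timeouts to the inner layer. The class name, the method name, and the particular set of retriable exception types are hypothetical and are not taken from the YARN code.

```java
import java.io.IOException;
import java.net.ConnectException;
import java.net.NoRouteToHostException;
import java.util.Set;

// Hypothetical stand-in for the outer (YARN-level) retry decision.
final class OuterRetrySketch {
    // Assumed list of errors the outer layer retries. SocketTimeoutException
    // is deliberately absent because the inner (IPC) layer already retries it.
    private static final Set<Class<? extends IOException>> RETRIABLE =
        Set.of(ConnectException.class, NoRouteToHostException.class);

    // Returns true only if the error is an instance of a listed type.
    static boolean shouldRetry(IOException e) {
        for (Class<? extends IOException> type : RETRIABLE) {
            if (type.isInstance(e)) {
                return true;
            }
        }
        return false;
    }
}
```

With this shape, a connection timeout that has exhausted the inner layer's retries fails the call immediately instead of starting another full round of inner retries.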

> Connection timeouts to nodemanagers are retried at multiple levels
> ------------------------------------------------------------------
>
>                 Key: YARN-3238
>                 URL: https://issues.apache.org/jira/browse/YARN-3238
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: Jason Lowe
>            Priority: Blocker
>         Attachments: YARN-3238.001.patch
>
>
> The IPC layer will retry connection timeouts automatically (see Client.java), but we are also retrying them with YARN's RetryPolicy put in place when the NM proxy is created.  This creates a two-level retry mechanism in which the IPC layer has already retried many times (45 by default) for each error that the YARN RetryPolicy then retries.  As a result, NM clients can wait a very long time for the connection to finally fail.
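
A rough sketch of why the nesting hurts: with retries at both levels, the number of connection attempts multiplies rather than adds. The model below is a simplification (it ignores any sleep between retries), and the outer retry count and the per-attempt connect timeout are illustrative assumptions; only the inner default of 45 retries comes from the description above.

```java
import java.time.Duration;

// Simplified model of nested retries; names are hypothetical.
final class NestedRetryEstimate {
    // Each outer call makes one initial attempt plus innerRetries retries,
    // and the outer layer makes one initial call plus outerRetries retries.
    static long totalAttempts(long innerRetries, long outerRetries) {
        return (1 + innerRetries) * (1 + outerRetries);
    }

    // Worst case: every attempt runs until the connect timeout expires.
    static Duration worstCaseWait(long innerRetries, long outerRetries,
                                  long connectTimeoutMillis) {
        return Duration.ofMillis(
            totalAttempts(innerRetries, outerRetries) * connectTimeoutMillis);
    }

    public static void main(String[] args) {
        long innerRetries = 45;       // IPC default cited in the issue
        long outerRetries = 3;        // assumed value, for illustration only
        long timeoutMillis = 20_000;  // assumed per-attempt connect timeout
        System.out.println(totalAttempts(innerRetries, outerRetries)
            + " attempts, up to "
            + worstCaseWait(innerRetries, outerRetries, timeoutMillis));
    }
}
```

Dropping the outer retries for timeouts reduces the first factor's multiplier to 1, so the wait is bounded by the inner layer's retries alone.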



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)