Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2014/07/23 00:00:42 UTC

[jira] [Updated] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

     [ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jason Lowe updated YARN-2314:
-----------------------------

    Attachment: nmproxycachefix.prototype.patch

I was thinking along similar lines, but I am worried about the corner case where all RPCs are in use.  I think we need to handle this case even if it's rare.  Otherwise, an AM running on a node that can see the RM but is cut off from the rest of the cluster could go bad very quickly.  If we don't handle the corner case, we'll continue to grow the proxy cache beyond its configured limit as we do today, and that AM will explode with thousands of threads for what may be a temporary network outage.

While debugging this I wrote up a quick prototype patch that tries to keep the cache under its configured limit.  Attaching the patch for reference.  However, as I mentioned above, simply keeping the NM proxy cache under its configured limit means nothing if we don't address the problems with connections remaining open in the IPC Client layer.
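
For illustration only, and not the contents of the attached patch: a minimal sketch of one way a proxy cache could stay under its limit, including the corner case where every cached proxy is in use.  The class BoundedProxyCache, the ProxyFactory interface, and the acquire/release methods are all hypothetical names, not Hadoop APIs.  The sketch evicts the least recently used idle proxy when the cache is full, and makes the caller wait when nothing is idle instead of growing the cache.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;

/** Hypothetical bounded proxy cache; names and behavior are assumptions. */
class BoundedProxyCache<P> {
  /** Hypothetical stand-in for whatever creates a proxy to one NM. */
  interface ProxyFactory<P> {
    P create(String nodeAddress);
  }

  private static final class Entry<P> {
    final P proxy;
    int activeCallers; // callers currently holding this proxy
    Entry(P proxy) { this.proxy = proxy; }
  }

  private final int maxSize;
  private final ProxyFactory<P> factory;
  // Access-ordered map: iteration starts at the least recently used entry.
  private final LinkedHashMap<String, Entry<P>> cache =
      new LinkedHashMap<String, Entry<P>>(16, 0.75f, true);

  BoundedProxyCache(int maxSize, ProxyFactory<P> factory) {
    this.maxSize = maxSize;
    this.factory = factory;
  }

  /** Returns a proxy for the node, waiting if the cache is full and busy. */
  synchronized P acquire(String nodeAddress) {
    Entry<P> e = cache.get(nodeAddress);
    while (e == null) {
      if (cache.size() < maxSize || evictOneIdle()) {
        e = new Entry<P>(factory.create(nodeAddress));
        cache.put(nodeAddress, e);
      } else {
        // Every cached proxy is in use: wait rather than exceed the limit.
        try {
          wait();
        } catch (InterruptedException ie) {
          Thread.currentThread().interrupt();
          throw new IllegalStateException("interrupted waiting for a slot", ie);
        }
        e = cache.get(nodeAddress); // another thread may have added it
      }
    }
    e.activeCallers++;
    return e.proxy;
  }

  /** Marks one use of the node's proxy as finished. */
  synchronized void release(String nodeAddress) {
    Entry<P> e = cache.get(nodeAddress);
    if (e != null && e.activeCallers > 0 && --e.activeCallers == 0) {
      notifyAll(); // an entry became idle, so a waiter may now evict it
    }
  }

  /** Removes the least recently used idle entry, if there is one. */
  private boolean evictOneIdle() {
    Iterator<Entry<P>> it = cache.values().iterator();
    while (it.hasNext()) {
      if (it.next().activeCallers == 0) {
        it.remove(); // a real cache would also stop the evicted proxy here
        return true;
      }
    }
    return false;
  }

  synchronized int size() {
    return cache.size();
  }
}
```

A caller would pair each acquire with a release in a finally block.  Note that a sketch like this only bounds the cache itself; it does nothing about connections remaining open in the IPC Client layer, which is the other half of the problem described above.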

> ContainerManagementProtocolProxy can create thousands of threads for a large cluster
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2314
>                 URL: https://issues.apache.org/jira/browse/YARN-2314
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>            Priority: Critical
>         Attachments: nmproxycachefix.prototype.patch
>
>
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable.  However, the cache can grow far beyond the configured size when running on a large cluster and blow past the AM's address/container limits.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)