Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2014/07/17 20:33:07 UTC

[jira] [Commented] (YARN-2314) ContainerManagementProtocolProxy can create thousands of threads for a large cluster

    [ https://issues.apache.org/jira/browse/YARN-2314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14065316#comment-14065316 ] 

Jason Lowe commented on YARN-2314:
----------------------------------

The problem is that the cache doesn't try very hard to remove proxies when it is at or beyond the maximum configured size.  When a new proxy is added and an entry should be evicted, the cache simply grabs the least-recently-used proxy and tries to close it.  If that entry is currently in use, nothing is removed immediately, which means we're running with a cache larger than configured.

This can get far worse on a big cluster.  For example, if the least-recently-used proxy is performing a call that is stuck on socket connection retries, it could take quite a while before the LRU entry closes.  During that time each newly created proxy will make the same attempt to close that entry and fail.  When the entry finally does close, the cache is N-1 entries larger than it should be, where N is the number of proxies created while the LRU entry was busy.

On a large cluster with thousands of nodes, a proxy hanging on one node could allow the cache to hold thousands more proxies than configured.  Since each proxy is a thread, that's thousands of threads, and all those thread stacks can blow container limits on the AM (or address space limits if it's a 32-bit AM).
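
To make the growth concrete, here is a minimal sketch of the eviction behavior described above. It is not the actual ContainerManagementProtocolProxy code: the class ProxyCacheSketch, the Proxy stand-in, and the simulate helper are all hypothetical names, and the model assumes a simple access-ordered map where only the LRU entry is ever considered for eviction.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical model of the eviction behavior described in this comment.
// None of these names come from the Hadoop source.
class ProxyCacheSketch {

    // Stand-in for a cached NM proxy; "busy" means a call is in flight.
    static final class Proxy {
        final String node;
        final boolean busy;

        Proxy(String node, boolean busy) {
            this.node = node;
            this.busy = busy;
        }

        // Closing only succeeds when no call is using the proxy.
        boolean tryClose() {
            return !busy;
        }
    }

    private final int maxSize; // assumed to be >= 1
    // Access-ordered map: iteration starts at the least-recently-used entry.
    private final LinkedHashMap<String, Proxy> cache =
        new LinkedHashMap<>(16, 0.75f, true);

    ProxyCacheSketch(int maxSize) {
        this.maxSize = maxSize;
    }

    // Looks only at the LRU entry and gives up if it is busy.
    void add(Proxy p) {
        if (cache.size() >= maxSize) {
            Iterator<Map.Entry<String, Proxy>> it = cache.entrySet().iterator();
            Map.Entry<String, Proxy> lru = it.next();
            if (lru.getValue().tryClose()) {
                it.remove();
            }
            // Otherwise nothing is evicted and the cache grows past maxSize.
        }
        cache.put(p.node, p);
    }

    int size() {
        return cache.size();
    }

    // Fills the cache with a stuck LRU entry, then adds newProxies more
    // and returns the cache size while the LRU entry is still busy.
    static int simulate(int maxSize, int newProxies) {
        ProxyCacheSketch c = new ProxyCacheSketch(maxSize);
        c.add(new Proxy("stuck-node", true));
        for (int i = 1; i < maxSize; i++) {
            c.add(new Proxy("node-" + i, false));
        }
        for (int i = 0; i < newProxies; i++) {
            c.add(new Proxy("new-node-" + i, false));
        }
        return c.size();
    }

    public static void main(String[] args) {
        // Configured size 10, 1000 proxies created while the LRU is stuck.
        System.out.println(simulate(10, 1000)); // prints 1010
    }
}
```

In this model every add while the LRU entry is busy leaves the cache one entry larger, so a single hung node is enough to let the cache track the number of nodes contacted rather than the configured limit.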

> ContainerManagementProtocolProxy can create thousands of threads for a large cluster
> ------------------------------------------------------------------------------------
>
>                 Key: YARN-2314
>                 URL: https://issues.apache.org/jira/browse/YARN-2314
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 2.1.0-beta
>            Reporter: Jason Lowe
>            Priority: Critical
>
> ContainerManagementProtocolProxy has a cache of NM proxies, and the size of this cache is configurable.  However, the cache can grow far beyond the configured size when running on a large cluster and blow AM address/container limits.  More details in the first comment.



--
This message was sent by Atlassian JIRA
(v6.2#6252)