Posted to yarn-issues@hadoop.apache.org by "SammiChen (JIRA)" <ji...@apache.org> on 2018/05/07 12:59:00 UTC
[jira] [Updated] (YARN-6661) Too much CLEANUP event hang ApplicationMasterLauncher thread pool

     [ https://issues.apache.org/jira/browse/YARN-6661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

SammiChen updated YARN-6661:
----------------------------
    Fix Version/s:     (was: 2.9.1)

> Too much CLEANUP event hang ApplicationMasterLauncher thread pool
> -----------------------------------------------------------------
>
>                 Key: YARN-6661
>                 URL: https://issues.apache.org/jira/browse/YARN-6661
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.2
>         Environment: hadoop 2.7.2 
>            Reporter: JackZhou
>            Priority: Major
>
> Someone else has already reported a similar problem and fixed it;
> see YARN-3809 (https://issues.apache.org/jira/browse/YARN-3809) for details.
> However, I think that fix does not solve the problem completely. Below is the problem I encountered:
> There are about 1000 nodes in my Hadoop cluster, and I submitted about 1800 apps.
> I failed over my active RM, and the RM then fails over all 1800 apps.
> When an application fails over, the RM waits for the AM container to register itself.
> But there is a bug in my AM (I introduced it intentionally), so it never registers itself.
> So the RM waits about 10 minutes for the AM to expire and then sends a CLEANUP event to the
> ApplicationMasterLauncher thread pool. Because there are about 1800 apps, this hangs the ApplicationMasterLauncher
> thread pool for a long time. I have already applied the patch (https://issues.apache.org/jira/secure/attachment/12740804/YARN-3809.03.patch), so
> each CLEANUP event hangs a thread for 10 * 20 = 200s. But I have 1800 apps, so each of my 50 threads
> hangs for 1800 / 50 * 200s = 7200s = 120min.
> Because the AM has not registered itself within 10 minutes, the RM retries and creates a new application attempt.
> The new application attempt accepts a container from the RM and sends a LAUNCH event to the ApplicationMasterLauncher thread pool.
> Because the 1800 CLEANUP events hang the 50 pool threads for about 2 hours, the new application attempt cannot
> start its AM container within 10 minutes.
> So it expires as well and sends another CLEANUP event to the ApplicationMasterLauncher thread pool.
> As you can see, none of my applications can actually run.
> Each of them has 5 application attempts, as follows, and each of them keeps retrying.
> appattempt_1495786030132_4000_000005
> appattempt_1495786030132_4000_000004
> appattempt_1495786030132_4000_000003
> appattempt_1495786030132_4000_000002
> appattempt_1495786030132_4000_000001
> So all of my apps have been hanging for several hours, and none of them can actually run.
> I think this is a bug. We could treat CLEANUP and LAUNCH as different kinds of events
> and handle LAUNCH events on separate threads, or find some other approach.
> Sorry, my English is poor; I am not sure whether I have described this clearly.
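
The backlog arithmetic in the description can be checked with a short sketch. It assumes that "10 * 20" means ten attempts of 20 seconds each per CLEANUP event (the report does not say which factor is which) and that CLEANUP events are spread evenly over the 50 launcher threads. The class and method names are hypothetical and are not part of YARN.

```java
public class CleanupBacklogEstimate {

    // Worst-case seconds one launcher thread stays busy with CLEANUP events.
    // Assumption: each event blocks for retries * retryIntervalSec seconds,
    // and events are distributed evenly across the pool.
    static long secondsPerThread(int apps, int threads, int retries, int retryIntervalSec) {
        long perEventSec = (long) retries * retryIntervalSec;   // 10 * 20 = 200 s
        long eventsPerThread = (apps + threads - 1) / threads;  // ceil(1800 / 50) = 36
        return eventsPerThread * perEventSec;
    }

    public static void main(String[] args) {
        long total = secondsPerThread(1800, 50, 10, 20);
        // Prints: 7200 s = 120 min
        System.out.println(total + " s = " + (total / 60) + " min");
    }
}
```

With the figures from the report this gives 7200 s per thread, far longer than the roughly 10 minute AM expiry interval, which is why a queued LAUNCH event cannot be served before the new attempt expires.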

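The reporter's suggestion of handling LAUNCH and CLEANUP separately could look roughly like the sketch below. This is not the ApplicationMasterLauncher code and not necessarily how the issue was resolved; the class `SplitLauncherPools`, its methods, and the pool sizes are hypothetical, and only standard `java.util.concurrent` calls are used.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SplitLauncherPools {

    enum EventType { LAUNCH, CLEANUP }

    private final ExecutorService launchPool;
    private final ExecutorService cleanupPool;

    SplitLauncherPools(int launchThreads, int cleanupThreads) {
        this.launchPool = Executors.newFixedThreadPool(launchThreads);
        this.cleanupPool = Executors.newFixedThreadPool(cleanupThreads);
    }

    // Route each event to its own pool so slow CLEANUP work cannot starve LAUNCH.
    Future<?> handle(EventType type, Runnable work) {
        ExecutorService pool = (type == EventType.LAUNCH) ? launchPool : cleanupPool;
        return pool.submit(work);
    }

    void shutdown() {
        launchPool.shutdownNow();
        cleanupPool.shutdownNow();
    }

    // Demo: queue several CLEANUP tasks that block, then check that a LAUNCH
    // task still completes while they are stuck.
    static boolean launchCompletesWhileCleanupBlocked() {
        SplitLauncherPools pools = new SplitLauncherPools(1, 1);
        CountDownLatch release = new CountDownLatch(1);
        try {
            for (int i = 0; i < 5; i++) {
                pools.handle(EventType.CLEANUP, () -> {
                    try {
                        release.await(); // simulates a CLEANUP stuck on an unreachable node
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            Future<?> launch = pools.handle(EventType.LAUNCH, () -> { });
            launch.get(5, TimeUnit.SECONDS); // throws TimeoutException if starved
            return true;
        } catch (Exception e) {
            return false;
        } finally {
            release.countDown();
            pools.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("LAUNCH completed while CLEANUP was blocked: "
                + launchCompletesWhileCleanupBlocked());
    }
}
```

The trade-off in such a design is that the CLEANUP backlog itself is not shortened; it only stops delaying LAUNCH events, so new attempts get a chance to start within the expiry interval.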

