You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Tao Yang (JIRA)" <ji...@apache.org> on 2019/03/28 13:00:00 UTC

[jira] [Created] (YARN-9423) Optimize AM launcher to avoid bottleneck when a large number of AM failover happen at the same time

Tao Yang created YARN-9423:
------------------------------

             Summary: Optimize AM launcher to avoid bottleneck when a large number of AM failover happen at the same time
                 Key: YARN-9423
                 URL: https://issues.apache.org/jira/browse/YARN-9423
             Project: Hadoop YARN
          Issue Type: Improvement
          Components: resourcemanager
    Affects Versions: 3.2.0
            Reporter: Tao Yang
            Assignee: Tao Yang


We have met a slow recovery for applications when many NM lost happen at the same time:
 # many NM shut down at the same time abnormally.
 # NM expired, then a large number of AM start failover.
 # AM containers are allocated but not launched for about half an hour.

Among this slow recovery, all ApplicationMasterLauncher threads were calling cleanup for containers on these lost nodes and keep retrying to communicate with NM for 3 minutes(retry policy is configured in NMProxy#createNMProxy) even though RM had known these NM are lost and probably can't be connected for a long time. Meanwhile many AM cleanup and launch operations were still waiting in queue (ApplicationMasterLauncher#masterEvents). Obviously AM launch operations were blocked by cleanup operations which are wasting 3 minutes. As a result, AM failover can be a very slow journey.

I think we can optimize AM launcher in two ways:
 # Modify type of ApplicationMasterLauncher#masterEvents from LinkedBlockingQueue to PriorityBlockingQueue, keep executing launch operations in front of cleanup operations.
 # Check node state first and skip cleanup AM containers on non-existent or unusable NM (because these NM probably can't be communicated for a long time) before communicating with NM in cleanup process(AMLauncher#cleanup).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org