Posted to yarn-issues@hadoop.apache.org by "Tao Yang (JIRA)" <ji...@apache.org> on 2019/03/28 13:02:00 UTC

[jira] [Updated] (YARN-9423) Optimize AM launcher to avoid bottleneck when a large number of AM failover happen at the same time

     [ https://issues.apache.org/jira/browse/YARN-9423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Yang updated YARN-9423:
---------------------------
    Description: 
We have encountered very slow recovery for applications after many NMs were lost:
 # Many NMs shut down abnormally at the same time.
 # The NMs expired, and a large number of AMs started to fail over.
 # AM containers were allocated but not launched for about half an hour.

During this slow recovery, all ApplicationMasterLauncher threads were calling cleanup for containers on the lost nodes and kept retrying to communicate with the NMs for 3 minutes (the retry policy is configured in NMProxy#createNMProxy), even though the RM already knew these NMs were lost and probably could not be reached for a long time. Meanwhile, many AM cleanup and launch operations were still waiting in the queue (ApplicationMasterLauncher#masterEvents). In effect, AM launch operations were blocked behind cleanup operations, each of which wastes up to 3 minutes. As a result, AM failover became a very slow journey.
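For context, the roughly 3-minute window above comes from the NM-connect retry settings that NMProxy#createNMProxy reads. A minimal sketch of where those values live (constant names are from YarnConfiguration; the defaults noted in the comments are the stock ones and should be verified against your Hadoop version):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NmConnectRetryWindow {
  public static void main(String[] args) {
    Configuration conf = new YarnConfiguration();
    // yarn.client.nodemanager-connect.max-wait-ms -- total time spent retrying an NM
    // (stock default is 3 * 60 * 1000 ms, i.e. the 3 minutes observed above).
    long maxWaitMs = conf.getLong(
        YarnConfiguration.CLIENT_NM_CONNECT_MAX_WAIT_MS,
        YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_MAX_WAIT_MS);
    // yarn.client.nodemanager-connect.retry-interval-ms -- sleep between retries
    // (stock default is 10 * 1000 ms).
    long retryIntervalMs = conf.getLong(
        YarnConfiguration.CLIENT_NM_CONNECT_RETRY_INTERVAL_MS,
        YarnConfiguration.DEFAULT_CLIENT_NM_CONNECT_RETRY_INTERVAL_MS);
    System.out.println("NM connect: max wait " + maxWaitMs
        + " ms, retry interval " + retryIntervalMs + " ms");
  }
}
{code}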

I think we can optimize the AM launcher in two ways (a rough sketch of both follows the list):
 # Change the type of ApplicationMasterLauncher#masterEvents from LinkedBlockingQueue to PriorityBlockingQueue, so that launch operations are always executed ahead of cleanup operations.
 # In the cleanup process (AMLauncher#cleanup), check the node state first and skip cleaning up AM containers on non-existent or unusable NMs, since these NMs probably cannot be reached for a long time.
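Here is a rough, uncommitted sketch of both ideas. It borrows the real names ApplicationMasterLauncher#masterEvents, AMLauncher#cleanup and the LAUNCH/CLEANUP event types, but the MasterEvent wrapper, the comparator and the isNodeUsable() helper are illustrative assumptions, not the actual patch:
{code:java}
import java.util.Comparator;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.PriorityBlockingQueue;

// Same two event types the launcher dispatches in YARN.
enum AMLauncherEventType { LAUNCH, CLEANUP }

// Illustrative stand-in for the events queued in masterEvents.
class MasterEvent {
  final AMLauncherEventType type;
  final String nodeId;
  MasterEvent(AMLauncherEventType type, String nodeId) {
    this.type = type;
    this.nodeId = nodeId;
  }
}

class AMLauncherSketch {
  // Idea 1: swap LinkedBlockingQueue for PriorityBlockingQueue so LAUNCH events
  // always sort ahead of CLEANUP events and failovers are not starved by cleanups.
  private final BlockingQueue<MasterEvent> masterEvents =
      new PriorityBlockingQueue<>(11, Comparator.comparingInt(
          (MasterEvent e) -> e.type == AMLauncherEventType.LAUNCH ? 0 : 1));

  // Idea 2: in cleanup, consult the RM's view of the node before opening an NM proxy,
  // so lost or unusable nodes do not cost ~3 minutes of retries each.
  void cleanup(MasterEvent event) {
    if (!isNodeUsable(event.nodeId)) {
      return; // RM already knows the node is gone; skip the NM communication
    }
    // ... otherwise proceed with the normal cleanup RPC to the NM ...
  }

  // Hypothetical helper: a real check would look the node up in the RM's active-nodes map.
  private boolean isNodeUsable(String nodeId) {
    return true; // placeholder
  }
}
{code}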

  was:
We have encountered very slow recovery for applications when many NMs are lost at the same time:
 # Many NMs shut down abnormally at the same time.
 # The NMs expired, and a large number of AMs started to fail over.
 # AM containers were allocated but not launched for about half an hour.

During this slow recovery, all ApplicationMasterLauncher threads were calling cleanup for containers on the lost nodes and kept retrying to communicate with the NMs for 3 minutes (the retry policy is configured in NMProxy#createNMProxy), even though the RM already knew these NMs were lost and probably could not be reached for a long time. Meanwhile, many AM cleanup and launch operations were still waiting in the queue (ApplicationMasterLauncher#masterEvents). In effect, AM launch operations were blocked behind cleanup operations, each of which wastes up to 3 minutes. As a result, AM failover became a very slow journey.

I think we can optimize the AM launcher in two ways:
 # Change the type of ApplicationMasterLauncher#masterEvents from LinkedBlockingQueue to PriorityBlockingQueue, so that launch operations are always executed ahead of cleanup operations.
 # In the cleanup process (AMLauncher#cleanup), check the node state first and skip cleaning up AM containers on non-existent or unusable NMs, since these NMs probably cannot be reached for a long time.


> Optimize AM launcher to avoid bottleneck when a large number of AM failover happen at the same time
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9423
>                 URL: https://issues.apache.org/jira/browse/YARN-9423
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 3.2.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>
> We have encountered very slow recovery for applications after many NMs were lost:
>  # Many NMs shut down abnormally at the same time.
>  # The NMs expired, and a large number of AMs started to fail over.
>  # AM containers were allocated but not launched for about half an hour.
> During this slow recovery, all ApplicationMasterLauncher threads were calling cleanup for containers on the lost nodes and kept retrying to communicate with the NMs for 3 minutes (the retry policy is configured in NMProxy#createNMProxy), even though the RM already knew these NMs were lost and probably could not be reached for a long time. Meanwhile, many AM cleanup and launch operations were still waiting in the queue (ApplicationMasterLauncher#masterEvents). In effect, AM launch operations were blocked behind cleanup operations, each of which wastes up to 3 minutes. As a result, AM failover became a very slow journey.
> I think we can optimize the AM launcher in two ways:
>  # Change the type of ApplicationMasterLauncher#masterEvents from LinkedBlockingQueue to PriorityBlockingQueue, so that launch operations are always executed ahead of cleanup operations.
>  # In the cleanup process (AMLauncher#cleanup), check the node state first and skip cleaning up AM containers on non-existent or unusable NMs, since these NMs probably cannot be reached for a long time.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org