
Posted to yarn-issues@hadoop.apache.org by "Tao Yang (JIRA)" <ji...@apache.org> on 2019/03/29 08:20:00 UTC

[jira] [Commented] (YARN-9423) Optimize AM launcher to avoid bottleneck when a large number of AM failover happen at the same time

    [ https://issues.apache.org/jira/browse/YARN-9423?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16804680#comment-16804680 ] 

Tao Yang commented on YARN-9423:
--------------------------------

Attached the v1 patch for review. Suggestions are welcome. Thanks!

> Optimize AM launcher to avoid bottleneck when a large number of AM failover happen at the same time
> ---------------------------------------------------------------------------------------------------
>
>                 Key: YARN-9423
>                 URL: https://issues.apache.org/jira/browse/YARN-9423
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>    Affects Versions: 3.2.0
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-9423.001.patch
>
>
> We have seen slow recovery of applications after many NMs were lost:
>  # Many NMs shut down abnormally at the same time.
>  # The NMs expired, and a large number of AMs started to fail over.
>  # AM containers were allocated but not launched, and this state lasted for about half an hour.
> During this slow recovery, all ApplicationMasterLauncher threads were calling cleanup for containers on the lost nodes and kept retrying to communicate with the NMs for 3 minutes (the retry policy is configured in NMProxy#createNMProxy), even though the RM already knew that these NMs were lost and probably could not be reached for a long time. Meanwhile, many AM cleanup and launch operations were still waiting in the queue (ApplicationMasterLauncher#masterEvents). AM launch operations were therefore blocked behind cleanup operations, each of which wasted 3 minutes on retries. As a result, AM failover can be very slow.
> I think we can optimize the AM launcher in two ways:
>  # Change the type of ApplicationMasterLauncher#masterEvents from LinkedBlockingQueue to PriorityBlockingQueue, so that launch operations are always executed ahead of cleanup operations.
>  # In the cleanup process (AMLauncher#cleanup), check the node state first and skip cleaning up AM containers on non-existent or inactive NMs (including DECOMMISSIONED/LOST/REBOOTED/SHUTDOWN; these NMs probably cannot be reached for a long time) before communicating with the NM.
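
The two ideas above can be sketched roughly as follows. This is only a minimal, standalone illustration and not the code in the attached patch: AmLauncherSketch, Event, EventType, NodeState, shouldSkipCleanup and drain are hypothetical stand-ins for the real YARN classes, the node state list is simplified, and the sequence number used to keep FIFO order among events of the same type is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.PriorityBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical stand-ins only; none of these are the real YARN classes.
final class AmLauncherSketch {

  /** Declaration order doubles as priority: LAUNCH sorts before CLEANUP. */
  enum EventType { LAUNCH, CLEANUP }

  /** Simplified node states for illustration (not the full YARN set). */
  enum NodeState { RUNNING, DECOMMISSIONED, LOST, REBOOTED, SHUTDOWN }

  /** States the issue description treats as inactive. */
  static final Set<NodeState> INACTIVE = EnumSet.of(
      NodeState.DECOMMISSIONED, NodeState.LOST,
      NodeState.REBOOTED, NodeState.SHUTDOWN);

  static final class Event implements Comparable<Event> {
    private static final AtomicLong SEQ = new AtomicLong();

    final EventType type;
    final String nodeId;
    // Assumption of this sketch: a sequence number keeps FIFO order
    // among events of the same type.
    final long seq = SEQ.getAndIncrement();

    Event(EventType type, String nodeId) {
      this.type = type;
      this.nodeId = nodeId;
    }

    @Override
    public int compareTo(Event other) {
      int byType = type.compareTo(other.type);
      return byType != 0 ? byType : Long.compare(seq, other.seq);
    }
  }

  /** Idea 2: skip cleanup when the node is unknown or inactive. */
  static boolean shouldSkipCleanup(Map<String, NodeState> nodes, String nodeId) {
    NodeState state = nodes.get(nodeId);
    return state == null || INACTIVE.contains(state);
  }

  /**
   * Idea 1: drain a PriorityBlockingQueue so that launches come out first.
   * Returns a description of each action instead of contacting any node.
   */
  static List<String> drain(PriorityBlockingQueue<Event> queue,
                            Map<String, NodeState> nodes) {
    List<String> actions = new ArrayList<>();
    Event e;
    while ((e = queue.poll()) != null) {
      if (e.type == EventType.CLEANUP && shouldSkipCleanup(nodes, e.nodeId)) {
        actions.add("skip-cleanup:" + e.nodeId);
      } else {
        actions.add(e.type + ":" + e.nodeId);
      }
    }
    return actions;
  }
}
```

In this sketch a cleanup event queued before a launch event is still handled after it, and a cleanup aimed at a lost or unknown node is dropped without any attempt to contact that node, which is the step that would otherwise spend 3 minutes on retries.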



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org