Posted to yarn-issues@hadoop.apache.org by "Tao Yang (JIRA)" <ji...@apache.org> on 2019/07/31 13:39:00 UTC
[jira] [Created] (YARN-9716) AM container might leak
Tao Yang created YARN-9716:
------------------------------
Summary: AM container might leak
Key: YARN-9716
URL: https://issues.apache.org/jira/browse/YARN-9716
Project: Hadoop YARN
Issue Type: Bug
Components: resourcemanager
Affects Versions: 3.3.0
Reporter: Tao Yang
Assignee: Tao Yang
There is a risk that the AM container might leak when the NM exits unexpectedly while the AM container is still localizing, if the AM expiry interval (conf-key: yarn.am.liveness-monitor.expiry-interval-ms) is less than the NM expiry interval (conf-key: yarn.nm.liveness-monitor.expiry-interval-ms).
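To illustrate the precondition, here is a minimal sketch that compares the two expiry intervals. The class and method names (ExpiryIntervalCheck, isLeakPossible) are hypothetical, the values come from a plain Map rather than Hadoop's Configuration class, and the 600000 ms fallback is an assumption about the stock defaults that should be checked against your yarn-default.xml.

```java
import java.util.Map;

public class ExpiryIntervalCheck {
    static final String AM_KEY = "yarn.am.liveness-monitor.expiry-interval-ms";
    static final String NM_KEY = "yarn.nm.liveness-monitor.expiry-interval-ms";

    // Assumed fallback (10 minutes) when a key is not set; verify against your cluster.
    static final long ASSUMED_DEFAULT_MS = 600_000L;

    // Reads an interval in milliseconds from a plain key/value map.
    static long intervalMs(Map<String, String> conf, String key) {
        String value = conf.get(key);
        return value == null ? ASSUMED_DEFAULT_MS : Long.parseLong(value.trim());
    }

    // The scenario in this issue requires the AM to expire before the NM does.
    static boolean isLeakPossible(Map<String, String> conf) {
        return intervalMs(conf, AM_KEY) < intervalMs(conf, NM_KEY);
    }

    public static void main(String[] args) {
        Map<String, String> conf = Map.of(AM_KEY, "300000", NM_KEY, "600000");
        System.out.println("leak possible: " + isLeakPossible(conf));
    }
}
```

With the AM interval at 5 minutes and the NM interval at 10 minutes, the attempt expires while the RM still considers the node alive, which is the window described above.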
RMAppAttempt state changes as follows:
{noformat}
LAUNCHED/RUNNING – event:EXPIRED(FinalSavingTransition)
--> FINAL_SAVING – event:ATTEMPT_UPDATE_SAVED(FinalStateSavedTransition / ExpiredTransition: send AMLauncherEventType.CLEANUP ) --> FAILED
{noformat}
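The transition sequence above can be modelled with a toy state machine. This is only an illustrative sketch, not the RMAppAttemptImpl code: the class name AttemptExpirySketch and the launcherEvents list are hypothetical, and the real transitions do more work (state-store writes, diagnostics, and so on).

```java
import java.util.ArrayList;
import java.util.List;

public class AttemptExpirySketch {
    enum State { LAUNCHED, RUNNING, FINAL_SAVING, FAILED }
    enum Event { EXPIRED, ATTEMPT_UPDATE_SAVED }

    State state;
    // Records events that would be sent to the AM launcher (hypothetical stand-in).
    final List<String> launcherEvents = new ArrayList<>();

    AttemptExpirySketch(State initial) {
        this.state = initial;
    }

    void handle(Event event) {
        if ((state == State.LAUNCHED || state == State.RUNNING) && event == Event.EXPIRED) {
            // FinalSavingTransition: wait for the final state to be stored.
            state = State.FINAL_SAVING;
        } else if (state == State.FINAL_SAVING && event == Event.ATTEMPT_UPDATE_SAVED) {
            // ExpiredTransition: ask the launcher to clean up the AM container.
            launcherEvents.add("CLEANUP");
            state = State.FAILED;
        } else {
            throw new IllegalStateException("Unexpected " + event + " in state " + state);
        }
    }

    public static void main(String[] args) {
        AttemptExpirySketch attempt = new AttemptExpirySketch(State.LAUNCHED);
        attempt.handle(Event.EXPIRED);
        attempt.handle(Event.ATTEMPT_UPDATE_SAVED);
        System.out.println(attempt.state + " " + attempt.launcherEvents);
    }
}
```

The point of the model is that CLEANUP is issued exactly once, on the way to FAILED, so if that single cleanup is skipped nothing in this path retries it.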
AMLauncherEventType.CLEANUP is handled by AMLauncher#cleanup, which internally calls ContainerManagementProtocol#stopContainer to stop the AM container by communicating with the NM. If the NM can't be reached, the cleanup is simply skipped without any log output.
I think in this case we can complete the AM container in the scheduler when stopping it fails, so that it will have a chance to be stopped when the NM reconnects with the RM.
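A minimal sketch of the proposed fallback follows. The interfaces NodeManagerClient and SchedulerHook, and the class CleanupFallbackSketch, are hypothetical stand-ins for illustration only; they are not the actual AMLauncher, ContainerManagementProtocol, or scheduler APIs, and how the NM is later told to stop the container on reconnect is outside this sketch.

```java
import java.io.IOException;

public class CleanupFallbackSketch {
    /** Hypothetical stand-in for the RPC that stops a container on the NM. */
    interface NodeManagerClient {
        void stopContainer(String containerId) throws IOException;
    }

    /** Hypothetical stand-in for a scheduler-side "container completed" hook. */
    interface SchedulerHook {
        void completeContainer(String containerId, String diagnostics);
    }

    /** Returns true if the NM stopped the container, false if the fallback was used. */
    static boolean cleanup(String containerId, NodeManagerClient nm, SchedulerHook scheduler) {
        try {
            nm.stopContainer(containerId);
            return true;
        } catch (IOException e) {
            // Log the failure instead of skipping it silently.
            System.err.println("Failed to stop AM container " + containerId + ": " + e.getMessage());
            // Fallback: mark the container completed in the scheduler so it is not forgotten.
            scheduler.completeContainer(containerId, "NM unreachable during AM cleanup");
            return false;
        }
    }

    public static void main(String[] args) {
        boolean stopped = cleanup(
            "container_example_0001",
            id -> { throw new IOException("connection refused"); },
            (id, diag) -> System.out.println("completed in scheduler: " + id + " (" + diag + ")"));
        System.out.println("stopped by NM: " + stopped);
    }
}
```

The design choice here is to treat an unreachable NM as a reason to record the container as completed on the RM side rather than as a reason to do nothing, which also gives operators a log line to search for.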
Hope to hear your thoughts. Thank you!
--
This message was sent by Atlassian JIRA
(v7.6.14#76016)