Posted to yarn-issues@hadoop.apache.org by "Jeongin Ju (Jira)" <ji...@apache.org> on 2021/08/24 09:51:00 UTC

[jira] [Updated] (YARN-10895) ContainerIdPBImpl objects still can be leaked in RMNodeImpl.completedContainers

     [ https://issues.apache.org/jira/browse/YARN-10895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeongin Ju updated YARN-10895:
------------------------------
    Attachment: YARN-10895.001.patch

> ContainerIdPBImpl objects still can be leaked in RMNodeImpl.completedContainers
> -------------------------------------------------------------------------------
>
>                 Key: YARN-10895
>                 URL: https://issues.apache.org/jira/browse/YARN-10895
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 3.1.2
>            Reporter: Jeongin Ju
>            Priority: Major
>         Attachments: YARN-10895.001.patch
>
>
> YARN-10467 fixed a ContainerIdPBImpl object leak in RMNodeImpl.completedContainers.
> After applying the YARN-10467 patch and operating a cluster with a large number of nodes, we found that a similar heap leak still exists.
> In a heap dump taken after failover (so it is not the active RM), about 4.5G is used by ContainerIdPBImpl objects in RMNodeImpl.completedContainers.
>  
> There are two cases.
>  
> 1. Apps with 'KeepContainersAcrossApplicationAttempts' set
> Even when 'KeepContainersAcrossApplicationAttempts' is set, we should clear RMAppAttemptImpl.justFinishedContainers.
> If an app attempt fails and is retried by a next attempt, we may not need to clear RMAppAttemptImpl.justFinishedContainers, because the related ContainerIdPBImpl objects are handed over to the next attempt and eventually cleared.
> However, when the app itself fails, there is no next attempt and the heap leak occurs.
> (We found this case when a YARN Service application failed after multiple attempts because of OOM in the AM.)
>  
> 2. Apps killed explicitly by the user
> When an app is killed by the user through the 'yarn application -kill' CLI or the web UI, RMAppAttemptImpl.amContainerFinished is not called because the app and app attempt states have already changed.
>  
> To handle this, we added sendFinishedContainersToNMs calls for both RMAppAttemptImpl.finishedContainersSentToAm and RMAppAttemptImpl.justFinishedContainers when the attempt is set to 'KILLED'.
>  
> We found this on our 3.1.2 cluster and patched it there, but trunk seems to still have the same problem.
> I attached a patch based on trunk.
>  
> Thanks!
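
For readers who want to see the shape of case 2 without reading the RM code, here is a minimal toy model in Java. It is not Hadoop code and not the attached patch: ToyNode, ToyAttempt, LeakSketch and all method names are hypothetical stand-ins, and the model assumes only what the description states, namely that the node keeps a container id until the attempt reports it back, and that nothing reports it back when the attempt is killed.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: every class and method here is a hypothetical stand-in,
// not the real RMNodeImpl / RMAppAttemptImpl.
class LeakSketch {

    /** Stand-in for a node that keeps completed container ids until they are acknowledged. */
    static class ToyNode {
        final Set<String> completedContainers = new HashSet<>();

        void containerCompleted(String containerId) {
            completedContainers.add(containerId);
        }

        void ackFinishedContainers(List<String> containerIds) {
            completedContainers.removeAll(containerIds);
        }
    }

    /** Stand-in for an app attempt that tracks finished containers in two lists. */
    static class ToyAttempt {
        final List<String> justFinishedContainers = new ArrayList<>();
        final List<String> finishedContainersSentToAm = new ArrayList<>();
        final ToyNode node;

        ToyAttempt(ToyNode node) {
            this.node = node;
        }

        void containerFinished(String containerId) {
            node.containerCompleted(containerId);
            justFinishedContainers.add(containerId);
        }

        // The AM pulled the finished list; the ids move to the "sent" list.
        void pullJustFinishedContainers() {
            finishedContainersSentToAm.addAll(justFinishedContainers);
            justFinishedContainers.clear();
        }

        // With applyFix == false nothing is reported to the node (the leak).
        // With applyFix == true both lists are reported and cleared on kill.
        void kill(boolean applyFix) {
            if (applyFix) {
                node.ackFinishedContainers(finishedContainersSentToAm);
                node.ackFinishedContainers(justFinishedContainers);
                finishedContainersSentToAm.clear();
                justFinishedContainers.clear();
            }
        }
    }

    /** Returns how many container ids the node still holds after the attempt is killed. */
    static int remainingAfterKill(boolean applyFix) {
        ToyNode node = new ToyNode();
        ToyAttempt attempt = new ToyAttempt(node);
        attempt.containerFinished("container_1");
        attempt.containerFinished("container_2");
        attempt.pullJustFinishedContainers();
        attempt.containerFinished("container_3");
        attempt.kill(applyFix);
        return node.completedContainers.size();
    }

    public static void main(String[] args) {
        System.out.println("without fix: " + remainingAfterKill(false));
        System.out.println("with fix:    " + remainingAfterKill(true));
    }
}
```

In this toy model the node retains all three ids when the kill path reports nothing, and none when both lists are flushed on kill. The real patch should be read from YARN-10895.001.patch itself.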



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org