Posted to yarn-issues@hadoop.apache.org by "Akira Ajisaka (Jira)" <ji...@apache.org> on 2021/09/01 01:50:00 UTC

[jira] [Updated] (YARN-10428) Zombie applications in the YARN queue using FAIR + sizebasedweight

     [ https://issues.apache.org/jira/browse/YARN-10428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Akira Ajisaka updated YARN-10428:
---------------------------------
    Priority: Critical  (was: Major)

> Zombie applications in the YARN queue using FAIR + sizebasedweight
> ------------------------------------------------------------------
>
>                 Key: YARN-10428
>                 URL: https://issues.apache.org/jira/browse/YARN-10428
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 2.8.5
>            Reporter: Guang Yang
>            Assignee: Andras Gyori
>            Priority: Critical
>             Fix For: 3.4.0
>
>         Attachments: YARN-10428.001.patch, YARN-10428.002.patch, YARN-10428.003.patch
>
>
> We are seeing zombie jobs in a YARN queue that uses the FAIR ordering policy with size-based weight enabled.
> *Detection:*
> The YARN UI shows an incorrect value for "Num Schedulable Applications".
> *Impact:*
> The queue has an upper limit on the number of running applications. With zombie jobs counted, the queue hits that limit even though the number of applications actually running is far below it.
> *Workaround:*
> Fail over and restart the ResourceManager process.
> *Analysis:*
> In the heap dump, we can find the zombie jobs in `FairOrderingPolicy#schedulableEntities` (see attachment). Take application "application_1599157165858_29429" as an example: it is still in the `FairOrderingPolicy#schedulableEntities` set. However, the ResourceManager log shows that the RM had already tried to remove the application:
>  
> ./yarn-yarn-resourcemanager-ip-172-21-153-252.log.2020-09-04-04:2020-09-04 04:32:19,730 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue (ResourceManager Event Processor): Application removed - appId: application_1599157165858_29429 user: svc_di_data_eng queue: core-data #user-pending-applications: -3 #user-active-applications: 7 #queue-pending-applications: 0 #queue-active-applications: 21
>  
> So it appears that the RM failed to remove the application from the set.
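
The description above does not say why the removal failed. One mechanism that would produce the same symptom, offered here as an assumption and not as a confirmed diagnosis of this issue, is a sorted set whose comparator reads a value that changes while the element is in the set. The sketch below uses only java.util.TreeSet; the names StaleComparatorDemo, App, and demand are hypothetical and are not YARN classes or fields.

```java
import java.util.Comparator;
import java.util.TreeSet;

public class StaleComparatorDemo {

    /** Hypothetical stand-in for a schedulable application (not a YARN class). */
    static final class App {
        final String id;
        long demand; // mutable value that the comparator depends on

        App(String id, long demand) {
            this.id = id;
            this.demand = demand;
        }
    }

    /** Orders by the mutable demand first, then by id as a tie-breaker. */
    static TreeSet<App> newSet() {
        Comparator<App> byDemandThenId =
                Comparator.comparingLong((App a) -> a.demand)
                          .thenComparing((App a) -> a.id);
        return new TreeSet<>(byDemandThenId);
    }

    /** Changes the sort key in place; the later remove() cannot find the entry. */
    public static boolean removeFailsAfterKeyChange() {
        TreeSet<App> apps = newSet();
        App a = new App("app_a", 1);
        apps.add(a);
        apps.add(new App("app_b", 2));
        apps.add(new App("app_c", 3));

        a.demand = 10; // the tree still holds 'a' at its old position

        boolean removed = apps.remove(a); // lookup follows the new key and misses
        return !removed && apps.size() == 3; // 'a' is left behind in the set
    }

    /** Takes the entry out before changing the key and puts it back afterwards. */
    public static boolean removeSucceedsWithReinsert() {
        TreeSet<App> apps = newSet();
        App a = new App("app_a", 1);
        apps.add(a);
        apps.add(new App("app_b", 2));
        apps.add(new App("app_c", 3));

        apps.remove(a);
        a.demand = 10;
        apps.add(a); // re-inserted at the position matching the new key

        boolean removed = apps.remove(a);
        return removed && apps.size() == 2;
    }

    public static void main(String[] args) {
        System.out.println("in-place key change leaves a stale entry: "
                + removeFailsAfterKeyChange());
        System.out.println("remove/re-add around the change works: "
                + removeSucceedsWithReinsert());
    }
}
```

In the first method the entry stays in the set after the remove call, so a count derived from the set size would be too high. Whether this is what happened in the reported queue cannot be determined from the log excerpt alone.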


