You are viewing a plain text version of this content. The canonical link for it is here.
Posted to yarn-issues@hadoop.apache.org by "Sathish (JIRA)" <ji...@apache.org> on 2019/03/08 06:13:00 UTC

[jira] [Commented] (YARN-4741) RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue

    [ https://issues.apache.org/jira/browse/YARN-4741?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16787570#comment-16787570 ] 

Sathish commented on YARN-4741:
-------------------------------

I see the issue could be because of the Log aggregation failed for those jobs upon the incorrect permission of remote log directory

 

2016-02-18 01:39:34,164 WARN org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService: Remote Root Log Dir [/var/log/hadoop-yarn] already exist, but with incorrect permissions. Expected: [rwxrwxrwt], Found: [rwxrwxrwx]. The cluster may have problems with multiple users.

 

Since those were kept in your NM's local leveldb, NM was trying to replay those we are getting into this situation.

Have you tried fixing the permissions of those directories and made sure log aggregation was working fine from those nodemanagers ?

 

> RM is flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue
> -----------------------------------------------------------------------------------------------
>
>                 Key: YARN-4741
>                 URL: https://issues.apache.org/jira/browse/YARN-4741
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Sangjin Lee
>            Priority: Critical
>         Attachments: nm.log
>
>
> We had a pretty major incident with the RM where it was continually flooded with RMNodeFinishedContainersPulledByAMEvents in the async dispatcher event queue.
> In our setup, we had the RM HA or stateful restart *disabled*, but NM work-preserving restart *enabled*. Due to other issues, we did a cluster-wide NM restart.
> Some time during the restart (which took multiple hours), we started seeing the async dispatcher event queue building. Normally it would log 1,000. In this case, it climbed all the way up to tens of millions of events.
> When we looked at the RM log, it was full of the following messages:
> {noformat}
> 2016-02-18 01:47:29,530 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
> 2016-02-18 01:47:29,535 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Can't handle this event at current state
> 2016-02-18 01:47:29,538 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmnode.RMNodeImpl: Invalid event FINISHED_CONTAINERS_PULLED_BY_AM on Node  worker-node-foo.bar.net:8041
> {noformat}
> And that node in question was restarted a few minutes earlier.
> When we inspected the RM heap, it was full of RMNodeFinishedContainersPulledByAMEvents.
> Suspecting the NM work-preserving restart, we disabled it and did another cluster-wide rolling restart. Initially that seemed to have helped reduce the queue size, but the queue built back up to several millions and continued for an extended period. We had to restart the RM to resolve the problem.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org