Posted to yarn-issues@hadoop.apache.org by "Botong Huang (JIRA)" <ji...@apache.org> on 2018/06/22 19:51:00 UTC

[jira] [Commented] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM

    [ https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16520727#comment-16520727 ] 

Botong Huang commented on YARN-8451:
------------------------------------

Here is an example of how more than one heartbeat thread can be created:
1. A YarnRM master/slave switch happens. When the new YarnRM comes up, it tells the NM to resync (without killing its containers) upon the first NM heartbeat.
2. Every time the NM heartbeats to the RM and gets a resync signal, it dispatches a NodeManagerEventType.RESYNC event and moves on.
3. NodeManager.resyncWithRM() is the handler for this event.
4. When the NM dispatcher is running slowly, by the time the first event is processed, the NM heartbeat thread has already sent more heartbeats and put more NodeManagerEventType.RESYNC events into the dispatcher event queue.
5. Multiple threads are created inside NodeManager.resyncWithRM(), and all of them block at statusUpdater.join() inside NodeStatusUpdaterImpl.rebootNodeStatusUpdaterAndRegisterWithRM().
6. When the previous heartbeat thread exits, every blocked thread is released and creates a new heartbeat thread.
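
The sequence above can be reproduced in miniature. The following is only a sketch under stated assumptions, not the NodeManager code: the class ResyncRaceSketch and the method simulate() are hypothetical names, a plain thread stands in for the old heartbeat thread, and a counter stands in for "start a new heartbeat thread". It shows that when N RESYNC events are handled while the old heartbeat thread is still alive, N handler threads block on join() and each one goes on to start its own heartbeat thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-alone model of the race; none of these names exist in Hadoop.
class ResyncRaceSketch {

    // Handles `queuedResyncEvents` RESYNC events while the old heartbeat thread
    // is still running, and returns how many new heartbeat threads get started.
    static int simulate(int queuedResyncEvents) {
        CountDownLatch oldHeartbeatMayExit = new CountDownLatch(1);

        // Stands in for the existing heartbeat thread, which has not exited yet.
        Thread oldHeartbeat = new Thread(() -> {
            try {
                oldHeartbeatMayExit.await();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }, "old-heartbeat");
        oldHeartbeat.start();

        AtomicInteger newHeartbeatThreads = new AtomicInteger();
        List<Thread> resyncThreads = new ArrayList<>();

        // One handler thread per queued RESYNC event (step 5 above).
        for (int i = 0; i < queuedResyncEvents; i++) {
            Thread resync = new Thread(() -> {
                try {
                    oldHeartbeat.join();                   // models statusUpdater.join()
                    newHeartbeatThreads.incrementAndGet(); // models starting a new heartbeat thread
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }, "resync-" + i);
            resync.start();
            resyncThreads.add(resync);
        }

        // The old heartbeat thread exits; every blocked handler is released (step 6).
        oldHeartbeatMayExit.countDown();
        try {
            for (Thread t : resyncThreads) {
                t.join();
            }
            oldHeartbeat.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        }
        return newHeartbeatThreads.get();
    }

    public static void main(String[] args) {
        // Three queued RESYNC events lead to three new heartbeat threads.
        System.out.println(simulate(3)); // prints 3
    }
}
```

In this model the number of new heartbeat threads always equals the number of RESYNC events handled, because nothing deduplicates the events before a handler thread is spawned.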

> Multiple NM heartbeat thread created when a slow NM resync with RM
> ------------------------------------------------------------------
>
>                 Key: YARN-8451
>                 URL: https://issues.apache.org/jira/browse/YARN-8451
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Major
>
> During an NM resync with the RM (say, after an RM master/slave switch), if the NM is running slowly, the existing heartbeat thread may put more than one RESYNC event into the NM dispatcher before any of them is processed. As a result, multiple new heartbeat threads are later created and start heartbeating to the RM concurrently, each with its own responseId. If at some point one thread falls more than one step behind the others, the RM sends back a resync signal in that heartbeat response, killing all containers on this NM.
> See comments below for details on how this can happen. 
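
To illustrate why concurrent heartbeat threads with separate responseIds end in a resync, here is a simplified model of the RM-side check as the description implies it. It is a sketch under assumptions, not the actual RM code: ResponseIdSketch, RmNodeState, and Action are hypothetical names, and the rule "one step behind is tolerated as a duplicate, anything else triggers a resync" is inferred from the description above.

```java
// Hypothetical model of per-node responseId tracking on the RM side.
class ResponseIdSketch {

    enum Action { NORMAL, RESEND_LAST, RESYNC }

    static final class RmNodeState {
        private int lastResponseId = 0;

        // Decides how the RM reacts to a heartbeat carrying `nmResponseId`.
        Action onHeartbeat(int nmResponseId) {
            if (nmResponseId == lastResponseId) {
                lastResponseId++;          // in sync: advance and reply normally
                return Action.NORMAL;
            }
            if (nmResponseId + 1 == lastResponseId) {
                return Action.RESEND_LAST; // exactly one step behind: treated as a duplicate
            }
            return Action.RESYNC;          // further out of step: ask the NM to resync
        }

        int lastResponseId() {
            return lastResponseId;
        }
    }

    public static void main(String[] args) {
        RmNodeState rm = new RmNodeState();
        int threadA = 0; // responseId held by heartbeat thread A
        int threadB = 0; // responseId held by heartbeat thread B

        System.out.println(rm.onHeartbeat(threadA)); // NORMAL
        threadA = rm.lastResponseId();               // A is now at 1

        System.out.println(rm.onHeartbeat(threadB)); // RESEND_LAST (B is one step behind)
        threadB = rm.lastResponseId();               // B catches up to 1

        System.out.println(rm.onHeartbeat(threadA)); // NORMAL, RM moves to 2
        threadA = rm.lastResponseId();
        System.out.println(rm.onHeartbeat(threadA)); // NORMAL, RM moves to 3

        System.out.println(rm.onHeartbeat(threadB)); // RESYNC (B is two steps behind)
    }
}
```

With a single heartbeat thread the second and third branches are rarely reached; with several threads advancing the same RM-side counter independently, any scheduling delay can push one of them into the RESYNC branch.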


