Posted to yarn-issues@hadoop.apache.org by "Botong Huang (JIRA)" <ji...@apache.org> on 2018/06/22 19:56:00 UTC

[jira] [Updated] (YARN-8451) Multiple NM heartbeat thread created when a slow NM resync with RM

     [ https://issues.apache.org/jira/browse/YARN-8451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Botong Huang updated YARN-8451:
-------------------------------
    Attachment: YARN-8451.v1.patch

> Multiple NM heartbeat thread created when a slow NM resync with RM
> ------------------------------------------------------------------
>
>                 Key: YARN-8451
>                 URL: https://issues.apache.org/jira/browse/YARN-8451
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Botong Huang
>            Assignee: Botong Huang
>            Priority: Major
>         Attachments: YARN-8451.v1.patch
>
>
> During an NM resync with the RM (say, after the RM did a master/slave switch), if the NM is running slowly, the existing heartbeat thread may put more than one RESYNC event into the NM dispatcher before any of them is processed. As a result, multiple new heartbeat threads are later created and start to heartbeat to the RM concurrently, each with its own responseId. If at some point one thread falls more than one step behind the others, the RM sends back a resync signal in that thread's heartbeat response, killing all containers on this NM.
> See comments below for details on how this can happen. 
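
The failure mode described above can be illustrated with a small, single-threaded Java sketch. It is a simplified model under stated assumptions, not the NodeManager or ResourceManager code and not the approach taken in YARN-8451.v1.patch. The names ResyncSketch, SimpleRm, countHeartbeatThreads, and interleavingTriggersResync are hypothetical, and the RM-side responseId check is only loosely modeled on the behavior the description implies (one step behind is treated as a duplicate, more than one step behind triggers a resync).

```java
import java.util.ArrayDeque;
import java.util.Queue;

public class ResyncSketch {

    /** Hypothetical, simplified RM-side bookkeeping for a single NM. */
    static final class SimpleRm {
        int lastResponseId = 0;

        /**
         * Returns -1 to signal RESYNC, otherwise the responseId the
         * calling heartbeat thread should send next.
         */
        int heartbeat(int nmResponseId) {
            if (nmResponseId + 1 == lastResponseId) {
                // Exactly one step behind: treated as a duplicate heartbeat,
                // so the last response is simply resent.
                return lastResponseId;
            }
            if (nmResponseId != lastResponseId) {
                // More than one step behind (or otherwise out of sync).
                return -1;
            }
            lastResponseId++;
            return lastResponseId;
        }
    }

    /**
     * Drains a queue of RESYNC events and counts how many heartbeat
     * "threads" would be started. With dedupe == false every queued event
     * starts one; with dedupe == true only the first does. The dedupe flag
     * is an assumed guard for illustration only.
     */
    static int countHeartbeatThreads(int queuedResyncEvents, boolean dedupe) {
        Queue<String> dispatcher = new ArrayDeque<>();
        for (int i = 0; i < queuedResyncEvents; i++) {
            dispatcher.add("RESYNC");
        }
        int started = 0;
        boolean resyncInProgress = false;
        while (!dispatcher.isEmpty()) {
            dispatcher.poll();
            if (dedupe && resyncInProgress) {
                continue; // ignore the redundant RESYNC event
            }
            resyncInProgress = true;
            started++;
        }
        return started;
    }

    /**
     * Replays one interleaving of two heartbeat threads, A and B, each
     * keeping its own responseId, and reports whether the RM ends up
     * answering B with a resync signal.
     */
    static boolean interleavingTriggersResync() {
        SimpleRm rm = new SimpleRm();
        int a = 0;
        int b = 0;
        a = rm.heartbeat(a); // RM advances; A now holds the latest id
        b = rm.heartbeat(b); // B is one step behind: handled as a duplicate
        a = rm.heartbeat(a); // A keeps advancing
        a = rm.heartbeat(a); // B is now more than one step behind
        return rm.heartbeat(b) == -1;
    }

    public static void main(String[] args) {
        System.out.println("threads without guard: " + countHeartbeatThreads(3, false));
        System.out.println("threads with guard:    " + countHeartbeatThreads(3, true));
        System.out.println("resync triggered:      " + interleavingTriggersResync());
    }
}
```

In this model, a thread that lags by a single step is tolerated, which is why the problem only surfaces once one of the concurrent heartbeat threads falls further behind.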



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
