Posted to yarn-issues@hadoop.apache.org by "Eric Yang (JIRA)" <ji...@apache.org> on 2018/12/05 22:00:00 UTC

[jira] [Comment Edited] (YARN-9071) NM and service AM don't have updated status for reinitialized containers

    [ https://issues.apache.org/jira/browse/YARN-9071?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16710608#comment-16710608 ] 

Eric Yang edited comment on YARN-9071 at 12/5/18 9:59 PM:
----------------------------------------------------------

The readiness check and the container launch are asynchronous, so the readiness check can report READY before container reinitialization for the upgrade has fully completed. This is most observable for ENTRY_POINT-based containers whose launch command has to change: the Docker container must be torn down and relaunched, and the asynchronous readiness check can start before the actual upgrade operations do. The asynchronous design keeps the launching and monitoring threads independent of each other to avoid getting stuck during launch. The trade-off is that an asynchronous call may report partial status until all operations have completed. An external system that depends on the final upgrade status should wait a few seconds, or until the container ID changes, to be sure the application status is up to date. YARN-9084 has been opened to improve the in-flight upgrade status.

 

+1 for patch 006.
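
For an external system that needs the final upgrade status, the "wait until the container ID changes" advice above could be sketched roughly as follows. This is only an illustration: fetch_status, the (container_id, state) tuple, and the timeout values are hypothetical and do not correspond to a real YARN client API. The caller supplies whatever function actually queries the service.

```python
import time
from typing import Callable, Optional, Tuple

# Hypothetical status shape: (container_id, state). Not a real YARN type.
Status = Tuple[str, str]


def wait_for_upgrade(
    fetch_status: Callable[[], Status],
    old_container_id: str,
    timeout_s: float = 60.0,
    poll_interval_s: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> Optional[Status]:
    """Poll until the container ID has changed and the state is READY.

    Returns the final (container_id, state), or None if the timeout expires.
    """
    deadline = clock() + timeout_s
    while True:
        container_id, state = fetch_status()
        # A READY report that still carries the old container ID may predate
        # the relaunch, so it is not treated as the final status.
        if container_id != old_container_id and state == "READY":
            return container_id, state
        if clock() >= deadline:
            return None
        sleep(poll_interval_s)
```

The sleep and clock parameters are injectable only so that the loop can be exercised without real delays. A caller would record the container ID before requesting the upgrade and pass it as old_container_id.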


was (Author: eyang):
The readiness check and the container launch are asynchronous, so the readiness check can report READY before container reinitialization for the upgrade has fully completed. This is most observable for ENTRY_POINT-based containers whose launch command has to change: the Docker container must be torn down and relaunched, and the asynchronous readiness check can start before the actual upgrade operations do. The asynchronous design keeps the launching and monitoring threads independent of each other to avoid getting stuck during launch. The trade-off is that an asynchronous call may report partial status until all operations have completed. An external system that depends on the final upgrade status should wait a few seconds, or until the container ID changes, to be sure the application status is up to date.

 

+1 for patch 005.

> NM and service AM don't have updated status for reinitialized containers
> ------------------------------------------------------------------------
>
>                 Key: YARN-9071
>                 URL: https://issues.apache.org/jira/browse/YARN-9071
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Billie Rinaldi
>            Assignee: Chandni Singh
>            Priority: Critical
>         Attachments: YARN-9071.001.patch, YARN-9071.002.patch, YARN-9071.003.patch, YARN-9071.004.patch, YARN-9071.005.patch, YARN-9071.006.patch, q.log
>
>
> Container resource monitoring is not stopped during the reinitialization process, and this prevents the NM from obtaining updated process tree information when the container starts running again. I observed a reinitialized container go from RUNNING to REINITIALIZING to REINITIALIZING_AWAITING_KILL to SCHEDULED to RUNNING. Container monitoring was then started for a second time, but since the trackingContainers entry had already been initialized for the container, ContainersMonitor skipped finding the new PID and IP for the container. A possible solution would be to stop the container monitoring in the reinitialization process so that the process tree information would be initialized properly when monitoring is restarted. When the same container was stopped by the NM later, the NM did not kill the container, and the service AM received an unexpected event (stop at reinitializing).
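
The tracking problem described in the issue can be illustrated with a toy model. This is not the Hadoop ContainersMonitor code: the class, the lookup_pid callback, and the reinitialize helper are hypothetical names used only to show why an already-initialized tracking entry hides the relaunched process, and why stopping monitoring during reinitialization would avoid that.

```python
from typing import Callable, Dict


class ToyContainersMonitor:
    """Toy model of the behaviour described in the issue; not Hadoop code."""

    def __init__(self, lookup_pid: Callable[[str], int]) -> None:
        self._lookup_pid = lookup_pid        # hypothetical PID resolver
        self.tracking: Dict[str, int] = {}   # container id -> tracked PID

    def start_monitoring(self, container_id: str) -> None:
        # Mirrors the reported behaviour: an existing entry is left untouched,
        # so the relaunched process is never looked up again.
        if container_id in self.tracking:
            return
        self.tracking[container_id] = self._lookup_pid(container_id)

    def stop_monitoring(self, container_id: str) -> None:
        # Dropping the entry forces a fresh lookup on the next start.
        self.tracking.pop(container_id, None)


def reinitialize(
    monitor: ToyContainersMonitor,
    container_id: str,
    relaunch: Callable[[], None],
    stop_first: bool,
) -> None:
    """Relaunch a container, optionally stopping monitoring beforehand."""
    if stop_first:
        monitor.stop_monitoring(container_id)
    relaunch()
    monitor.start_monitoring(container_id)
```

With stop_first=False the monitor keeps the PID recorded before the relaunch. With stop_first=True the entry is rebuilt from the new process, which is the direction the proposed solution in the description points to.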


