Posted to yarn-issues@hadoop.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2016/03/08 15:47:41 UTC

[jira] [Commented] (YARN-4771) Some containers can be skipped during log aggregation after NM restart

    [ https://issues.apache.org/jira/browse/YARN-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15184991#comment-15184991 ] 

Jason Lowe commented on YARN-4771:
----------------------------------

The problem occurs because removeVeryOldStoppedContainersFromCache removes from the state store any container that completed at least yarn.nodemanager.duration-to-track-stopped-containers milliseconds ago.  Once the container's state is gone from the state store, there is nothing to recover for that container when the NM restarts.  With no recovered information about the container, the log aggregation service does not know it needs to aggregate that container's logs, so the container is skipped during log aggregation.
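
To make the sequence concrete, here is a minimal toy model of the behavior described above, written in Python rather than taken from the NodeManager code.  Everything in it is an assumption for illustration: the ToyNodeManager class, its fields, the simulate helper, and the 600000 ms threshold are hypothetical stand-ins, and the sketch assumes the pruning step runs whenever another container completion is recorded, which is what the second condition in the issue description suggests.

```python
from dataclasses import dataclass, field

# Illustrative stand-in for yarn.nodemanager.duration-to-track-stopped-containers (ms).
DURATION_TO_TRACK_STOPPED_MS = 600_000


@dataclass
class ToyNodeManager:
    """Toy model only; these are not the real NodeManager classes."""
    # container id -> completion time in ms; this dict survives a "restart"
    state_store: dict = field(default_factory=dict)
    # in-memory cache of recently stopped containers; lost on restart
    recently_stopped: dict = field(default_factory=dict)

    def complete_container(self, cid, now_ms):
        # Record the completion, then prune old entries.
        # Assumption: pruning is triggered by a later container completing.
        self.state_store[cid] = now_ms
        self.recently_stopped[cid] = now_ms
        self.remove_very_old_stopped_containers(now_ms)

    def remove_very_old_stopped_containers(self, now_ms):
        # Drop containers that completed at least the tracking duration ago,
        # from both the cache and the persistent state store.
        for cid, done_ms in list(self.recently_stopped.items()):
            if now_ms - done_ms >= DURATION_TO_TRACK_STOPPED_MS:
                del self.recently_stopped[cid]
                self.state_store.pop(cid, None)

    def restart(self):
        # After a restart only what is in the state store can be recovered,
        # so only these containers are visible to log aggregation.
        self.recently_stopped = dict(self.state_store)
        return set(self.state_store)


def simulate(second_completion_ms=700_000):
    """Return the container ids known to log aggregation after a restart."""
    nm = ToyNodeManager()
    nm.complete_container("container_1", 0)
    nm.complete_container("container_2", second_completion_ms)
    return nm.restart()
```

In this model, simulate() recovers only container_2: container_1 was pruned from the state store when container_2 completed, so nothing about it survives the restart.  Calling simulate(100_000) keeps both containers, because container_1 is still inside the tracking window when the pruning step runs.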

> Some containers can be skipped during log aggregation after NM restart
> ----------------------------------------------------------------------
>
>                 Key: YARN-4771
>                 URL: https://issues.apache.org/jira/browse/YARN-4771
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.7.2
>            Reporter: Jason Lowe
>
> A container can be skipped during log aggregation after a work-preserving nodemanager restart if the following events occur:
> # Container completes more than yarn.nodemanager.duration-to-track-stopped-containers milliseconds before the restart
> # At least one other container completes after the above container and before the restart



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)