Posted to yarn-issues@hadoop.apache.org by "Eric Badger (JIRA)" <ji...@apache.org> on 2018/07/10 22:26:00 UTC

[jira] [Updated] (YARN-8515) container-executor can crash with SIGPIPE after nodemanager restart

     [ https://issues.apache.org/jira/browse/YARN-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Badger updated YARN-8515:
------------------------------
    Labels: Docker  (was: )

> container-executor can crash with SIGPIPE after nodemanager restart
> -------------------------------------------------------------------
>
>                 Key: YARN-8515
>                 URL: https://issues.apache.org/jira/browse/YARN-8515
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jim Brennan
>            Assignee: Jim Brennan
>            Priority: Major
>              Labels: Docker
>
> When running with docker on large clusters, we have noticed that docker containers are sometimes not removed: they remain in the exited state, and the corresponding container-executor is no longer running. Upon investigation, we noticed that this always seemed to happen after a nodemanager restart. The sequence leading to the stranded docker containers is:
>  # Nodemanager restarts
>  # Containers are recovered and then run for a while
>  # Containers are killed for some (legitimate) reason
>  # Container-executor exits without removing the docker container.
> After reproducing this on a test cluster, we found that the container-executor was exiting due to a SIGPIPE.
> The shell command executor that is used to start container-executor has threads reading from c-e's stdout and stderr. When the NM is restarted, these threads are killed, so nothing is left reading from those pipes. When the container-executor then continues executing after the container exits with an error, it tries to write to stderr (ERRORFILE) and receives a SIGPIPE. Since SIGPIPE is not handled, the signal kills the container-executor before it can remove the docker container.
> We ran into this in branch 2.8. The way docker containers are removed has been completely redesigned in trunk, so I don't think trunk will hit this exact failure. However, after an NM restart, any write to stderr or stdout in the container-executor could potentially cause it to crash.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
