Posted to issues@mesos.apache.org by "Qian Zhang (JIRA)" <ji...@apache.org> on 2018/10/19 10:17:00 UTC

[jira] [Commented] (MESOS-9334) Container stuck at ISOLATING state due to libevent poll never returns

    [ https://issues.apache.org/jira/browse/MESOS-9334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656592#comment-16656592 ] 

Qian Zhang commented on MESOS-9334:
-----------------------------------

I added some logging to the Mesos agent and libprocess, and found that the CNI isolator may hang in two places when isolating a container:
 # Waiting for a CNI plugin (see [this code|https://github.com/apache/mesos/blob/1.7.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1327] for details). When the issue occurred, the CNI plugin had actually finished its job and exited, but the CNI isolator hung while reading the plugin's stdout/stderr.
 # Waiting for `NetworkCniIsolatorSetup` (see [this code|https://github.com/apache/mesos/blob/1.7.0/src/slave/containerizer/mesos/isolators/network/cni/cni.cpp#L1154] for details). Similarly, when the issue occurred, `NetworkCniIsolatorSetup` had actually finished its job and exited, but the CNI isolator hung while reading its stderr.

In both cases, I found that the [libevent poll|https://github.com/apache/mesos/blob/1.7.0/3rdparty/libprocess/src/posix/libevent/libevent_poll.cpp#L75] never returned for the fd used to read stdout/stderr, and there appeared to be no fd leak when the issue occurred.
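
For reference, the behavior the isolator relies on can be sketched outside Mesos. The snippet below is only an illustration, not Mesos or libprocess code: `collect_output` is a hypothetical helper, it uses Python's `select.poll` rather than libevent, and it assumes a POSIX system. It drains a child's stdout/stderr until EOF and raises instead of waiting forever if poll reports no event within a timeout, which is the symptom described above.

```python
import os
import select
import subprocess


def collect_output(argv, timeout_s=5.0):
    """Run argv, drain stdout/stderr until EOF, return (status, out, err).

    Hypothetical helper for illustration only. Raises TimeoutError if
    poll() reports no event on the remaining pipes within timeout_s.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out_fd, err_fd = proc.stdout.fileno(), proc.stderr.fileno()
    chunks = {out_fd: [], err_fd: []}

    # Watch both pipes at once so a full stderr pipe cannot block stdout.
    poller = select.poll()
    for fd in chunks:
        poller.register(fd, select.POLLIN | select.POLLHUP)

    pending = set(chunks)
    try:
        while pending:
            events = poller.poll(int(timeout_s * 1000))
            if not events:
                # Neither data nor EOF arrived: the stall is now visible.
                raise TimeoutError("no poll event on fds %s" % sorted(pending))
            for fd, _mask in events:
                data = os.read(fd, 4096)
                if data:
                    chunks[fd].append(data)
                else:
                    # An empty read means EOF: every write end is closed.
                    poller.unregister(fd)
                    pending.discard(fd)
        status = proc.wait()
    finally:
        # Clean up the child and the pipes on both success and failure.
        if proc.poll() is None:
            proc.kill()
            proc.wait()
        proc.stdout.close()
        proc.stderr.close()
    return status, b"".join(chunks[out_fd]), b"".join(chunks[err_fd])
```

Calling `collect_output(["sh", "-c", "echo out; echo err >&2"])` should return once both pipes reach EOF. The timeout only makes a stall observable; it does not explain why no event is delivered.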

> Container stuck at ISOLATING state due to libevent poll never returns
> ---------------------------------------------------------------------
>
>                 Key: MESOS-9334
>                 URL: https://issues.apache.org/jira/browse/MESOS-9334
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Qian Zhang
>            Priority: Critical
>
> We found that a UCR container may get stuck in the `ISOLATING` state:
>  
> {code:java}
> 2018-10-03 09:13:23: I1003 09:13:23.274561 2355 containerizer.cpp:3122] Transitioning the state of container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 from PREPARING to ISOLATING
> 2018-10-03 09:13:23: I1003 09:13:23.279223 2354 cni.cpp:962] Bind mounted '/proc/5244/ns/net' to '/run/mesos/isolators/network/cni/1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54/ns' for container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54
> 2018-10-03 09:23:22: I1003 09:23:22.879868 2354 containerizer.cpp:2459] Destroying container 1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54 in ISOLATING state
> {code}
>  
> In the above logs, container `1e5b8fc3-5c9e-4159-a0b9-3d46595a5b54` transitioned to `ISOLATING` at 09:13:23 but did not transition to any other state until it was destroyed due to the executor registration timeout (10 mins). The destroy can never complete, since it has to wait for the container to finish isolating.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)