Posted to issues@mesos.apache.org by "Andrei Budnik (JIRA)" <ji...@apache.org> on 2019/04/09 13:25:00 UTC

[jira] [Comment Edited] (MESOS-9709) Docker executor can become stuck terminating

    [ https://issues.apache.org/jira/browse/MESOS-9709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813335#comment-16813335 ] 

Andrei Budnik edited comment on MESOS-9709 at 4/9/19 1:24 PM:
--------------------------------------------------------------

This agent responds to polling of the `/state` endpoint, but hangs when `/containers` or `/__processes__` is polled.
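
The symptom can be checked by probing the three endpoints with a timeout; a minimal sketch (the agent address is a placeholder, the endpoint paths are the ones named above):
{code:python}
import urllib.request

AGENT = "http://127.0.0.1:5051"  # placeholder; use the real agent host:port

for path in ("/state", "/containers", "/__processes__"):
    try:
        with urllib.request.urlopen(AGENT + path, timeout=10) as resp:
            print(path, "->", resp.status)
    except Exception as e:
        # a timeout on /containers or /__processes__ is the hang seen here
        print(path, "->", e)
{code}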

GDB can't attach to the running agent; the attach attempt itself hangs.

top -H -p `pidof mesos-agent` shows that one thread is stuck in the D (uninterruptible sleep) state.
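
The same check can be scripted by walking procfs; a minimal sketch, assuming Linux and root privileges (reading a thread's kernel stack requires them):
{code:python}
import glob

PID = 12345  # placeholder; in practice `pidof mesos-agent` as above

for task in glob.glob(f"/proc/{PID}/task/*"):
    with open(task + "/stat") as f:
        # field 3 of /proc/<tid>/stat is the state; split after the
        # parenthesized comm, which may itself contain spaces
        state = f.read().rsplit(")", 1)[1].split()[0]
    if state == "D":  # uninterruptible sleep: blocked inside the kernel
        with open(task + "/stack") as f:  # kernel stack of the thread
            print(task.rsplit("/", 1)[1], "is in D state:\n" + f.read())
{code}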

Here is the kernel stack trace of the agent's hanging thread:
{code}
[<ffffffff895e20d2>] copy_net_ns+0xa2/0x180
[<ffffffff890c01b9>] create_new_namespaces+0xf9/0x180
[<ffffffff890c035e>] copy_namespaces+0x8e/0xd0
[<ffffffff8908f996>] copy_process+0xb66/0x1a40
[<ffffffff89090a21>] do_fork+0x91/0x320
[<ffffffff89090d36>] SyS_clone+0x16/0x20
[<ffffffff89720c14>] stub_clone+0x44/0x70
[<ffffffffffffffff>] 0xffffffffffffffff{code}
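
The trace shows the thread blocked in the kernel inside copy_net_ns, i.e. while cloning a child process into a new network namespace; on kernels of this vintage that path serializes on a global net-namespace mutex, so a wedged namespace cleanup elsewhere can block it indefinitely (the lock detail is an inference from the function names, not from the log). The same kernel function can be reached directly; a minimal sketch (Linux only, needs CAP_SYS_ADMIN, and Python 3.12+ for os.unshare; this only illustrates the syscall path, it is not the agent's code):
{code:python}
import os

# unshare(CLONE_NEWNET) ends up in copy_net_ns -- the function at the
# top of the stack above (the agent got there via clone(2) instead).
# If the namespace setup path is blocked, this call hangs in D state.
os.unshare(os.CLONE_NEWNET)
print("new network namespace created")
{code}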


> Docker executor can become stuck terminating
> --------------------------------------------
>
>                 Key: MESOS-9709
>                 URL: https://issues.apache.org/jira/browse/MESOS-9709
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.8.0
>            Reporter: Greg Mann
>            Priority: Major
>              Labels: containerization, mesosphere
>         Attachments: docker-executor-stuck.txt
>
>
> See attached agent log; the executor container ID is {{d2bfec33-f6bd-44ee-9345-b5710780bb59}} and the executor ID contains the string {{819f7ef7-4f42-11e9-a566-72ec67496045}}.
> After launching the executor, we see
> {code}
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re mesos-agent[10238]: I0329 18:23:36.967316 10257 slave.cpp:3550] Launching container d2bfec33-f6bd-44ee-9345-b5710780bb59 for executor 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-0000
> Mar 29 18:23:36 int-mountvolumeagent9-soak113s.testing.mesosphe.re mesos-agent[10238]: I0329 18:23:36.968968 10253 docker.cpp:1161] No container info found, skipping launch
> {code}
> I'm not sure why the container info was not set. Once the executor reregistration timeout elapses, the agent attempts to terminate the executor, but it does not appear to succeed. The scheduler keeps retrying the kill of the task, and we repeatedly see the following (a sketch of the guard behind this message follows the quoted description):
> {code}
> Mar 29 18:35:19 int-mountvolumeagent9-soak113s.testing.mesosphe.re mesos-agent[10238]: W0329 18:35:19.855063 10253 slave.cpp:3823] Ignoring kill task datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339 because the executor 'datastax-dse.instance-819f7ef7-4f42-11e9-a566-72ec67496045._app.339' of framework a221eeb3-b9c0-4e92-ae20-1e1d4af25321-0000 is terminating
> {code}
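
That log line suggests an agent-side guard of roughly this shape; a hypothetical Python sketch for illustration only (the real logic lives in the agent's C++ sources): once the executor is marked terminating, every retried kill is dropped, so a termination that never finishes leaves the task unkillable.
{code:python}
from dataclasses import dataclass
from enum import Enum, auto

class ExecutorState(Enum):
    RUNNING = auto()
    TERMINATING = auto()

@dataclass
class Executor:
    id: str
    state: ExecutorState

def handle_kill_task(executor: Executor, task_id: str) -> None:
    # Hypothetical guard, not Mesos source: kills are ignored while the
    # executor is already being torn down.
    if executor.state is ExecutorState.TERMINATING:
        print(f"Ignoring kill task {task_id} because the executor "
              f"'{executor.id}' is terminating")
        return
    # ...normal kill path would go here...

e = Executor("datastax-dse.instance-..._app.339", ExecutorState.TERMINATING)
handle_kill_task(e, "datastax-dse.instance-..._app.339")
{code}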



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)