Posted to issues@mesos.apache.org by "Qian Zhang (JIRA)" <ji...@apache.org> on 2018/08/20 01:37:00 UTC

[jira] [Comment Edited] (MESOS-9131) Health checks launching nested containers while a container is being destroyed lead to unkillable tasks

    [ https://issues.apache.org/jira/browse/MESOS-9131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16585328#comment-16585328 ] 

Qian Zhang edited comment on MESOS-9131 at 8/20/18 1:36 AM:
------------------------------------------------------------

I found a way to reliably reproduce this issue:

1. Start Mesos master.
{code:java}
$ sudo ./bin/mesos-master.sh --work_dir=/opt/mesos --ip=192.168.56.5
{code}
2. Start Mesos agent.
{code:java}
$ sudo ./bin/mesos-slave.sh --master=192.168.56.5:5050 --containerizers=mesos,docker --image_providers=docker --isolation=filesystem/linux,docker/runtime,network/cni,cgroups/cpu,cgroups/mem --network_cni_config_dir=/opt/cni/net_configs --network_cni_plugins_dir=/opt/cni/plugins --work_dir=/opt/mesos --docker_store_dir=/opt/mesos/store/docker --executor_registration_timeout=60mins --ip=192.168.56.5 --resources='cpus:2;mem:4096;disk:10240'
{code}
3. Launch a task group which has one task.
{code:java}
$ cat task_group.json 
{
  "tasks":[
    {
      "name" : "test1",
      "task_id" : {"value" : "test1"},
      "agent_id": {"value" : ""},
      "resources": [
        {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
        {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
      ],
      "command": {
        "value": "sleep 1000"
      }
    }
  ]
}

$ mesos-execute --master=192.168.56.5:5050 --task_group=file:///home/stack/workspace/config/task_group.json
I0820 09:06:03.655900 16130 scheduler.cpp:188] Version: 1.5.1
I0820 09:06:03.738003 16146 scheduler.cpp:311] Using default 'basic' HTTP authenticatee
I0820 09:06:03.739403 16150 scheduler.cpp:494] New master detected at master@192.168.56.5:5050
Subscribed with ID 9a871ea9-68aa-40ad-ae2d-f77cab3b63c0-0000
Submitted task group with tasks [ test1 ] to agent '9a871ea9-68aa-40ad-ae2d-f77cab3b63c0-S0'
Received status update TASK_STARTING for task 'test1'
source: SOURCE_EXECUTOR
Received status update TASK_RUNNING for task 'test1'
source: SOURCE_EXECUTOR
{code}
4. Run the `dcos` command below to kill the process of the nested container (task) launched in step 3. The issue is not reproduced if the `-t` option is specified in the command below, or if `-i` is not specified. One way to find the PID of the container process is sketched after the command.
{code:java}
$ dcos task exec -i test1 bash
kill -9 <PID of the container process>
{code}
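For reference, here is one way to find the PID of the container process from inside the exec session. This assumes the task command is `sleep 1000` as in the task group JSON above; any equivalent method works.
{code:java}
# Run inside the `dcos task exec -i test1 bash` session.
# Assumption: the task command is `sleep 1000`, so the process can be
# matched on its command line.
$ pgrep -f 'sleep 1000'
{code}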
Now in the agent log, we can see that the nested container launched in step 3 has exited (since its process was killed by the `dcos` command in step 4), and that the agent then tried to destroy its child container (i.e., the nested container launched in step 4 by the `dcos` command), but the destroy never completes. As a result, the nested container launched in step 3 is stuck in the `DESTROYING` state, and the `mesos-execute` command launched in step 3 never returns, i.e., from its point of view the task is still in the `TASK_RUNNING` status.
{code:java}
I0820 09:09:33.072842 16004 slave.cpp:6865] Current disk usage 40.20%. Max allowed age: 3.486303275117465days
I0820 09:09:37.069425 16004 containerizer.cpp:2807] Container 95aa3e70-4f1d-42f0-93bb-3963d63126b8.8b3080cc-200b-4484-9661-fb2025668dbe has exited
I0820 09:09:37.069540 16004 containerizer.cpp:2354] Destroying container 95aa3e70-4f1d-42f0-93bb-3963d63126b8.8b3080cc-200b-4484-9661-fb2025668dbe in RUNNING state
I0820 09:09:37.069591 16004 containerizer.cpp:2968] Transitioning the state of container 95aa3e70-4f1d-42f0-93bb-3963d63126b8.8b3080cc-200b-4484-9661-fb2025668dbe from RUNNING to DESTROYING
I0820 09:09:37.071295 16004 linux_launcher.cpp:514] Asked to destroy container 95aa3e70-4f1d-42f0-93bb-3963d63126b8.8b3080cc-200b-4484-9661-fb2025668dbe.ba5e905c-af5f-41ea-abf4-9b197cabf8f1
I0820 09:09:37.073197 16004 linux_launcher.cpp:560] Using freezer to destroy cgroup mesos/95aa3e70-4f1d-42f0-93bb-3963d63126b8/mesos/8b3080cc-200b-4484-9661-fb2025668dbe/mesos/ba5e905c-af5f-41ea-abf4-9b197cabf8f1
I0820 09:09:37.076241 16005 cgroups.cpp:3060] Freezing cgroup /sys/fs/cgroup/freezer/mesos/95aa3e70-4f1d-42f0-93bb-3963d63126b8/mesos/8b3080cc-200b-4484-9661-fb2025668dbe/mesos/ba5e905c-af5f-41ea-abf4-9b197cabf8f1
I0820 09:09:37.079506 16002 cgroups.cpp:1415] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/95aa3e70-4f1d-42f0-93bb-3963d63126b8/mesos/8b3080cc-200b-4484-9661-fb2025668dbe/mesos/ba5e905c-af5f-41ea-abf4-9b197cabf8f1 after 3.18976ms
I0820 09:09:37.083220 16003 cgroups.cpp:3078] Thawing cgroup /sys/fs/cgroup/freezer/mesos/95aa3e70-4f1d-42f0-93bb-3963d63126b8/mesos/8b3080cc-200b-4484-9661-fb2025668dbe/mesos/ba5e905c-af5f-41ea-abf4-9b197cabf8f1
I0820 09:09:37.086514 15999 cgroups.cpp:1444] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/95aa3e70-4f1d-42f0-93bb-3963d63126b8/mesos/8b3080cc-200b-4484-9661-fb2025668dbe/mesos/ba5e905c-af5f-41ea-abf4-9b197cabf8f1 after 3.170048ms
I0820 09:09:42.178274 16007 switchboard.cpp:789] Sending SIGTERM to I/O switchboard server (pid: 16551) since container 95aa3e70-4f1d-42f0-93bb-3963d63126b8.8b3080cc-200b-4484-9661-fb2025668dbe.ba5e905c-af5f-41ea-abf4-9b197cabf8f1 is being destroyed
{code}
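To dig further into why the destroy hangs, it can help to check on the agent whether the pieces mentioned in the log above are still around, e.g. the I/O switchboard server process and the nested container's freezer cgroup. The PID and cgroup path below are taken from the log lines above; this is just a sketch of what one might check, not a root-cause analysis.
{code:java}
# Is the I/O switchboard server (pid 16551 per the log above) still running?
$ ps -p 16551 -o pid,ppid,stat,cmd
# Does the nested container's freezer cgroup still exist, and in what state?
$ cat /sys/fs/cgroup/freezer/mesos/95aa3e70-4f1d-42f0-93bb-3963d63126b8/mesos/8b3080cc-200b-4484-9661-fb2025668dbe/mesos/ba5e905c-af5f-41ea-abf4-9b197cabf8f1/freezer.state
{code}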
And in the agent's runtime directory, we can see that all of the containers' directories are still there; none of them has been removed.
{code:java}
/var/run/mesos
├── containers
│   └── 95aa3e70-4f1d-42f0-93bb-3963d63126b8
│       ├── config
│       ├── containers
│       │   └── 8b3080cc-200b-4484-9661-fb2025668dbe
│       │       ├── config
│       │       ├── containers
│       │       │   └── ba5e905c-af5f-41ea-abf4-9b197cabf8f1
│       │       │       ├── config
│       │       │       ├── force_destroy_on_recovery
│       │       │       ├── io_switchboard
│       │       │       │   ├── pid
│       │       │       │   └── socket
│       │       │       ├── launch_info
│       │       │       ├── pid
│       │       │       └── status
│       │       ├── launch_info
│       │       ├── pid
│       │       └── status
│       ├── launch_info
│       ├── pid
│       └── status
{code}
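The task state can also be double-checked from the master side, for example via the master's `/tasks` endpoint (host and port per the setup in step 1; piping through `python -m json.tool` is optional and only for readability).
{code:java}
# Query the master for all known tasks; per the behavior described above,
# 'test1' is still reported with "state": "TASK_RUNNING" even though its
# container is stuck in DESTROYING on the agent.
$ curl -s http://192.168.56.5:5050/tasks | python -m json.tool
{code}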
 


> Health checks launching nested containers while a container is being destroyed lead to unkillable tasks
> -------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9131
>                 URL: https://issues.apache.org/jira/browse/MESOS-9131
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, containerization
>    Affects Versions: 1.5.1
>            Reporter: Jan Schlicht
>            Assignee: Qian Zhang
>            Priority: Blocker
>              Labels: container-stuck
>
> A container might get stuck in {{DESTROYING}} state if there's a command health check that starts new nested containers while its parent container is getting destroyed.
> Here are some logs with unrelated lines removed. The `REMOVE_NESTED_CONTAINER`/`LAUNCH_NESTED_CONTAINER_SESSION` calls keep looping afterwards.
> {noformat}
> 2018-04-16 12:37:54: I0416 12:37:54.235877  3863 containerizer.cpp:2807] Container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 has exited
> 2018-04-16 12:37:54: I0416 12:37:54.235914  3863 containerizer.cpp:2354] Destroying container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 in RUNNING state
> 2018-04-16 12:37:54: I0416 12:37:54.235932  3863 containerizer.cpp:2968] Transitioning the state of container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 from RUNNING to DESTROYING
> 2018-04-16 12:37:54: I0416 12:37:54.236100  3852 linux_launcher.cpp:514] Asked to destroy container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.237671  3852 linux_launcher.cpp:560] Using freezer to destroy cgroup mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.240327  3852 cgroups.cpp:3060] Freezing cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.244179  3852 cgroups.cpp:1415] Successfully froze cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 after 3.814144ms
> 2018-04-16 12:37:54: I0416 12:37:54.250550  3853 cgroups.cpp:3078] Thawing cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11
> 2018-04-16 12:37:54: I0416 12:37:54.256599  3853 cgroups.cpp:1444] Successfully thawed cgroup /sys/fs/cgroup/freezer/mesos/db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3/mesos/0e44d4d7-629f-41f1-80df-4aae9583d133/mesos/e6e01854-40a0-4da3-b458-2b4cf52bbc11 after 5.977856ms
> ...
> 2018-04-16 12:37:54: I0416 12:37:54.371117  3837 http.cpp:3502] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd'
> 2018-04-16 12:37:54: W0416 12:37:54.371692  3842 http.cpp:2758] Failed to launch container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd: Parent container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is in 'DESTROYING' state
> 2018-04-16 12:37:54: W0416 12:37:54.371826  3840 containerizer.cpp:2337] Attempted to destroy unknown container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.2bfd8eed-b528-493b-8434-04311e453dcd
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.504456  3856 http.cpp:3078] Processing REMOVE_NESTED_CONTAINER call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-f3a1238c-7f0f-4db3-bda4-c0ea951d46b6'
> ...
> 2018-04-16 12:37:55: I0416 12:37:55.556367  3857 http.cpp:3502] Processing LAUNCH_NESTED_CONTAINER_SESSION call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211'
> ...
> 2018-04-16 12:37:55: W0416 12:37:55.582137  3850 http.cpp:2758] Failed to launch container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211: Parent container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133 is in 'DESTROYING' state
> ...
> 2018-04-16 12:37:55: W0416 12:37:55.583330  3844 containerizer.cpp:2337] Attempted to destroy unknown container db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133.check-0db8bd89-6f19-48c6-a69f-40196b4bc211
> ...
> {noformat}
> This stops when the framework reconciles and instructs Mesos to kill the task, which also results in a
> {noformat}
> 2018-04-16 13:06:04: I0416 13:06:04.161623  3843 http.cpp:2966] Processing KILL_NESTED_CONTAINER call for container 'db1c0ab0-3b73-453b-b2b5-a8fc8e1d0ae3.0e44d4d7-629f-41f1-80df-4aae9583d133'
> {noformat}
> Nothing else related to this container is logged following this line.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)