Posted to issues@mesos.apache.org by "Qian Zhang (JIRA)" <ji...@apache.org> on 2018/06/26 02:42:00 UTC

[jira] [Assigned] (MESOS-9025) The container which joins CNI network and has checkpoint enabled will be mistakenly destroyed by agent

     [ https://issues.apache.org/jira/browse/MESOS-9025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Qian Zhang reassigned MESOS-9025:
---------------------------------

    Assignee: Jie Yu

> The container which joins CNI network and has checkpoint enabled will be mistakenly destroyed by agent
> ------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9025
>                 URL: https://issues.apache.org/jira/browse/MESOS-9025
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.6.0
>            Reporter: Qian Zhang
>            Assignee: Jie Yu
>            Priority: Blocker
>              Labels: cni
>
> Steps to reproduce:
> 1) Run {{mesos-execute}} to launch a task that joins the CNI network {{net1}} and has checkpointing enabled:
> {code:java}
> $ cat task_cni.json
> {
>   "name": "test1",
>   "task_id": {"value" : "test1"},
>   "agent_id": {"value" : ""},
>   "resources": [
>     {"name": "cpus", "type": "SCALAR", "scalar": {"value": 0.1}},
>     {"name": "mem", "type": "SCALAR", "scalar": {"value": 32}}
>   ],
>   "command": {
>     "value": "sleep 1000"
>   },
>   "container": {
>     "type": "MESOS",
>     "network_infos": [
>       {
>         "name": "net1"
>       }
>     ]
>   }
> }
> $ mesos-execute --master=192.168.56.5:5050 --task=file:///home/stack/workspace/config/task_cni.json --checkpoint
> {code}
> 2) After the task reaches the {{TASK_RUNNING}} state, restart the agent process. The agent log then shows that the container is destroyed:
> {code:java}
> ...
> I0622 17:30:00.792310  7426 containerizer.cpp:1024] Recovering isolators
> I0622 17:30:00.798740  7430 cni.cpp:437] Removing unknown orphaned container faf69105-e76f-49c7-8e56-964c2f882cff
> ...
> I0622 17:30:01.025600  7433 cni.cpp:1546] Unmounted the network namespace handle '/run/mesos/isolators/network/cni/faf69105-e76f-49c7-8e56-964c2f882cff/ns' for container faf69105-e76f-49c7-8e56-964c2f882cff
> I0622 17:30:01.026211  7433 cni.cpp:1557] Removed the container directory '/run/mesos/isolators/network/cni/faf69105-e76f-49c7-8e56-964c2f882cff'
> I0622 17:30:02.935093  7429 slave.cpp:5215] Cleaning up un-reregistered executors
> I0622 17:30:02.935221  7429 slave.cpp:5233] Killing un-reregistered executor 'test1' of framework dc2b3db0-953c-47a4-8fd4-f6d040e9d10e-0002 at executor(1)@192.168.11.7:33719
> I0622 17:30:02.935900  7429 slave.cpp:7311] Finished recovery
> I0622 17:30:02.937409  7427 containerizer.cpp:2405] Destroying container faf69105-e76f-49c7-8e56-964c2f882cff in RUNNING state
> {code}
> {{mesos-execute}} then receives a {{TASK_GONE}} status update for the task:
> {code:java}
> $ mesos-execute --master=192.168.56.5:5050 --task=file:///home/stack/workspace/config/task_cni.json --checkpoint
> I0622 17:29:50.538630  7246 scheduler.cpp:189] Version: 1.7.0
> I0622 17:29:50.548589  7261 scheduler.cpp:355] Using default 'basic' HTTP authenticatee
> I0622 17:29:50.550348  7263 scheduler.cpp:538] New master detected at master@192.168.56.5:5050
> Subscribed with ID dc2b3db0-953c-47a4-8fd4-f6d040e9d10e-0002
> Submitted task 'test1' to agent 'dc2b3db0-953c-47a4-8fd4-f6d040e9d10e-S0'
> Received status update TASK_STARTING for task 'test1'
>   source: SOURCE_EXECUTOR
> Received status update TASK_RUNNING for task 'test1'
>   source: SOURCE_EXECUTOR
> Received status update TASK_GONE for task 'test1'
>   message: 'Executor did not reregister within 2secs'
>   source: SOURCE_AGENT
>   reason: REASON_EXECUTOR_REREGISTRATION_TIMEOUT
> {code}
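
To check an agent log for the symptom described above, the two telltale messages can be matched by container ID. The following is a minimal sketch, not part of Mesos: the function name {{find_mistakenly_destroyed}} is hypothetical, and the patterns are copied from the log lines quoted in this report, so other Mesos versions may word these messages differently.

```python
import re

# Patterns copied from the agent log lines quoted in this report.
# Assumption: the message wording is stable; other versions may differ.
ORPHAN_RE = re.compile(r"Removing unknown orphaned container (\S+)")
DESTROY_RE = re.compile(r"Destroying container (\S+) in RUNNING state")


def find_mistakenly_destroyed(log_text):
    """Return IDs of containers that the CNI isolator reported as unknown
    orphans and that were later destroyed while in the RUNNING state.

    Hypothetical helper for illustration only.
    """
    orphaned = set()
    destroyed = []
    for line in log_text.splitlines():
        match = ORPHAN_RE.search(line)
        if match:
            orphaned.add(match.group(1))
            continue
        match = DESTROY_RE.search(line)
        if match and match.group(1) in orphaned:
            destroyed.append(match.group(1))
    return destroyed


if __name__ == "__main__":
    import sys

    # Usage: python check_log.py < agent.log
    for container_id in find_mistakenly_destroyed(sys.stdin.read()):
        print(container_id)
```

Run against the agent log from step 2, this should report container faf69105-e76f-49c7-8e56-964c2f882cff, since both messages appear for that ID.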



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)