Posted to issues@mesos.apache.org by "Chris Fortier (JIRA)" <ji...@apache.org> on 2015/10/27 22:54:27 UTC

[jira] [Updated] (MESOS-3808) slave/containerizer/docker leaves orphan containers on restart of mesos-slave

     [ https://issues.apache.org/jira/browse/MESOS-3808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chris Fortier updated MESOS-3808:
---------------------------------
    Description: 
We attempted to upgrade from Mesos 0.23 to 0.25 but noticed that Docker containers launched by Mesos were being orphaned and not destroyed when the Mesos agent was restarted.

Relevant log output:

{noformat}
I1027 20:36:22.343880 23004 docker.cpp:535] Recovering Docker containers
I1027 20:36:22.517032 23008 docker.cpp:639] Recovering container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' for executor 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000
I1027 20:36:22.517467 23008 docker.cpp:639] Recovering container '77b1748e-f295-4eb5-9966-d7a3bba2fc31' for executor 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000
I1027 20:36:22.517817 23007 slave.cpp:4051] Sending reconnect request to executor ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:40596
I1027 20:36:22.518033 23007 slave.cpp:4051] Sending reconnect request to executor ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:57469
I1027 20:36:22.518038 23008 docker.cpp:1592] Executor for container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' has exited
E1027 20:36:22.518070 23010 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107]
I1027 20:36:22.518084 23008 docker.cpp:1390] Destroying container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
I1027 20:36:22.518282 23008 docker.cpp:1592] Executor for container '77b1748e-f295-4eb5-9966-d7a3bba2fc31' has exited
I1027 20:36:22.518324 23008 docker.cpp:1390] Destroying container '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
E1027 20:36:22.518357 23010 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107]
I1027 20:36:22.518360 23008 docker.cpp:1494] Running docker stop on container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
I1027 20:36:22.518489 23008 docker.cpp:1494] Running docker stop on container '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
I1027 20:36:22.518592 23005 slave.cpp:3433] Executor 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
I1027 20:36:22.519127 23005 slave.cpp:2717] Handling status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
I1027 20:36:22.519263 23005 slave.cpp:3433] Executor 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
I1027 20:36:22.519300 23005 slave.cpp:2717] Handling status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
W1027 20:36:22.519498 23003 docker.cpp:1002] Ignoring updating unknown container: a2308dfc-ec2f-4687-ae92-f045dd2d3614
W1027 20:36:22.519611 23003 docker.cpp:1002] Ignoring updating unknown container: 77b1748e-f295-4eb5-9966-d7a3bba2fc31
I1027 20:36:22.519691 23003 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
I1027 20:36:22.519755 23003 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
I1027 20:36:22.525867 23003 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
I1027 20:36:22.525907 23003 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
W1027 20:36:22.526645 23009 slave.cpp:2968] Dropping status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 sent by status update manager because the slave is in RECOVERING state
W1027 20:36:22.529747 23007 slave.cpp:2968] Dropping status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 sent by status update manager because the slave is in RECOVERING state
I1027 20:36:24.518846 23004 slave.cpp:2666] Cleaning up un-reregistered executors
I1027 20:36:24.519011 23004 slave.cpp:4110] Finished recovery
{noformat}

Docker output:
{noformat}
CONTAINER ID        IMAGE                             COMMAND                CREATED              STATUS              PORTS               NAMES
8d0d69fe34d7        libmesos/ubuntu                   "/bin/sh -c 'while s   About a minute ago   Up About a minute                       mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a1492e45-2fce-4ca4-bd16-edcef439ca31
e4344cfbcc6d        libmesos/ubuntu                   "/bin/sh -c 'while s   About a minute ago   Up About a minute                       mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.c3624e67-7a27-4309-8aa4-365d3fd1bfe2
3ce690f3b872        libmesos/ubuntu                   "/bin/sh -c 'while s   4 minutes ago        Up 4 minutes                            mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a2308dfc-ec2f-4687-ae92-f045dd2d3614
5b4546d3087a        libmesos/ubuntu                   "/bin/sh -c 'while s   4 minutes ago        Up 4 minutes                            mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.77b1748e-f295-4eb5-9966-d7a3bba2fc31
{noformat}

After digging into the issue, it seems the comment linked below might be the problem.
https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L97


It appears that the recovery command is still only sending the containerId and not the frameworkId + containerId.
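
The mismatch between the log and the Docker output above can be cross-checked mechanically. The following is a minimal Python sketch, not Mesos code: it assumes the NAMES column follows the layout `mesos-<agent id>.<container id>`, which is inferred only from the Docker output above (the prefix ending in `-S14` looks like an agent ID rather than the framework ID that appears in the log). The function names `stopped_ids` and `still_running` are hypothetical.

```python
import re

# Assumed name layout, inferred from the `docker ps` NAMES column in this report:
#   mesos-<agent id>.<container id>
NAME_RE = re.compile(r"^mesos-(?P<agent>.+)\.(?P<container>[0-9a-f-]{36})$")

# Matches the agent log line emitted when it tries to stop a container.
STOP_RE = re.compile(r"Running docker stop on container '([0-9a-f-]{36})'")


def stopped_ids(log_lines):
    """Return the container IDs the agent log says it ran `docker stop` on."""
    ids = set()
    for line in log_lines:
        match = STOP_RE.search(line)
        if match:
            ids.add(match.group(1))
    return ids


def still_running(docker_names, log_lines):
    """Return running container names whose ID the agent already tried to stop."""
    stopped = stopped_ids(log_lines)
    orphans = []
    for name in docker_names:
        match = NAME_RE.match(name)
        if match and match.group("container") in stopped:
            orphans.append(name)
    return orphans
```

Fed the agent log and the NAMES column from this report, `still_running` would flag the two containers created 4 minutes ago (`...a2308dfc-ec2f-4687-ae92-f045dd2d3614` and `...77b1748e-f295-4eb5-9966-d7a3bba2fc31`): the log records a `docker stop` for both IDs, yet both are still listed as up.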

  was:
We attempted to upgrade from Mesos 0.23 to 0.25 but noticed that Docker containers launched by Mesos were being orphaned and not destroyed when the Mesos agent was restarted.

Relevant log output:

```
I1027 20:36:22.343880 23004 docker.cpp:535] Recovering Docker containers
I1027 20:36:22.517032 23008 docker.cpp:639] Recovering container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' for executor 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000
I1027 20:36:22.517467 23008 docker.cpp:639] Recovering container '77b1748e-f295-4eb5-9966-d7a3bba2fc31' for executor 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000
I1027 20:36:22.517817 23007 slave.cpp:4051] Sending reconnect request to executor ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:40596
I1027 20:36:22.518033 23007 slave.cpp:4051] Sending reconnect request to executor ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 at executor(1)@10.131.100.57:57469
I1027 20:36:22.518038 23008 docker.cpp:1592] Executor for container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614' has exited
E1027 20:36:22.518070 23010 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107]
I1027 20:36:22.518084 23008 docker.cpp:1390] Destroying container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
I1027 20:36:22.518282 23008 docker.cpp:1592] Executor for container '77b1748e-f295-4eb5-9966-d7a3bba2fc31' has exited
I1027 20:36:22.518324 23008 docker.cpp:1390] Destroying container '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
E1027 20:36:22.518357 23010 socket.hpp:174] Shutdown failed on fd=13: Transport endpoint is not connected [107]
I1027 20:36:22.518360 23008 docker.cpp:1494] Running docker stop on container 'a2308dfc-ec2f-4687-ae92-f045dd2d3614'
I1027 20:36:22.518489 23008 docker.cpp:1494] Running docker stop on container '77b1748e-f295-4eb5-9966-d7a3bba2fc31'
I1027 20:36:22.518592 23005 slave.cpp:3433] Executor 'ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
I1027 20:36:22.519127 23005 slave.cpp:2717] Handling status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
I1027 20:36:22.519263 23005 slave.cpp:3433] Executor 'ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db' of framework 20151016-161150-1902412554-5050-1-0000 has terminated with unknown status
I1027 20:36:22.519300 23005 slave.cpp:2717] Handling status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 from @0.0.0.0:0
W1027 20:36:22.519498 23003 docker.cpp:1002] Ignoring updating unknown container: a2308dfc-ec2f-4687-ae92-f045dd2d3614
W1027 20:36:22.519611 23003 docker.cpp:1002] Ignoring updating unknown container: 77b1748e-f295-4eb5-9966-d7a3bba2fc31
I1027 20:36:22.519691 23003 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
I1027 20:36:22.519755 23003 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
I1027 20:36:22.525867 23003 status_update_manager.cpp:322] Received status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
I1027 20:36:22.525907 23003 status_update_manager.cpp:826] Checkpointing UPDATE for status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000
W1027 20:36:22.526645 23009 slave.cpp:2968] Dropping status update TASK_LOST (UUID: b07be363-433f-4a11-8c81-1f5787debc76) for task ubuntu.059ced51-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 sent by status update manager because the slave is in RECOVERING state
W1027 20:36:22.529747 23007 slave.cpp:2968] Dropping status update TASK_LOST (UUID: 6a687305-78fc-48ec-b49a-8aeb4b42b3ac) for task ubuntu.059d1462-7cea-11e5-a442-1ac2f22f38db of framework 20151016-161150-1902412554-5050-1-0000 sent by status update manager because the slave is in RECOVERING state
I1027 20:36:24.518846 23004 slave.cpp:2666] Cleaning up un-reregistered executors
I1027 20:36:24.519011 23004 slave.cpp:4110] Finished recovery
```

Docker output:
```
CONTAINER ID        IMAGE                             COMMAND                CREATED              STATUS              PORTS               NAMES
8d0d69fe34d7        libmesos/ubuntu                   "/bin/sh -c 'while s   About a minute ago   Up About a minute                       mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a1492e45-2fce-4ca4-bd16-edcef439ca31
e4344cfbcc6d        libmesos/ubuntu                   "/bin/sh -c 'while s   About a minute ago   Up About a minute                       mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.c3624e67-7a27-4309-8aa4-365d3fd1bfe2
3ce690f3b872        libmesos/ubuntu                   "/bin/sh -c 'while s   4 minutes ago        Up 4 minutes                            mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.a2308dfc-ec2f-4687-ae92-f045dd2d3614
5b4546d3087a        libmesos/ubuntu                   "/bin/sh -c 'while s   4 minutes ago        Up 4 minutes                            mesos-bc7d28c1-81cd-4dfe-8c53-afa8fdfeb472-S14.77b1748e-f295-4eb5-9966-d7a3bba2fc31
```

After digging into the issue, it seems the comment linked below might be the problem.
https://github.com/apache/mesos/blob/master/src/slave/containerizer/docker.cpp#L97


It appears that the recovery command is still only sending the containerId and not the frameworkId + containerId.


> slave/containerizer/docker leaves orphan containers on restart of mesos-slave
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-3808
>                 URL: https://issues.apache.org/jira/browse/MESOS-3808
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization, docker, slave
>    Affects Versions: 0.25.0
>         Environment: CoreOS. Running mesos-slave in a container.
>            Reporter: Chris Fortier
>   Original Estimate: 4h
>  Remaining Estimate: 4h



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)