Posted to issues@mesos.apache.org by "Owen Smith (JIRA)" <ji...@apache.org> on 2015/05/14 11:51:00 UTC

[jira] [Commented] (MESOS-2276) Mesos-slave refuses to startup with many stopped docker containers

    [ https://issues.apache.org/jira/browse/MESOS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14543439#comment-14543439 ] 

Owen Smith commented on MESOS-2276:
-----------------------------------

I've experienced this same issue, and both debugging it and figuring out a course of action were pretty tough.

> (we have an app which crashes on startup right now, retrying to restart every few seconds)

Yup, that was the trigger for us too. When using frameworks like Marathon, it's pretty easy for someone to accidentally create a situation like this while developing.

For others' benefit, it's not always _just_ mesos at fault here. With enough dead containers, there can be additional complications from docker itself. For example, [~sivaramsk] I think I saw the same thing you did (although regretfully I didn't check the lsof counts). We attributed it to our use of devicemapper as the docker storage driver, based on some nasty docker+devicemapper issues we'd seen previously. We ended up needing to restart the affected machines :-/ (and switched to aufs for docker while we were at it)
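
In case it helps anyone else triage this, here's a quick way to gauge how bad things are: count the exited containers the slave will try to inspect on recovery, and check how close the slave sits to its fd limit. A rough sketch, assuming the process is named mesos-slave on a stock install:

# Number of exited containers the slave will inspect during recovery
docker ps -a -q --filter status=exited | wc -l

# fds the slave currently holds, and its configured limit
ls /proc/$(pidof mesos-slave)/fd | wc -l
grep 'open files' /proc/$(pidof mesos-slave)/limits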

> Mesos-slave refuses to startup with many stopped docker containers
> ------------------------------------------------------------------
>
>                 Key: MESOS-2276
>                 URL: https://issues.apache.org/jira/browse/MESOS-2276
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker, slave
>    Affects Versions: 0.21.0, 0.21.1
>         Environment: Ubuntu 14.04LTS, Mesosphere packages
>            Reporter: Dr. Stefan Schimanski
>
> The mesos-slave is launched as
> # /usr/local/sbin/mesos-slave --master=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/mesos --ip=10.0.0.2 --log_dir=/var/log/mesos --attributes=node_id:srv002 --checkpoint --containerizers=docker --executor_registration_timeout=5mins --logging_level=INFO
> giving this output:
> I0127 19:26:32.674113 19880 logging.cpp:172] INFO level logging started!
> I0127 19:26:32.674741 19880 main.cpp:142] Build: 2014-11-22 05:29:57 by root
> I0127 19:26:32.674774 19880 main.cpp:144] Version: 0.21.0
> I0127 19:26:32.674799 19880 main.cpp:147] Git tag: 0.21.0
> I0127 19:26:32.674824 19880 main.cpp:151] Git SHA: ab8fa655d34e8e15a4290422df38a18db1c09b5b
> I0127 19:26:32.786731 19880 main.cpp:165] Starting Mesos slave
> 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
> 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@716: Client environment:host.name=srv002
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-44-generic
> 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@725: Client environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@733: Client environment:user.name=root
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@741: Client environment:user.home=/root
> 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@753: Client environment:user.dir=/root
> 2015-01-27 19:26:32,789:19880(0x7fcf0cf9f700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 sessionTimeout=10000 watcher=0x7fcf13592a0a sessionId=0 sessionPasswd=<null> context=0x7fceec0009e0 flags=0
> I0127 19:26:32.796588 19880 slave.cpp:169] Slave started on 1)@10.0.0.2:5051
> I0127 19:26:32.797345 19880 slave.cpp:289] Slave resources: cpus(*):8; mem(*):6960; disk(*):246731; ports(*):[31000-32000]
> I0127 19:26:32.798017 19880 slave.cpp:318] Slave hostname: srv002
> I0127 19:26:32.798076 19880 slave.cpp:319] Slave checkpoint: true
> 2015-01-27 19:26:32,800:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1703: initiated connection to server [10.0.0.1:2181]
> I0127 19:26:32.808229 19886 state.cpp:33] Recovering state from '/tmp/mesos/meta'
> I0127 19:26:32.809090 19882 status_update_manager.cpp:197] Recovering status update manager
> I0127 19:26:32.809677 19887 docker.cpp:767] Recovering Docker containers
> 2015-01-27 19:26:32,821:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1750: session establishment complete on server [10.0.0.1:2181], sessionId=0x14b2adf7a560106, negotiated timeout=10000
> I0127 19:26:32.823292 19885 group.cpp:313] Group process (group(1)@10.0.0.2:5051) connected to ZooKeeper
> I0127 19:26:32.823443 19885 group.cpp:790] Syncing group operations: queue size (joins, cancels, datas) = (0, 0, 0)
> I0127 19:26:32.823484 19885 group.cpp:385] Trying to create path '/mesos' in ZooKeeper
> I0127 19:26:32.829711 19882 detector.cpp:138] Detected a new leader: (id='143')
> I0127 19:26:32.830559 19882 group.cpp:659] Trying to get '/mesos/info_0000000143' in ZooKeeper
> I0127 19:26:32.837913 19886 detector.cpp:433] A new leading master (UPID=master@10.0.0.1:5050) is detected
> Failed to perform recovery: Collect failed: Failed to create pipe: Too many open files
> To remedy this do as follows:
> Step 1: rm -f /tmp/mesos/meta/slaves/latest
>         This ensures slave doesn't recover old live executors.
> Step 2: Restart the slave.
> There is nothing at /tmp/mesos/meta/slaves/latest.
> The slave was part of a 3-node cluster before.
> When started as an upstart service, the process is relaunched over and over, and a large number of defunct processes appear, like these:
> root     30321  0.0  0.0  13000   440 ?        S    19:28   0:00 iptables --wait -L -n
> root     30322  0.0  0.0   4444   396 ?        S    19:28   0:00 sh -c docker inspect mesos-e1f538b4-993a-4cd4-99b0-d633c5e9dd55
> root     30328  0.0  0.0      0     0 ?        Z    19:28   0:00 [sh] <defunct>
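
One more note for anyone who lands here from the "Failed to create pipe: Too many open files" error above: a rough way out, assuming the stopped containers hold nothing you still need, is to clear them out and/or raise the slave's fd limit before restarting it. This is a sketch, not an official procedure, and the limit value below is an arbitrary example:

# Remove all exited containers so recovery has far fewer to inspect.
# This discards their logs and filesystem state -- only do it if that's acceptable.
docker rm $(docker ps -a -q --filter status=exited)

# Optionally raise the slave's fd limit, e.g. via a stanza in its upstart job:
limit nofile 8192 8192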



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)