Posted to issues@mesos.apache.org by "Andrei Budnik (Jira)" <ji...@apache.org> on 2019/12/06 12:45:00 UTC
[jira] [Commented] (MESOS-10066) mesos-docker-executor process dies
when agent stops. Recovery fails when agent returns
[ https://issues.apache.org/jira/browse/MESOS-10066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16989728#comment-16989728 ]
Andrei Budnik commented on MESOS-10066:
---------------------------------------
Could you please attach full agent logs?
> mesos-docker-executor process dies when agent stops. Recovery fails when agent returns
> --------------------------------------------------------------------------------------
>
> Key: MESOS-10066
> URL: https://issues.apache.org/jira/browse/MESOS-10066
> Project: Mesos
> Issue Type: Bug
> Components: agent, containerization, docker, executor
> Affects Versions: 1.7.3
> Reporter: Dalton Matos Coelho Barreto
> Priority: Critical
>
> Hello all,
> The documentation about agent recovery shows two conditions for recovery to be possible:
> - The agent must have recovery enabled (default true?);
> - The framework (scheduler) must register itself with checkpointing enabled.
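> For reference, these seem to be the relevant agent flags (names and defaults as documented for Mesos 1.7; only a sketch of what I believe my agent runs with):
> {noformat}
> # Agent-side recovery settings (documented defaults):
> mesos-agent --recover=reconnect \        # reconnect with live executors
>             --recovery_timeout=15mins \  # how long executors wait for the agent to come back
>             --strict=true \              # treat recovery errors as fatal
>             --work_dir=/var/lib/mesos/agent
> {noformat}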
> In my tests I'm using Marathon as the scheduler, and Mesos itself sees Marathon as a checkpoint-enabled scheduler:
> {noformat}
> $ curl -sL 10.234.172.27:5050/state | jq '.frameworks[] | {"name": .name, "id": .id, "checkpoint": .checkpoint, "active": .active}'
> {
> "name": "asgard-chronos",
> "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0001",
> "checkpoint": true,
> "active": true
> }
> {
> "name": "marathon",
> "id": "4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000",
> "checkpoint": true,
> "active": true
> }
> {noformat}
> Here is what I'm using:
> # Mesos Master, 1.4.1
> # Mesos Agent 1.7.3
> # Using docker image {{mesos/mesos-centos:1.7.x}}
> # Docker sock mounted from the host
> # Docker binary also mounted from the host
> # Marathon: 1.4.12
> # Docker (output of {{docker version}}):
> {noformat}
> Client: Docker Engine - Community
>  Version:           19.03.5
>  API version:       1.39 (downgraded from 1.40)
>  Go version:        go1.12.12
>  Git commit:        633a0ea838
>  Built:             Wed Nov 13 07:22:05 2019
>  OS/Arch:           linux/amd64
>  Experimental:      false
>
> Server: Docker Engine - Community
>  Engine:
>   Version:          18.09.2
>   API version:      1.39 (minimum version 1.12)
>   Go version:       go1.10.6
>   Git commit:       6247962
>   Built:            Sun Feb 10 03:42:13 2019
>   OS/Arch:          linux/amd64
>   Experimental:     false
> {noformat}
> h2. The problem
> Here is the Marathon test app, a simple {{sleep 99d}} based on the {{debian}} Docker image.
> {noformat}
> {
> "id": "/sleep",
> "cmd": "sleep 99d",
> "cpus": 0.1,
> "mem": 128,
> "disk": 0,
> "instances": 1,
> "constraints": [],
> "acceptedResourceRoles": [
> "*"
> ],
> "container": {
> "type": "DOCKER",
> "volumes": [],
> "docker": {
> "image": "debian",
> "network": "HOST",
> "privileged": false,
> "parameters": [],
> "forcePullImage": true
> }
> },
> "labels": {},
> "portDefinitions": []
> }
> {noformat}
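> For completeness, an app definition like this can be POSTed to Marathon's standard {{/v2/apps}} endpoint (host below is hypothetical):
> {noformat}
> curl -X POST http://marathon.example.com:8080/v2/apps \
>      -H 'Content-Type: application/json' \
>      -d @sleep.json
> {noformat}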
> This task runs fine and gets scheduled on the right agent, which is running Mesos agent 1.7.3 (using the Docker image {{mesos/mesos-centos:1.7.x}}).
> Here is a sample log:
> {noformat}
> mesos-slave_1 | I1205 13:24:21.391464 19849 slave.cpp:2403] Authorizing task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:21.392707 19849 slave.cpp:2846] Launching task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:21.392895 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1 | I1205 13:24:21.394399 19849 paths.cpp:748] Creating sandbox '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1 | I1205 13:24:21.394918 19849 slave.cpp:9068] Launching executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 with resources [{"allocation_info":{"role":"*"},"name":"cpus","scalar":{"value":0.1},"type":"SCALAR"},{"allocation_info":{"role":"*"},"name":"mem","scalar":{"value":32.0},"type":"SCALAR"}] in work directory '/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1 | I1205 13:24:21.396499 19849 slave.cpp:3078] Queued task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:21.397038 19849 slave.cpp:3526] Launching container 53ec0ef3-3290-476a-b2b6-385099e9b923 for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:21.398028 19846 docker.cpp:1177] Starting container '53ec0ef3-3290-476a-b2b6-385099e9b923' for task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' (and executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f') of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | W1205 13:24:22.576869 19846 slave.cpp:8496] Failed to get resource statistics for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to run '/usr/bin/docker -H unix:///var/run/docker.sock inspect --type=container mesos-53ec0ef3-3290-476a-b2b6-385099e9b923': exited with status 1; stderr='Error: No such container: mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
> mesos-slave_1 | I1205 13:24:24.094985 19853 docker.cpp:792] Checkpointing pid 12435 to '/var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/forked.pid'
> mesos-slave_1 | I1205 13:24:24.343099 19848 slave.cpp:4839] Got registration for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
> mesos-slave_1 | I1205 13:24:24.345593 19848 docker.cpp:1685] Ignoring updating container 53ec0ef3-3290-476a-b2b6-385099e9b923 because resources passed to update are identical to existing resources
> mesos-slave_1 | I1205 13:24:24.345945 19848 slave.cpp:3296] Sending queued task 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
> mesos-slave_1 | I1205 13:24:24.362699 19853 slave.cpp:5310] Handling status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
> mesos-slave_1 | I1205 13:24:24.363222 19853 task_status_update_manager.cpp:328] Received task status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:24.364115 19853 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:24.364993 19847 slave.cpp:5815] Forwarding the update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to master@10.234.172.34:5050
> mesos-slave_1 | I1205 13:24:24.365594 19847 slave.cpp:5726] Sending acknowledgement for status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
> mesos-slave_1 | I1205 13:24:24.401759 19846 task_status_update_manager.cpp:401] Received task status update acknowledgement (UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:24.401926 19846 task_status_update_manager.cpp:842] Checkpointing ACK for task status update TASK_STARTING (Status UUID: 8b06fe8c-d709-453b-817d-2948e50782c9) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:30.481829 19850 slave.cpp:5310] Handling status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from executor(1)@10.234.172.56:16653
> mesos-slave_1 | I1205 13:24:30.482340 19848 task_status_update_manager.cpp:328] Received task status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:30.482550 19848 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:30.483163 19848 slave.cpp:5815] Forwarding the update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to master@10.234.172.34:5050
> mesos-slave_1 | I1205 13:24:30.483664 19848 slave.cpp:5726] Sending acknowledgement for status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 to executor(1)@10.234.172.56:16653
> mesos-slave_1 | I1205 13:24:30.557307 19852 task_status_update_manager.cpp:401] Received task status update acknowledgement (UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:24:30.557467 19852 task_status_update_manager.cpp:842] Checkpointing ACK for task status update TASK_RUNNING (Status UUID: 480abd48-6a85-4703-a2f2-0455f5c4c053) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> {noformat}
> An important part of this log:
> {noformat}
> Sending acknowledgement for status update TASK_RUNNING for task sleep.8c187c41-1762-11ea-a2e5-02429217540f to executor(1)@10.234.172.56:16653
> {noformat}
> Here we have the executor's address that is responsible for our newly created task: {{executor(1)@10.234.172.56:16653}}.
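> While the executor is alive, the owner of that port can be cross-checked against the process list (port taken from the log above; this is just a sketch):
> {noformat}
> # Which process is listening on the executor's libprocess port?
> sudo ss -tlnp | grep 16653
> # Expected owner: the mesos-docker-executor process (PID 12435 below).
> {noformat}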
> Looking at the OS process list, we see one instance of {{mesos-docker-executor}}:
> {noformat}
> ps aux | grep executor
> root 12435 0.5 1.3 851796 54100 ? Ssl 10:24 0:03 mesos-docker-executor --cgroups_enable_cfs=true --container=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 --docker=/usr/bin/docker --docker_socket=/var/run/docker.sock --help=false --initialize_driver_logging=true --launcher_dir=/usr/libexec/mesos --logbufsecs=0 --logging_level=INFO --mapped_directory=/mnt/mesos/sandbox --quiet=false --sandbox_directory=/var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923 --stop_timeout=10secs
> root 12456 0.0 0.8 152636 34548 ? Sl 10:24 0:00 /usr/bin/docker -H unix:///var/run/docker.sock run --cpu-shares 102 --cpu-quota 10000 --memory 134217728 -e HOST=10.234.172.56 -e MARATHON_APP_DOCKER_IMAGE=debian -e MARATHON_APP_ID=/sleep -e MARATHON_APP_LABELS=HOLLOWMAN_DEFAULT_SCALE -e MARATHON_APP_LABEL_HOLLOWMAN_DEFAULT_SCALE=1 -e MARATHON_APP_RESOURCE_CPUS=0.1 -e MARATHON_APP_RESOURCE_DISK=0.0 -e MARATHON_APP_RESOURCE_GPUS=0 -e MARATHON_APP_RESOURCE_MEM=128.0 -e MARATHON_APP_VERSION=2019-12-05T13:24:18.206Z -e MESOS_CONTAINER_NAME=mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 -e MESOS_SANDBOX=/mnt/mesos/sandbox -e MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f -v /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923:/mnt/mesos/sandbox --net host --entrypoint /bin/sh --name mesos-53ec0ef3-3290-476a-b2b6-385099e9b923 --label=hollowman.appname=/asgard/sleep --label=MESOS_TASK_ID=sleep.8c187c41-1762-11ea-a2e5-02429217540f debian -c sleep 99d
> root 16776 0.0 0.0 7960 812 pts/3 S+ 10:34 0:00 tail -f /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
> {noformat}
> Here we have 3 processes (a parentage check is sketched after the list):
> - The Mesos executor ({{mesos-docker-executor}});
> - The {{docker run}} client process that the executor launched for the {{sleep 99d}} container;
> - A {{tail -f}} looking into the {{stderr}} of this executor.
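> To check their parentage (PIDs taken from the listing above):
> {noformat}
> # Parent PIDs of the executor and of the docker CLI it spawned:
> ps -o pid,ppid,cmd -p 12435,12456
> # Or the full ancestry of the executor:
> pstree -ps 12435
> {noformat}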
> The {{stderr}} content is as follows:
> {noformat}
> $ tail -f /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
> I1205 13:24:24.337672 12435 exec.cpp:162] Version: 1.7.3
> I1205 13:24:24.347107 12449 exec.cpp:236] Executor registered on agent 79ad3a13-b567-4273-ac8c-30378d35a439-S60499
> I1205 13:24:24.349907 12451 executor.cpp:130] Registered docker executor on 10.234.172.56
> I1205 13:24:24.350291 12449 executor.cpp:186] Starting task sleep.8c187c41-1762-11ea-a2e5-02429217540f
> WARNING: Your kernel does not support swap limit capabilities or the cgroup is not mounted. Memory limited without swap.
> W1205 13:24:29.363168 12455 executor.cpp:253] Docker inspect timed out after 5secs for container 'mesos-53ec0ef3-3290-476a-b2b6-385099e9b923'
> {noformat}
> While the Mesos agent is still running, it is possible to connect to this executor port:
> {noformat}
> telnet 10.234.172.56 16653
> Trying 10.234.172.56...
> Connected to 10.234.172.56.
> Escape character is '^]'.
> ^]
> telnet> Connection closed.
> {noformat}
> As soon as the agent shuts down (a simple {{docker stop}}), this line appears in the {{stderr}} log of the executor:
> {noformat}
> I1205 13:40:31.160290 12452 exec.cpp:518] Agent exited, but framework has checkpointing enabled. Waiting 15mins to reconnect with agent 79ad3a13-b567-4273-ac8c-30378d35a439-S60499
> {noformat}
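> The 15 minutes matches the agent's default {{--recovery_timeout}}, but the executor does not actually survive that long. A crude way to watch it die (PID from the listing above; the agent container name is elided):
> {noformat}
> docker stop <mesos-agent-container>             # stop the containerized agent
> watch -n1 'ps -p 12435 -o pid,ppid,etime,cmd'   # executor disappears immediately
> {noformat}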
> And now, looking at the OS process list, the executor process is gone:
> {noformat}
> $ ps aux | grep executor
> root 16776 0.0 0.0 7960 812 pts/3 S+ 10:34 0:00 tail -f /var/lib/mesos/agent/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/stderr
> {noformat}
> and we can't telnet to that port anymore:
> {noformat}
> telnet 10.234.172.56 16653
> Trying 10.234.172.56...
> telnet: Unable to connect to remote host: Connection refused
> {noformat}
> When we run the Mesos Agent again, we see this:
> {noformat}
> mesos-slave_1 | I1205 13:42:39.925992 18421 docker.cpp:899] Recovering Docker containers
> mesos-slave_1 | I1205 13:42:44.006150 18421 docker.cpp:912] Got the list of Docker containers
> mesos-slave_1 | I1205 13:42:44.010700 18421 docker.cpp:1009] Recovering container '53ec0ef3-3290-476a-b2b6-385099e9b923' for executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | W1205 13:42:44.011874 18421 docker.cpp:1053] Failed to connect to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000: Failed to connect to 10.234.172.56:16653: Connection refused
> mesos-slave_1 | I1205 13:42:44.016306 18421 docker.cpp:2564] Executor for container 53ec0ef3-3290-476a-b2b6-385099e9b923 has exited
> mesos-slave_1 | I1205 13:42:44.016412 18421 docker.cpp:2335] Destroying container 53ec0ef3-3290-476a-b2b6-385099e9b923 in RUNNING state
> mesos-slave_1 | I1205 13:42:44.016705 18421 docker.cpp:1133] Finished processing orphaned Docker containers
> mesos-slave_1 | I1205 13:42:44.017102 18421 docker.cpp:2385] Running docker stop on container 53ec0ef3-3290-476a-b2b6-385099e9b923
> mesos-slave_1 | I1205 13:42:44.017251 18426 slave.cpp:7205] Recovering executors
> mesos-slave_1 | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
> mesos-slave_1 | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653', connect: Failed to connect to 10.234.172.56:16653: Connection refused
> mesos-slave_1 | I1205 13:42:44.019569 18422 slave.cpp:6336] Executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 has terminated with unknown status
> mesos-slave_1 | I1205 13:42:44.022147 18422 slave.cpp:5310] Handling status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 from @0.0.0.0:0
> mesos-slave_1 | W1205 13:42:44.023823 18423 docker.cpp:1672] Ignoring updating unknown container 53ec0ef3-3290-476a-b2b6-385099e9b923
> mesos-slave_1 | I1205 13:42:44.024247 18425 task_status_update_manager.cpp:328] Received task status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:42:44.024363 18425 task_status_update_manager.cpp:842] Checkpointing UPDATE for task status update TASK_FAILED (Status UUID: c946cd2b-1dec-4fc3-9ea6-501b40dcf4a7) for task sleep.8c187c41-1762-11ea-a2e5-02429217540f of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000
> mesos-slave_1 | I1205 13:42:46.020193 18423 slave.cpp:5238] Cleaning up un-reregistered executors
> mesos-slave_1 | I1205 13:42:46.020882 18423 slave.cpp:7358] Finished recovery
> {noformat}
> And then a new container is scheduled to run. Marathon shows a status of {{TASK_FAILED}} with message {{Container terminated}} for the previous task.
> Pay special attention to the part where the agent tries to reconnect to the executor's port:
> {noformat}
> mesos-slave_1 | I1205 13:42:44.017493 18426 slave.cpp:7229] Sending reconnect request to executor 'sleep.8c187c41-1762-11ea-a2e5-02429217540f' of framework 4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000 at executor(1)@10.234.172.56:16653
> mesos-slave_1 | W1205 13:42:44.018522 18426 process.cpp:1890] Failed to send 'mesos.internal.ReconnectExecutorMessage' to '10.234.172.56:16653', connect: Failed to connect to 10.234.172.56:16653: Connection refused
> {noformat}
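> For what it's worth, the checkpointed executor state that the agent uses for this reconnect lives under the meta directory (path copied from the checkpoint line in the first log):
> {noformat}
> # pids checkpointed for this executor run:
> ls -l /var/lib/mesos/agent/meta/slaves/79ad3a13-b567-4273-ac8c-30378d35a439-S60499/frameworks/4783cf15-4fb1-4c75-90fe-44eeec5258a7-0000/executors/sleep.8c187c41-1762-11ea-a2e5-02429217540f/runs/53ec0ef3-3290-476a-b2b6-385099e9b923/pids/
> # forked.pid holds the checkpointed executor PID (12435);
> # libprocess.pid holds the address the agent reconnects to
> # (executor(1)@10.234.172.56:16653).
> {noformat}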
> Looking at the Mesos agent recovery documentation, the setup seems very straightforward. I'm using the default {{mesos-docker-executor}} to run all Docker tasks. Is there any detail that I'm not seeing, or is something misconfigured in my setup?
> Can anyone confirm that the {{mesos-docker-executor}} is capable of doing task recovery?
> Thank you very much; if you need any additional information, let me know.
> Thanks,
--
This message was sent by Atlassian Jira
(v8.3.4#803005)