Posted to issues@mesos.apache.org by "Rob Johnson (JIRA)" <ji...@apache.org> on 2016/08/16 12:04:20 UTC

[jira] [Commented] (MESOS-4084) mesos-slave assigned marathon task wrongly to chronos framework after task failure

    [ https://issues.apache.org/jira/browse/MESOS-4084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15422638#comment-15422638 ] 

Rob Johnson commented on MESOS-4084:
------------------------------------

We've been hit by this bug in the last couple of days: specifically, Chronos was receiving status updates about a task owned by Marathon (note that the task had already finished). To resolve the situation, we had to clear the Mesos agent's work directory and restart the agent.

Here are some relevant lines picked out of the agent log and the sandbox's stderr:

Timeline:

{noformat}
2016-08-14 20:58:20,820 the application dies
{noformat}

The agent notices its death:

{noformat}
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 20:58:23.803419 29833 slave.cpp:3012] Handling status update TASK_FAILED (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 from executor(1)@1.2.3.4:32870
...
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 20:58:24.812420 29858 slave.cpp:3871] Executor 'myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c' of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 exited with status 0
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 20:58:24.813383 29840 status_update_manager.cpp:320] Received status update TASK_FAILED (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 20:58:24.813478 29840 status_update_manager.cpp:824] Checkpointing UPDATE for status update TASK_FAILED (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 20:58:24.841326 29849 slave.cpp:3410] Forwarding the update TASK_FAILED (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 to master@1.2.3.5:5050
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 20:58:24.841840 29849 slave.cpp:3320] Sending acknowledgement for status update TASK_FAILED (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 to executor(1)@1.2.3.4:32870
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 20:58:30.629101 29864 status_update_manager.cpp:392] Received status update acknowledgement (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 20:58:30.629396 29864 status_update_manager.cpp:824] Checkpointing ACK for status update TASK_FAILED (UUID: 7efdda87-2b3b-4ae8-92b7-3cb16f5fe0e4) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 20:58:30.651621 29838 slave.cpp:3975] Cleaning up executor 'myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c' of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000 at executor(1)@1.2.3.4:32870
{noformat}

However, in the meantime, the task is still within its health-check grace period, so health checks keep running and their failures are ignored, even though the Docker container has already been killed (see the sketch after this log for what the interleaved daemon errors correspond to):
{noformat}
I0814 20:58:29.862861 30341 main.cpp:98] Ignoring failure as health check still in grace period
Error response from daemon: Container mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1 is not running
I0814 20:58:39.930883 30328 main.cpp:98] Ignoring failure as health check still in grace period
Error response from daemon: Container mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1 is not running
I0814 20:58:49.996784 30332 main.cpp:98] Ignoring failure as health check still in grace period
Error response from daemon: Container mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1 is not running
I0814 20:59:00.061403 30346 main.cpp:98] Ignoring failure as health check still in grace period
Error response from daemon: Container mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1 is not running
I0814 20:59:10.126847 30325 main.cpp:98] Ignoring failure as health check still in grace period
{noformat}
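
For what it's worth, the interleaved "Error response from daemon" lines are exactly what the Docker daemon returns when a command is exec'd into a container that has already stopped. A minimal way to reproduce that message by hand (the container ID is taken from the log above; the actual health-check command run by the executor is an assumption on our side):

{noformat}
# Any `docker exec` against the stopped container produces the daemon error seen
# above; the health checker ignores it while the grace period has not elapsed.
docker exec mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1 \
  sh -c 'curl -f http://localhost:8080/health'   # health endpoint is hypothetical
# -> Error response from daemon: Container mesos-a1951f98-...-S33.5ad311ef-... is not running
{noformat}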

From there, the sandbox stderr contains a number of entries that look like this:

{noformat}
E0814 20:59:20.191401 30349 process.cpp:1958] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Error response from daemon: Container mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1 is not running
E0814 20:59:30.353951 30349 process.cpp:1958] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Error response from daemon: Container mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1 is not running
E0814 20:59:40.413733 30349 process.cpp:1958] Failed to shutdown socket with fd 11: Transport endpoint is not connected
Error response from daemon: Container mesos-a1951f98-b39c-482f-8afc-6dcb3cba7a24-S33.5ad311ef-a023-44fa-b261-10ea736b24c1 is not running
{noformat}


This continues for roughly five minutes; these are the last lines of the stderr log:

{noformat}
E0814 21:04:13.749197 30349 process.cpp:1958] Failed to shutdown socket with fd 11: Transport endpoint is not connected
W0814 21:04:14.749591 30309 main.cpp:387] Health check failed Health command check exited with status 1
{noformat}

However, the agent then handles a number of TASK_RUNNING status updates claiming the task is still running, and they are attributed to a different framework ID (20151218-172703-1745234954-5050-16740-0000, rather than the f52e09ed-169c-42a2-9627-60815851d9d2-0000 seen earlier). A quick way to confirm the mismatch is shown after the excerpt:

{noformat}
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 21:00:20.850172 29860 slave.cpp:3012] Handling status update TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000 from executor(1)@1.2.3.4:32870
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:W0814 21:00:20.850363 29860 slave.cpp:3092] Could not find the executor for status update TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 21:00:20.850623 29852 status_update_manager.cpp:320] Received status update TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 21:00:20.851411 29849 slave.cpp:3410] Forwarding the update TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000 to master@1.2.3.5:5050
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 21:00:20.851662 29849 slave.cpp:3320] Sending acknowledgement for status update TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000 to executor(1)@1.2.3.4:32870
{noformat}
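
A quick way to confirm the framework-ID mismatch from the agent log (the log file name and task ID are taken from the lines above; adjust them for your own host):

{noformat}
# Print every framework ID the agent has attributed this task's status updates to.
# Two different IDs for the same task is the misattribution described in this ticket.
grep 'Handling status update' mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832 \
  | grep 'myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c' \
  | grep -o 'of framework [^ ]*' \
  | sort -u
# In our case this prints both:
#   of framework 20151218-172703-1745234954-5050-16740-0000
#   of framework f52e09ed-169c-42a2-9627-60815851d9d2-0000
{noformat}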

From here, the agent keeps resending the TASK_RUNNING update until we move the agent state aside and restart it (the workaround we used is sketched after this log):
{noformat}
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:W0814 21:00:50.853273 29860 status_update_manager.cpp:475] Resending status update TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000
mesos-slave.hostname.invalid-user.log.INFO.20160621-111458.29832:I0814 21:00:50.853715 29838 slave.cpp:3410] Forwarding the update TASK_RUNNING (UUID: 6c0e3a29-6323-4de1-8f82-7472bd390252) for task myservice.git034740b1.configc9d21b00.7e4463a5-629c-11e6-a691-764fd3cce37c in health state unhealthy of framework 20151218-172703-1745234954-5050-16740-0000 to master@1.2.3.5:5050
{noformat}
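
For reference, the workaround we applied is essentially the same as the one in the original report below; a sketch, assuming the agent runs as a system service and uses the default work directory (check the agent's --work_dir flag before deleting or moving anything):

{noformat}
# Stop the agent, move its checkpointed state aside, and start it again.
service mesos-slave stop
mv /tmp/mesos /tmp/mesos.broken.$(date +%s)   # path depends on --work_dir; default shown
service mesos-slave start
{noformat}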

> mesos-slave assigned marathon task wrongly to chronos framework after task failure
> ----------------------------------------------------------------------------------
>
>                 Key: MESOS-4084
>                 URL: https://issues.apache.org/jira/browse/MESOS-4084
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.22.2
>         Environment: Ubuntu 14.04.2 LTS
> Mesos 0.22.2
> Marathon 0.11.0
> Chronos 2.4.0
>            Reporter: Erhan Kesken
>            Priority: Minor
>
> I don't know how to reproduce the problem; the only thing I can do is share my logs:
> https://gist.github.com/ekesken/f2edfd65cca8638b0136
> These are highlights from my logs:
> mesos-slave logs:
> {noformat}
> Dec  7 08:58:27 mesos-slave-node-012 mesos-slave[56099]: I1207 08:58:27.089156 56130 slave.cpp:2531] Handling status update TASK_FAILED (UUID: 5b335fab-1722-4270-83a6-b4ec843be47f) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 of framework 20151113-112010-100670892-5050-7957-0001 from executor(1)@172.29.1.12:1651
> 08:58:27 mesos-slave-node-012 mesos-slave[56099]: E1207 08:58:27.089874 56128 slave.cpp:2662] Failed to update resources for container ed5f4f67-464d-4786-9628-cd48732de6b7 of executor collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 running task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 on status update for terminal task, destroying container: Failed to determine cgroup for the 'cpu' subsystem: Failed to read /proc/34074/cgroup: Failed to open file '/proc/34074/cgroup': No such file or directory
> {noformat}
> Notice the framework ID above; then, 5 minutes later, we got the following logs:
> {noformat}
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 09:03:27.653187 56130 slave.cpp:2531] Handling status update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001 from executor(1)@172.29.1.12:1651
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: W1207 09:03:27.653282 56130 slave.cpp:2568] Could not find the executor for status update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 09:03:27.653390 56130 status_update_manager.cpp:317] Received status update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 09:03:27.653543 56130 slave.cpp:2776] Forwarding the update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001 to master@172.29.0.5:5050
> Dec  7 09:03:27 mesos-slave-node-012 mesos-slave[56099]: I1207 09:03:27.653688 56130 slave.cpp:2709] Sending acknowledgement for status update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001 to executor(1)@172.29.1.12:1651
> Dec  7 09:03:37 mesos-slave-node-012 mesos-slave[56099]: W1207 09:03:37.654337 56134 status_update_manager.cpp:472] Resending status update TASK_RUNNING (UUID: 81aee6b0-2b9d-470a-a543-f14f7cae699b) for task collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 in health state unhealthy of framework 20150624-210230-117448108-5050-3678-0001
> {noformat} 
> This immediately caused Chronos to be deactivated, as seen in the mesos-master log:
> {noformat}
> Dec  7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207 09:03:27.654770 40948 master.cpp:1964] Deactivating framework 20150624-210230-117448108-5050-3678-0001 (chronos-2.4.0) at scheduler-7a4396f7-1f68-4f41-901e-805db5de0432@172.29.0.6:11893
> Dec  7 09:03:27 mesos-master-node-001 mesos-master[40898]: I1207 09:03:27.654989 40948 hierarchical.hpp:400] Deactivated framework 20150624-210230-117448108-5050-3678-0001
> {noformat}
> There was no clue in the Chronos log at that time, but when we restarted the Chronos master, we got these logs:
> {noformat}
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: Exception in thread "Thread-62550" scala.MatchError: collector_tr_insurance_ebv_facebookscraper.ab3ddc6b-9cc0-11e5-8f21-0242ec411128 (of class java.lang.String)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at org.apache.mesos.chronos.scheduler.jobs.TaskUtils$.parseTaskId(TaskUtils.scala:132)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at org.apache.mesos.chronos.scheduler.mesos.MesosJobFramework.statusUpdate(MesosJobFramework.scala:215)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at sun.reflect.GeneratedMethodAccessor93.invoke(Unknown Source)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at java.lang.reflect.Method.invoke(Method.java:497)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at com.google.inject.internal.DelegatingInvocationHandler.invoke(DelegatingInvocationHandler.java:37)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: #011at com.sun.proxy.$Proxy31.statusUpdate(Unknown Source)
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238782 15035 sched.cpp:1623] Asked to abort the driver
> Dec  7 12:35:25 mesos-master-node-003 chronos[15003]: I1207 12:35:25.238836 15035 sched.cpp:856] Aborting framework '20150624-210230-117448108-5050-3678-0001'
> {noformat}
> The problem is solved by running the following commands on the problematic slave:
> {noformat}
> service mesos-slave stop
> rm -fr /tmp/mesos
> service mesos-slave start
> {noformat}
> I needed to restart the Chronos master to reactivate the framework; I don't know if there is a way to reactivate a framework manually rather than restarting the master node.
> I guess the issue https://github.com/mesos/chronos/issues/569 opened in the Chronos project is related to this bug too.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)