Posted to issues@mesos.apache.org by "Adam B (JIRA)" <ji...@apache.org> on 2015/04/01 00:43:53 UTC

[jira] [Commented] (MESOS-2580) mesos slave resends TaskStatus as TASK_RUNNING after TASK_KILLED

    [ https://issues.apache.org/jira/browse/MESOS-2580?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389555#comment-14389555 ] 

Adam B commented on MESOS-2580:
-------------------------------

[~hgschmie] Where exactly do you see the TASK_KILLED status from your framework? If the TASK_RUNNING never got acknowledged by your framework, the slave will keep resending it. However, since the slave knows that the task has subsequently been killed, it sets the StatusUpdate.latest_state field so that the master can count the task as complete and its resources as available. Perhaps you were reading that field, or reading the status from master/state.json, rather than actually receiving a TASK_KILLED status update?
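
For illustration, here is a minimal standalone sketch of that resend
behaviour. It is not the real status_update_manager.cpp; the types and
names below are illustrative and only loosely mirror mesos.proto. The
idea: the slave keeps the unacknowledged updates of a task in order,
replays the oldest one on every retry timeout, and stamps the outgoing
message with the latest known state.

    #include <deque>
    #include <iostream>
    #include <string>

    enum TaskState { TASK_RUNNING, TASK_KILLED };

    struct StatusUpdate {
      std::string uuid;
      TaskState state;         // state carried by this particular update
      TaskState latest_state;  // stamped at send time with the newest state
    };

    struct TaskUpdates {
      std::deque<StatusUpdate> pending;  // unacknowledged updates, in order
      TaskState latest = TASK_RUNNING;

      void received(const std::string& uuid, TaskState state) {
        pending.push_back({uuid, state, state});
        latest = state;
      }

      // On a retry timeout the *oldest* unacknowledged update is resent,
      // which is why the framework sees TASK_RUNNING again even though
      // latest_state already says TASK_KILLED.
      StatusUpdate resend() const {
        StatusUpdate update = pending.front();
        update.latest_state = latest;
        return update;
      }

      // Acknowledgements are processed in order, front of the queue first.
      void acknowledged(const std::string& uuid) {
        if (!pending.empty() && pending.front().uuid == uuid) {
          pending.pop_front();
        }
      }
    };

    int main() {
      TaskUpdates task;
      task.received("ea431985", TASK_RUNNING);  // never acknowledged
      task.received("7f4fdc95", TASK_KILLED);

      StatusUpdate update = task.resend();
      // Resends the TASK_RUNNING update (state 0) with latest_state
      // TASK_KILLED (state 1), matching the log excerpt quoted below.
      std::cout << "resent state=" << update.state
                << " latest_state=" << update.latest_state << std::endl;
    }

Note that the resent update carries its original UUID (ea431985... in
the log below), not the UUID of the later TASK_KILLED update; the
replayed update is the oldest unacknowledged one, not the newest.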

> mesos slave resends TaskStatus as TASK_RUNNING after TASK_KILLED
> ----------------------------------------------------------------
>
>                 Key: MESOS-2580
>                 URL: https://issues.apache.org/jira/browse/MESOS-2580
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.22.0
>            Reporter: Henning Schmiedehausen
>            Assignee: Niklas Quarfot Nielsen
>             Fix For: 0.22.1, 0.23.0
>
>
> I believe that this has uncovered an actual bug in the slave's
> behavior:
> - The executor exits; the slave sends a TaskStatus "TASK_KILLED"
> message to the master.
> - The master sends the message to the framework.
> - The framework code fails to acknowledge the message (which is a
> legitimate error case, because messages can get lost).
> - After a timeout, the slave resends the TaskStatus message; however,
> it does so with state TASK_RUNNING, not TASK_KILLED.
> I could verify that behavior in 0.22.0 and on the current trunk.
> Here is my log file from the slave:
> https://gist.github.com/hgschmie/fc20b361a2cadeba0fbd
> - Slave shuts down executor:
> I0330 16:27:59.187947 18428 slave.cpp:1558] Asked to kill task
> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> framework Singularity_local
> I0330 16:27:59.239296 18425 slave.cpp:2508] Handling status update
> TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> framework Singularity_local from executor(1)@10.101.131.128:55377
> I0330 16:27:59.239320 18425 slave.cpp:4468] Terminating task
> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> I0330 16:27:59.240444 18424 status_update_manager.cpp:317] Received
> status update TASK_KILLED (UUID: 7f4fdc95-3a7d-474d-b4f1-b6da45e96396)
> for task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> of framework Singularity_local
> I0330 16:27:59.240613 18425 slave.cpp:2680] Status update manager
> successfully handled status update TASK_KILLED (UUID:
> 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> framework Singularity_local
> I0330 16:27:59.240623 18425 slave.cpp:2686] Sending acknowledgement
> for status update TASK_KILLED (UUID:
> 7f4fdc95-3a7d-474d-b4f1-b6da45e96396) for task
> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> framework Singularity_local to executor(1)@10.101.131.128:55377
> I0330 16:28:00.443361 18422 slave.cpp:3193] Executor
> 'MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT' of
> framework Singularity_local exited with status 0
> Now my framework receives the TASK_KILLED (I see that) but fails to
> send the ack message back to Mesos. After 30 seconds:
> W0330 16:28:30.097679 18426 status_update_manager.cpp:472] Resending
> status update TASK_RUNNING (UUID:
> ea431985-98de-4447-8668-fda26c95f040) for task
> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> framework Singularity_local
> I0330 16:28:30.097704 18426 status_update_manager.cpp:371] Forwarding
> update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
> task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> of framework Singularity_local to the slave
> I0330 16:28:30.097761 18426 slave.cpp:2753] Forwarding the update
> TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> framework Singularity_local to master@10.101.131.128:5050
> After another 80 seconds:
> W0330 16:29:50.098641 18425 status_update_manager.cpp:472] Resending
> status update TASK_RUNNING (UUID:
> ea431985-98de-4447-8668-fda26c95f040) for task
> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> framework Singularity_local
> I0330 16:29:50.098659 18425 status_update_manager.cpp:371] Forwarding
> update TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for
> task MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT
> of framework Singularity_local to the slave
> I0330 16:29:50.098737 18422 slave.cpp:2753] Forwarding the update
> TASK_RUNNING (UUID: ea431985-98de-4447-8668-fda26c95f040) for task
> MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT of
> framework Singularity_local to master@10.101.131.128:5050
> So for my task
> "MyFirstDeploy-a351e465-1427758039385-2-blackbox.group.on-DEFAULT",
> which is finished (the executor has exited), the slave is now
> resending "TASK_RUNNING". That cannot be right; it should be
> "TASK_KILLED".
> As a result, the state machine in my framework is now confused,
> because a task just transitioned from TASK_KILLED to TASK_RUNNING,
> which should not happen, because TASK_KILLED is a terminal state.
> So I guess that a bug in my framework code has exposed a bug in the
> Mesos slave's state-resending code.
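
Regardless of where the fix lands in Mesos, a framework can defend its
own bookkeeping against re-sent updates. Below is a minimal standalone
sketch of such a guard; it is not the reporter's code and not the real
scheduler API (the TaskState enum mirrors only a subset of Mesos's
states, and FrameworkStateMachine is a hypothetical name). Once a task
has reached a terminal state, a re-sent non-terminal update such as
the TASK_RUNNING above is dropped instead of confusing the state
machine.

    #include <iostream>
    #include <map>
    #include <string>

    enum TaskState { TASK_STAGING, TASK_RUNNING, TASK_FINISHED,
                     TASK_FAILED, TASK_KILLED, TASK_LOST };

    bool isTerminal(TaskState state) {
      return state == TASK_FINISHED || state == TASK_FAILED ||
             state == TASK_KILLED || state == TASK_LOST;
    }

    struct FrameworkStateMachine {
      std::map<std::string, TaskState> states;  // task ID -> last seen state

      // Returns true if the update was applied, false if it was dropped.
      bool onStatusUpdate(const std::string& taskId, TaskState state) {
        auto it = states.find(taskId);
        if (it != states.end() &&
            isTerminal(it->second) && !isTerminal(state)) {
          return false;  // a terminal state never transitions back
        }
        states[taskId] = state;
        return true;
      }
    };

    int main() {
      FrameworkStateMachine fsm;
      fsm.onStatusUpdate("MyFirstDeploy", TASK_RUNNING);
      fsm.onStatusUpdate("MyFirstDeploy", TASK_KILLED);

      // The re-sent TASK_RUNNING from the log above is dropped here.
      bool applied = fsm.onStatusUpdate("MyFirstDeploy", TASK_RUNNING);
      std::cout << std::boolalpha << "applied=" << applied << std::endl;
    }

The underlying cause remains the missing acknowledgement: once the
framework acknowledges the pending TASK_RUNNING update, the queued
TASK_KILLED can be delivered.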



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)