Posted to issues@mesos.apache.org by "Joseph Wu (JIRA)" <ji...@apache.org> on 2016/05/17 17:47:12 UTC

[jira] [Commented] (MESOS-5395) Task getting stuck in staging state when launched on a rebooted slave.

    [ https://issues.apache.org/jira/browse/MESOS-5395?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15287096#comment-15287096 ] 

Joseph Wu commented on MESOS-5395:
----------------------------------

The log messages you're seeing come from the framework telling Mesos to kill said tasks.  There might be something else going on that's preventing your task from launching after an agent failover.
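
For reference, that master log line is produced when a framework scheduler calls killTask() on its driver, which is in effect what Marathon (the framework here) is doing. A minimal sketch of that call path using the old Python bindings (assumes the mesos.interface/mesos.native packages are installed; the master address and task ID below are placeholders, and a kill only affects tasks that the calling framework itself launched):

    from mesos.interface import Scheduler, mesos_pb2
    from mesos.native import MesosSchedulerDriver

    class NoopScheduler(Scheduler):
        """No callbacks needed; just enough to construct a driver."""

    framework = mesos_pb2.FrameworkInfo()
    framework.user = ""   # empty string: let Mesos use the current user
    framework.name = "kill-task-sketch"

    # Placeholder master address; point this at one of your masters.
    driver = MesosSchedulerDriver(NoopScheduler(), framework, "master-host:5050")
    driver.start()

    task_id = mesos_pb2.TaskID()
    task_id.value = "some-task-id"  # placeholder; a real scheduler kills its own tasks

    # Sends the kill to the master, which logs "Telling slave ... to kill
    # task ..." and forwards the request to the agent running the task.
    driver.killTask(task_id)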

Can you also share:
* The resources of your agents (one way to pull these, along with task states, is sketched after this list)
* Full master/agent/Marathon logs before/during/after the event
* Full stdout/stderr files for the task in question
* Your Marathon app definition
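
If it helps with gathering these, here is a small sketch that pulls agent resources and per-task states from the master's /state.json endpoint (Python 3 standard library only; the master host is a placeholder, and the field names assume the usual Mesos 0.28 state.json layout):

    import json
    from urllib.request import urlopen

    MASTER = "http://master-host:5050"  # placeholder; use one of your masters

    # /state.json reports registered agents, their resources, and the
    # tasks of every framework, including tasks stuck in TASK_STAGING.
    state = json.loads(urlopen(MASTER + "/state.json").read().decode("utf-8"))

    for slave in state.get("slaves", []):
        print(slave["hostname"], slave["resources"])

    for framework in state.get("frameworks", []):
        for task in framework.get("tasks", []):
            print(task["name"], task["state"], task["slave_id"])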

> Task getting stuck in staging state when launched on a rebooted slave.
> ----------------------------------------------------------------------
>
>                 Key: MESOS-5395
>                 URL: https://issues.apache.org/jira/browse/MESOS-5395
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.28.0
>         Environment: Mesos/Marathon cluster, 3 masters / 4 slaves
> Mesos 0.28.0, Marathon 0.15.2
>            Reporter: Mengkui gong
>
> After rebooting a slave, a task launched through Marathon starts fine on the other slaves, but if it is launched on the rebooted slave it gets stuck: the Mesos UI shows it in the staging state in the active tasks list, and the Marathon UI shows it as deploying. It can stay stuck for more than 2 hours. After that time, Marathon automatically launches the task on the rebooted slave or another slave as normal, so the rebooted slave also recovers.
> In the Mesos master log, I can see "telling slave to kill task" repeatedly:
> I0517 15:25:27.207237 20568 master.cpp:3826] Telling slave 282745ab-423a-4350-a449-3e8cdfccfb93-S1 at slave(1)@10.254.234.236:5050 (mesos-slave-3) to kill task project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 (marathon) at scheduler-fe615b72-ab92-49ca-89e6-e74e600c7e15@10.254.228.3:56757.
> In the rebooted slave's log, I can see:
> May 17 15:28:37 euca-10-254-234-236 mesos-slave[829]: I0517 15:28:37.206831   916 slave.cpp:1891] Asked to kill task project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000
> May 17 15:28:37 euca-10-254-234-236 mesos-slave[829]: W0517 15:28:37.206866   916 slave.cpp:2018] Ignoring kill task project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e because the executor 'project-hub_project-hub-frontend.b645f24b-1c1f-11e6-bb25-d00d2cce797e' of framework 17cd3756-1d59-4dfc-984d-3fe09f6b5730-0000 is terminating/terminated.


