Posted to issues@mesos.apache.org by "Henrik Hobein (JIRA)" <ji...@apache.org> on 2015/02/11 14:22:11 UTC

[jira] [Commented] (MESOS-2334) Tasks get stuck in TASK_STAGING after a network decode error

    [ https://issues.apache.org/jira/browse/MESOS-2334?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14316186#comment-14316186 ] 

Henrik Hobein commented on MESOS-2334:
--------------------------------------

We used Mesos version 0.21.0 for both master and nodes.


> Tasks get stuck in TASK_STAGING after a network decode error
> ------------------------------------------------------------
>
>                 Key: MESOS-2334
>                 URL: https://issues.apache.org/jira/browse/MESOS-2334
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Andreas Raster
>
> We ran a test case that uses launchTasks to schedule a large number of small CommandInfo tasks on a cluster (shell commands that look like this: "sleep `shuf -i 2-3 -n 1`; echo foo >> /share/bar"). Sometimes a single task that had been launched and set to TASK_STAGING would never receive a TASK_RUNNING message (or any other message at all). It would then stay in TASK_STAGING indefinitely until we killed the framework.
> We asked in #mesos on freenode about this and got an answer from alexr_:
> [15:56:55] <alexr_> henno: thanks for the slave logs
> [15:58:47] <alexr_> henno: it looks from the logs, that the slave successfully registers the executor and sends the task
> [15:59:30] <alexr_> the executor, for some reason, refuses to start the task, most probably because of the message decoding error
> In other words, he suspects that the cause is a network decoding error. I am not yet sure what he means by that, and I wasn't the one talking to alexr_ on IRC, so I cannot post the exact log section that indicates the decoding error. I'll attach the logs that we supplied to alexr_, though, so those should contain the relevant information.
> The name of the task in question was: 727527fc-a3f3-418d-a44e-ec3bbdd26315
> cat /var/log/mesos/mesos-slave.INFO | grep 727527fc-a3f3-418d-a44e-ec3bbdd26315
> >> http://paste.ubuntu.com/10160270/
> cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stderr 
> >> http://paste.ubuntu.com/10160335/
> cat /tmp/mesos/slaves/20141217-133241-2867204268-5050-12776-S1/frameworks/20150209-153125-2867204268-5050-2553-0025/executors/29cde3b3-994a-4480-b10e-c49b4fc6c706+0+727527fc-a3f3-418d-a44e-ec3bbdd26315/runs/d73a76e7-aec6-4760-bd48-86b79df89d52/stdout
> >> http://paste.ubuntu.com/10160346/
> Now, if some relevant information is still missing, don't hesitate to ask me.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)