Posted to issues@mesos.apache.org by "Steven Schlansker (JIRA)" <ji...@apache.org> on 2015/05/01 23:36:06 UTC

[jira] [Comment Edited] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure

    [ https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14524006#comment-14524006 ] 

Steven Schlansker edited comment on MESOS-2684 at 5/1/15 9:35 PM:
------------------------------------------------------------------

I've attached the log from the slave restart.  The FATAL error above was the last line written before the abort; the attachment is the head of the new log file created on restart.  I misspoke about LOST; it was actually KILLED.


was (Author: stevenschlansker):
I've attached the log from slave restart.  The FATAL error above was the last line written before the abort, this is the head of the new log file created on restart.

> mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
> --------------------------------------------------------------------------
>
>                 Key: MESOS-2684
>                 URL: https://issues.apache.org/jira/browse/MESOS-2684
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.21.1
>            Reporter: Steven Schlansker
>         Attachments: mesos-slave-restart.txt
>
>
> mesos-slave can encounter a variety of problems while attempting to launch a task.  If the task fails, that is unfortunate, but not the end of the world.  Other tasks should not be affected.
> However, if the task failure happens to trigger an assertion, the entire slave comes crashing down:
> F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left on device Failed to create executor directory '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'
> Immediately afterwards, all tasks on this slave were declared TASK_LOST when mesos-slave restarted.
> Something as simple as a 'mkdir' failing is not worthy of an assertion failure.
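
To illustrate the behavior requested in the description above, here is a minimal C++ sketch of turning a directory-creation failure into a per-task failure instead of a process abort.  It is not the Mesos implementation: tryCreateRunDirectory, launchTask and LaunchResult are hypothetical names invented for this example, and it uses std::filesystem rather than Mesos' own helpers.

```cpp
#include <filesystem>
#include <iostream>
#include <optional>
#include <string>
#include <system_error>

// Hypothetical helper (not a Mesos API): try to create the executor run
// directory and return an error message on failure instead of aborting.
std::optional<std::string> tryCreateRunDirectory(
    const std::filesystem::path& directory)
{
  std::error_code ec;

  // The error_code overload reports failures such as "No space left on
  // device" through 'ec' rather than throwing or terminating the process.
  std::filesystem::create_directories(directory, ec);

  if (ec) {
    return "Failed to create executor directory '" + directory.string() +
           "': " + ec.message();
  }

  return std::nullopt;
}

// Hypothetical outcome of a single task launch.
enum class LaunchResult { LAUNCHED, FAILED };

// Hypothetical launch path: a failed mkdir fails only this task, so other
// tasks running in the same process are unaffected.
LaunchResult launchTask(const std::filesystem::path& runDirectory)
{
  if (std::optional<std::string> error = tryCreateRunDirectory(runDirectory)) {
    std::cerr << *error << std::endl;
    return LaunchResult::FAILED;
  }

  return LaunchResult::LAUNCHED;
}
```

In this sketch the caller decides what a FAILED result means, for example reporting a failed status for that one task, while the process keeps running.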


