Posted to issues@mesos.apache.org by "Scott D.W. Rankin (JIRA)" <ji...@apache.org> on 2015/08/26 16:16:49 UTC

[jira] [Commented] (MESOS-2684) mesos-slave should not abort when a single task has e.g. a 'mkdir' failure

    [ https://issues.apache.org/jira/browse/MESOS-2684?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14713497#comment-14713497 ] 

Scott D.W. Rankin commented on MESOS-2684:
------------------------------------------

Hi all, I'm seeing this issue as well. We're running Marathon 0.8.2 and Mesos 0.22.1 on CentOS 6.6, and we regularly get errors similar to the one pasted below. We can't reproduce it reliably, but it happens when initiating a deployment from Marathon.

26 Aug 2015 09:35:01.213  host=mesosnode6-aws-west tag=mesos-slave[30248]:  F0826 06:35:01.136056 30280 slave.cpp:3354] CHECK_SOME(os::touch(path)): Failed to open file: No such file or directory
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:  *** Check failure stack trace: ***
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x3de1e765cd  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x3de1e7a5e7  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x3de1e78469  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x3de1e7876d  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x3de17c5696  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x3de1a1855a  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x3de1a1c0a9  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x3de1a510ff  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x3de1e18b83  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x3de1e1978c  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x39d58079d1  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=mesos-slave[30248]:      @       0x39d54e88fd  (unknown)
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=init:  mesos-slave main process (30248) killed by ABRT signal
26 Aug 2015 09:35:01.369  host=mesosnode6-aws-west tag=init:  mesos-slave main process ended, respawning


> mesos-slave should not abort when a single task has e.g. a 'mkdir' failure
> --------------------------------------------------------------------------
>
>                 Key: MESOS-2684
>                 URL: https://issues.apache.org/jira/browse/MESOS-2684
>             Project: Mesos
>          Issue Type: Bug
>          Components: slave
>    Affects Versions: 0.21.1
>            Reporter: Steven Schlansker
>         Attachments: mesos-slave-restart.txt
>
>
> mesos-slave can encounter a variety of problems while attempting to launch a task.  If the task fails, that is unfortunate, but not the end of the world.  Other tasks should not be affected.
> However, if the task failure happens to trigger an assertion, the entire slave comes crashing down:
> F0501 19:10:46.095464  1705 paths.hpp:342] CHECK_SOME(mkdir): No space left on device Failed to create executor directory '/mnt/mesos/slaves/20150327-194449-419644938-5050-1649-S71/frameworks/Singularity/executors/pp-gc-eventlog-teamcity.2015.03.31T23.55.14-1430507446029-2-10.70.8.160-us_west_2b/runs/95a54aeb-322c-48e9-9f6f-5b359bccbc01'
> Immediately afterwards, all tasks on this slave were declared TASK_KILLED when mesos-slave restarted.
> Something as simple as a 'mkdir' failing is not worthy of an assertion failure.


