
Posted to issues@mesos.apache.org by "Lukas Loesche (JIRA)" <ji...@apache.org> on 2015/02/25 01:07:05 UTC

[jira] [Commented] (MESOS-2299) default work_dir of /tmp/mesos is problematic

    [ https://issues.apache.org/jira/browse/MESOS-2299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14335698#comment-14335698 ] 

Lukas Loesche commented on MESOS-2299:
--------------------------------------

What I saw was this:
6 slaves and 3 masters, with 3 of the slaves running on the same hosts as the masters.
An idle app (tail -f /dev/null) running through Marathon with 20 tasks.
I hard-reset all 6 nodes at the same time.
When the nodes came back up, Marathon showed the 20 tasks as still running even though they weren't. Mesos didn't show them as running.
It took 10 minutes for the tasks to disappear in Marathon and get properly restarted.

I reported this as a Marathon bug to Dario Rexin <da...@mesosphere.io>. 
He and Alex Rukletsov <al...@mesosphere.io> investigated the cause and explained it to me like this:

Because all nodes were reset at the same time, the slaves didn't properly unregister from the masters. When they came back up, their work_dir in /tmp had been wiped, so they registered as new slaves.
If I understood it correctly, there's a 10-minute timeout in Mesos for tasks (or slaves?) to get picked up by a slave again, which is why they were shown as still running in Marathon.

So if I have, e.g., a complete power outage and my cluster comes back up, it takes an additional 10 minutes before the tasks are restarted by Marathon.
Does that make sense? It's been a month since I ran into this, so I may be mixing up some details. I'll ask Dario and Alex tomorrow to confirm and will update or clarify if necessary.
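
To illustrate the work_dir part, here is a minimal sketch of a pre-flight check one could run before starting a slave. It only tests whether a configured work_dir sits under a directory that is typically wiped at boot. The function name is_volatile and the VOLATILE_DIRS list are hypothetical names made up for this example, not anything Mesos provides, and the assumption that only /tmp is volatile may not hold on every distro.

```python
import posixpath

# Assumption: directories that are cleared at boot on many distros.
# Adjust this list for your own systems.
VOLATILE_DIRS = ("/tmp",)


def is_volatile(work_dir, volatile_dirs=VOLATILE_DIRS):
    """Return True if work_dir is one of, or lies inside, the volatile dirs.

    Hypothetical helper for illustration; expects an absolute POSIX path.
    """
    path = posixpath.normpath(work_dir)
    for d in volatile_dirs:
        d = posixpath.normpath(d)
        # Match the directory itself or anything below it, but not
        # siblings that merely share a prefix (e.g. /tmpfoo).
        if path == d or path.startswith(d + "/"):
            return True
    return False


if __name__ == "__main__":
    for candidate in ("/tmp/mesos", "/var/tmp/mesos"):
        status = "volatile" if is_volatile(candidate) else "ok"
        print(candidate, status)
```

With the list above, /tmp/mesos is flagged and /var/tmp/mesos is not, which matches the default and the alternative proposed in the issue description.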

> default work_dir of /tmp/mesos is problematic
> ---------------------------------------------
>
>                 Key: MESOS-2299
>                 URL: https://issues.apache.org/jira/browse/MESOS-2299
>             Project: Mesos
>          Issue Type: Improvement
>    Affects Versions: 0.20.1
>            Reporter: Lukas Loesche
>            Priority: Trivial
>
> Mesos uses a default of /tmp/mesos if work_dir is not defined.
> This is bad because /tmp on most distros gets wiped upon boot. Therefore, when a slave reboots, it registers as a new slave instead of re-registering, which causes problems with task reconciliation.
> A better default following the FHS would be /var/tmp/mesos, which is the temp directory that is not wiped at all, or only at much longer intervals.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)