Posted to issues@mesos.apache.org by "Zach Carlson (JIRA)" <ji...@apache.org> on 2014/11/18 00:30:33 UTC

[jira] [Updated] (MESOS-2122) MesosSchedulerDriver stop causes resource offer exhaustion

     [ https://issues.apache.org/jira/browse/MESOS-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Zach Carlson updated MESOS-2122:
--------------------------------
    Affects Version/s: 0.20.1

> MesosSchedulerDriver stop causes resource offer exhaustion
> ----------------------------------------------------------
>
>                 Key: MESOS-2122
>                 URL: https://issues.apache.org/jira/browse/MESOS-2122
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.0, 0.21.0, 0.20.1
>         Environment: x86_64 Debian Wheezy (w/ mesosphere repos, packages)
>            Reporter: Zach Carlson
>
> For additional consideration, see https://github.com/airbnb/chronos/issues/290 and https://github.com/mesosphere/marathon/issues/787
> When the SchedulerProcess managed by the MesosSchedulerDriver detects a master, it performs a link() to the master. Libprocess proceeds to establish the link. Once the scheduler has performed all the work necessary, it may call MesosSchedulerDriver.stop(failover = true). 
> This is where things go awry. At this point, the SchedulerProcess schedules a termination event for itself. When libprocess's schedule thread rolls through, it performs a cleanup() of the SchedulerProcess, as expected. Part of the cleanup() is calling SocketManager::exited() on the SchedulerProcess. The problem is that SocketManager::exited() removes the links from the link map but does not actually close the sockets.
> Since MesosSchedulerDriver::stop() was called with failover = true, no DeregisterFramework message was sent, so the Mesos master believes that the connection (which is still open) belongs to a registered framework that is listening for events. It sends resourceOffers to the 'valid' framework, and since nothing is actually listening, no response is sent and no offers are accepted or declined. Mesos then grinds to a halt: no further offers are made to any framework, and once all current framework work has completed, no further work is performed because the offers are wasted. (This holds until version 0.21.0, which, according to the release notes, will rescind unanswered offers after a configurable timeout.)
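The failure mode described above can be sketched as a small toy model. This is not libprocess or Mesos code: every class and function name here (ToySocket, ToySocketManager, ToyMaster, simulate) is hypothetical, and the assumption that a closed socket lets the master reclaim its outstanding offers is a simplification made for illustration only.

```python
class ToySocket:
    """Stand-in for a TCP connection; only tracks whether it is open."""
    def __init__(self):
        self.open = True

    def close(self):
        self.open = False


class ToySocketManager:
    """Hypothetical model of a link map plus the sockets behind the links."""
    def __init__(self):
        self.links = {}    # local pid -> set of linked remote pids
        self.sockets = {}  # remote pid -> ToySocket

    def link(self, pid, remote):
        self.links.setdefault(pid, set()).add(remote)
        self.sockets.setdefault(remote, ToySocket())

    def exited(self, pid, close_sockets=False):
        # Reported behaviour: the link map entry is removed, but the
        # socket stays open unless close_sockets is requested.
        remotes = self.links.pop(pid, set())
        if close_sockets:
            for remote in remotes:
                self.sockets[remote].close()


class ToyMaster:
    """Hands every available offer to the framework while its socket is open."""
    def __init__(self, total_offers):
        self.available = total_offers
        self.outstanding = 0

    def offer_cycle(self, sock):
        if not sock.open:
            # Assumption: a closed connection lets the master take offers back.
            self.available += self.outstanding
            self.outstanding = 0
            return
        self.outstanding += self.available
        self.available = 0


def simulate(close_sockets):
    manager = ToySocketManager()
    manager.link("scheduler", "master")
    sock = manager.sockets["master"]
    master = ToyMaster(total_offers=4)

    master.offer_cycle(sock)  # offers go out to the live scheduler
    # Models stop(failover = true): no deregistration, only cleanup.
    manager.exited("scheduler", close_sockets=close_sockets)
    master.offer_cycle(sock)  # master's next allocation pass
    return master.available, master.outstanding


if __name__ == "__main__":
    print("socket left open:", simulate(close_sockets=False))
    print("socket closed:   ", simulate(close_sockets=True))
```

In this model, leaving the socket open keeps all four offers outstanding with nothing available for other frameworks, while closing the socket during cleanup returns them to the pool.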



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)