Posted to issues@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2014/11/19 21:11:34 UTC

[jira] [Commented] (MESOS-2122) MesosSchedulerDriver stop causes resource offer exhaustion

    [ https://issues.apache.org/jira/browse/MESOS-2122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14218426#comment-14218426 ] 

Benjamin Mahler commented on MESOS-2122:
----------------------------------------

Hm, I can't seem to dig up the ticket related to this. It's a long-standing limitation in libprocess: we don't send a termination notification when a Process terminates but libprocess itself stays up:
https://github.com/apache/mesos/blob/0.21.0/3rdparty/libprocess/TODO#L9

To my knowledge, we've been skirting this issue because, for the most part, frameworks are failing over when they call stop(failover=true), at which point a new instantiation of the framework re-registers and we treat the old one as having gone away.

> MesosSchedulerDriver stop causes resource offer exhaustion
> ----------------------------------------------------------
>
>                 Key: MESOS-2122
>                 URL: https://issues.apache.org/jira/browse/MESOS-2122
>             Project: Mesos
>          Issue Type: Bug
>    Affects Versions: 0.20.0, 0.20.1
>         Environment: x86_64 Debian Wheezy (w/ mesosphere repos, packages)
>            Reporter: Zach Carlson
>         Attachments: mesos_2122.py
>
>
> For additional consideration, see https://github.com/airbnb/chronos/issues/290 and https://github.com/mesosphere/marathon/issues/787
> When the SchedulerProcess managed by the MesosSchedulerDriver detects a master, it performs a link() to the master. Libprocess proceeds to establish the link. Once the scheduler has performed all the work necessary, it may call MesosSchedulerDriver.stop(failover = true). 
> This is where things go awry. At this point, the SchedulerProcess schedules a termination event for itself. When libprocess's schedule thread rolls through, it performs a cleanup() of the SchedulerProcess, as expected. Part of the cleanup() is calling SocketManager::exited() on the SchedulerProcess. The problem is that SocketManager::exited() removes the links from the link map but does not actually close the sockets.
> Since MesosSchedulerDriver::stop() was called with failover = true, no DeregisterFramework message was sent, so the Mesos master believes that the connection (which is still active) belongs to a registered framework that is listening for events. It sends resourceOffers to the 'valid' framework, and since nothing is actually listening for events, no response is sent and no offers are accepted or declined. Mesos then grinds to a halt: no further offers are made to any framework, and once all current framework work has completed, no further work is performed because the offers are being wasted. This holds until version 0.21.0, which (according to the release notes) will rescind unanswered offers after a configurable timeout.
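
For readers trying to picture the failure mode described above, here is a toy model in Python. It is only a sketch: ToySocket, ToySocketManager, exited_and_close() and master_sees_connection() are hypothetical names made up for illustration, not the libprocess API, and exited_and_close() is one conceivable alternative behavior, not the actual Mesos fix. The point is just that dropping entries from a link map leaves the underlying socket open, so the remote side still sees a live connection.

```python
# Toy model only: every name here is hypothetical and does not mirror
# the real libprocess SocketManager implementation.

class ToySocket:
    def __init__(self, peer):
        self.peer = peer
        self.open = True

    def close(self):
        self.open = False


class ToySocketManager:
    def __init__(self):
        self.sockets = {}  # peer -> ToySocket
        self.links = {}    # peer -> set of local process ids linked to it

    def link(self, pid, peer):
        # Establish (or reuse) a connection and record who is linked to it.
        if peer not in self.sockets:
            self.sockets[peer] = ToySocket(peer)
        self.links.setdefault(peer, set()).add(pid)

    def exited(self, pid):
        # Behavior described in the report: forget the links, keep the socket.
        for peer in list(self.links):
            self.links[peer].discard(pid)
            if not self.links[peer]:
                del self.links[peer]

    def exited_and_close(self, pid):
        # Assumed alternative: also close sockets that have no linkers left.
        self.exited(pid)
        for peer, sock in self.sockets.items():
            if peer not in self.links:
                sock.close()


def master_sees_connection(manager, peer):
    # The remote side only observes whether the socket is still open.
    sock = manager.sockets.get(peer)
    return sock is not None and sock.open


manager = ToySocketManager()
manager.link("scheduler(1)", "master")
manager.exited("scheduler(1)")
# The link map is empty, yet the connection still looks alive to the master.
print(manager.links, master_sees_connection(manager, "master"))
```

In this model the master has no signal that the scheduler is gone, which matches the reported symptom of offers being sent to a framework that never answers.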


