Posted to issues@mesos.apache.org by "Joseph Wu (JIRA)" <ji...@apache.org> on 2016/04/12 02:18:25 UTC

[jira] [Updated] (MESOS-5180) Scheduler driver does not detect disconnection with master and reregister.

     [ https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Wu updated MESOS-5180:
-----------------------------
    Description: 
The existing implementation of the scheduler driver does not re-register with the master under some network partition cases.

When a scheduler registers with the master:
1) master links to the framework
2) framework links to the master

It is possible for either of these links to break *without* the master changing.  (Currently, the scheduler driver will only re-register if the master changes).

If both links break or if just link (1) breaks, the master views the framework as {{inactive}} and {{disconnected}}.  This means the framework will not receive any more events (such as offers) from the master until it re-registers.  There is currently no way for the scheduler to detect a one-way link breakage.

If link (2) breaks, it makes (almost) no difference to the scheduler.  The scheduler usually uses the link to send messages to the master, but libprocess will create another socket if the persistent one is not available.

To fix link breakages for (1+2) and (2), the scheduler driver should implement a `::exited` event handler for the master's {{pid}} and re-register in this case.

See the related issue MESOS-5181 for link (1) breakage.

  was:
The existing implementation of the scheduler driver does not re-register with the master under some network partition cases.

When a scheduler registers with the master:
1) master links to the framework
2) framework links to the master

It is possible for either of these links to break *without* the master changing.  (Currently, the scheduler driver will only re-register if the master changes).

If both links break or if just link (1) breaks, the master views the framework as {{inactive}} and {{disconnected}}.  This means the framework will not receive any more events (such as offers) from the master until it re-registers.  There is currently no way for the scheduler to detect a one-way link breakage.

if link (2) breaks, it makes (almost) no difference to the scheduler.  The scheduler usually uses the link to send messages to the master, but libprocess will create another socket if the persistent one is not available.

To fix link breakages for (1+2) and (2), the scheduler driver should implement a `::exited` event handler for the master's {{pid}} and re-register in this case.

See the related issue [TODO] for link (1) breakage.


> Scheduler driver does not detect disconnection with master and reregister.
> --------------------------------------------------------------------------
>
>                 Key: MESOS-5180
>                 URL: https://issues.apache.org/jira/browse/MESOS-5180
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.0
>            Reporter: Joseph Wu
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with the master under some network partition cases.
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
> It is possible for either of these links to break *without* the master changing.  (Currently, the scheduler driver will only re-register if the master changes).
> If both links break or if just link (1) breaks, the master views the framework as {{inactive}} and {{disconnected}}.  This means the framework will not receive any more events (such as offers) from the master until it re-registers.  There is currently no way for the scheduler to detect a one-way link breakage.
> If link (2) breaks, it makes (almost) no difference to the scheduler.  The scheduler usually uses the link to send messages to the master, but libprocess will create another socket if the persistent one is not available.
> To fix link breakages for (1+2) and (2), the scheduler driver should implement a `::exited` event handler for the master's {{pid}} and re-register in this case.
> See the related issue MESOS-5181 for link (1) breakage.
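
The proposed fix (re-register when the link to the current master's {{pid}} is reported broken, even though the master itself has not changed) could be sketched roughly as follows. This is a minimal, self-contained model and not the actual driver code: it does not use libprocess, and {{Pid}}, {{SchedulerDriverSketch}}, {{detected}}, {{reregister}} and the counters are hypothetical names chosen for illustration. It also assumes re-registration succeeds immediately, whereas a real driver would send a message and wait for the master's acknowledgement.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a process identifier (not the libprocess type).
struct Pid {
  std::string address;
  bool operator==(const Pid& other) const { return address == other.address; }
};

// Hypothetical, simplified model of the scheduler driver's connection state.
class SchedulerDriverSketch {
public:
  // Existing behaviour described in the issue: (re-)register when a
  // master is detected or the master changes.
  void detected(const Pid& master) {
    master_ = master;
    hasMaster_ = true;
    reregister();
  }

  // Proposed behaviour: when the link to `pid` is reported broken and
  // `pid` is the current master, re-register even though the master
  // has not changed.
  void exited(const Pid& pid) {
    if (!hasMaster_ || !(pid == master_)) {
      return;  // Not the current master; nothing to do.
    }
    connected_ = false;
    reregister();
  }

  bool connected() const { return connected_; }
  int registrations() const { return registrations_; }

private:
  // Assumption: registration always succeeds at once in this sketch.
  void reregister() {
    assert(hasMaster_);
    ++registrations_;
    connected_ = true;
  }

  Pid master_;
  bool hasMaster_ = false;
  bool connected_ = false;
  int registrations_ = 0;
};
```

In this model, calling {{detected(master)}} followed by {{exited(master)}} yields two registrations, while an {{exited}} event for any other pid is ignored. It only covers breakages the scheduler side can observe, i.e. cases (1+2) and (2); a break of link (1) alone is not visible here, which is the subject of MESOS-5181.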



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)