Posted to issues@mesos.apache.org by "Greg Mann (JIRA)" <ji...@apache.org> on 2016/04/18 22:38:25 UTC
[jira] [Comment Edited] (MESOS-5180) Scheduler driver does not detect disconnection with master and reregister.

    [ https://issues.apache.org/jira/browse/MESOS-5180?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15246371#comment-15246371 ] 

Greg Mann edited comment on MESOS-5180 at 4/18/16 8:38 PM:
-----------------------------------------------------------

We're currently running into this in a long-running cluster with Mesos and Marathon. The master logs show the moment when Marathon disconnects:
{code}
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393314 21960 master.cpp:1275] Framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 disconnected
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393350 21960 master.cpp:2658] Disconnecting framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393373 21960 master.cpp:2682] Deactivating framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393434 21960 master.hpp:1825] Master attempted to send message to disconnected framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393440 21958 hierarchical.cpp:375] Deactivated framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393635 21960 master.hpp:1825] Master attempted to send message to disconnected framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393723 21960 master.hpp:1825] Master attempted to send message to disconnected framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393815 21960 master.hpp:1825] Master attempted to send message to disconnected framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393875 21960 master.cpp:1299] Giving framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 1weeks to failover
{code}

But the Marathon logs from around the same time give no indication that the scheduler has disconnected. Marathon continues to receive task status updates but, as expected for a framework the master considers disconnected, it no longer receives offers.

It would be great if the master's logging messages could provide more information about the disconnection when it occurs, if possible.


was (Author: greggomann):
We're currently running into this in a long-running cluster with Mesos and Marathon. The master logs show the moment when Marathon disconnects:
{code}
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393314 21960 master.cpp:1275] Framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 disconnected
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393350 21960 master.cpp:2658] Disconnecting framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393373 21960 master.cpp:2682] Deactivating framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393434 21960 master.hpp:1825] Master attempted to send message to disconnected framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393440 21958 hierarchical.cpp:375] Deactivated framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393635 21960 master.hpp:1825] Master attempted to send message to disconnected framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393723 21960 master.hpp:1825] Master attempted to send message to disconnected framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: W0418 18:07:20.393815 21960 master.hpp:1825] Master attempted to send message to disconnected framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114
Apr 18 18:07:20 ip-10-10-0-206 mesos-master[21951]: I0418 18:07:20.393875 21960 master.cpp:1299] Giving framework 29b0cddb-f239-47cd-9d43-84624751d5ad-0000 (marathon) at scheduler-c13a237a-610e-43ae-89a7-31c3831f5471@10.10.0.210:36114 1weeks to failover
{code}

But looking in the Marathon logs around the same time doesn't yield an indication that the scheduler has disconnected. It continues to receive task status updates, but doesn't seem to be receiving offers.

It would be great if the master's logging messages could provide more information about the disconnection when it occurs, if possible.

> Scheduler driver does not detect disconnection with master and reregister.
> --------------------------------------------------------------------------
>
>                 Key: MESOS-5180
>                 URL: https://issues.apache.org/jira/browse/MESOS-5180
>             Project: Mesos
>          Issue Type: Bug
>          Components: scheduler driver
>    Affects Versions: 0.24.0
>            Reporter: Joseph Wu
>            Assignee: Anand Mazumdar
>              Labels: mesosphere
>
> The existing implementation of the scheduler driver does not re-register with the master in some network partition cases.
> When a scheduler registers with the master:
> 1) master links to the framework
> 2) framework links to the master
> It is possible for either of these links to break *without* the master changing. (Currently, the scheduler driver will only re-register if the master changes.)
> If both links break or if just link (1) breaks, the master views the framework as {{inactive}} and {{disconnected}}.  This means the framework will not receive any more events (such as offers) from the master until it re-registers.  There is currently no way for the scheduler to detect a one-way link breakage.
> If link (2) breaks, it makes (almost) no difference to the scheduler.  The scheduler usually uses the link to send messages to the master, but libprocess will create another socket if the persistent one is not available.
> To fix link breakages for (1+2) and (2), the scheduler driver should implement an {{::exited}} event handler for the master's {{pid}} and trigger a master (re-)detection upon a disconnection. This in turn should make the driver (re-)register with the master. The scheduler library already does this: https://github.com/apache/mesos/blob/master/src/scheduler/scheduler.cpp#L395
> See the related issue MESOS-5181 for link (1) breakage.
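
The link-breakage cases in the description above can be illustrated with a toy model. This is only a sketch under stated assumptions, not Mesos or libprocess code: the `Master` and `SchedulerDriver` classes, their methods, and the `reregister_on_exited` switch are all hypothetical names invented for the illustration. It assumes the master stops sending offers once link (1) breaks, and that the driver only learns about a broken link (2) through an exited-style callback.

```python
from dataclasses import dataclass


@dataclass
class Master:
    """Toy master (hypothetical): tracks whether link (1), master -> framework, is up."""
    framework_connected: bool = False

    def register(self) -> None:
        # A (re-)registration re-establishes link (1) and reactivates the framework.
        self.framework_connected = True

    def link_to_framework_broke(self) -> None:
        # Link (1) broke: the master now views the framework as disconnected.
        self.framework_connected = False

    def send_offer(self) -> bool:
        # Offers are only sent to frameworks the master considers connected.
        return self.framework_connected


@dataclass
class SchedulerDriver:
    """Toy driver (hypothetical): owns link (2), framework -> master."""
    master: Master
    reregister_on_exited: bool  # models the fix proposed in the description
    registrations: int = 0

    def register(self) -> None:
        self.master.register()
        self.registrations += 1

    def on_exited(self) -> None:
        # Invoked when link (2) is observed to be broken.
        if self.reregister_on_exited:
            # Proposed behaviour: re-detect the master and re-register.
            self.register()


def offers_flow_after(break_link_1: bool, break_link_2: bool, fix_enabled: bool) -> bool:
    """Return True if the framework still receives offers after the breakage."""
    master = Master()
    driver = SchedulerDriver(master, reregister_on_exited=fix_enabled)
    driver.register()
    if break_link_1:
        master.link_to_framework_broke()
    if break_link_2:
        driver.on_exited()
    return master.send_offer()


if __name__ == "__main__":
    for link_1, link_2 in [(True, True), (True, False), (False, True)]:
        for fix in (False, True):
            print(f"link1 broken={link_1} link2 broken={link_2} fix={fix} "
                  f"-> offers flow: {offers_flow_after(link_1, link_2, fix)}")
```

In this model the handler restores offers when both links break, makes no visible difference when only link (2) breaks, and cannot help when only link (1) breaks, because the driver never observes that event. That last case is the one the description defers to MESOS-5181.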



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)