Posted to reviews@mesos.apache.org by Neil Conway <ne...@gmail.com> on 2016/12/07 20:04:13 UTC
Review Request 54495: Ensured master always relinks during scheduler
re-registration.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54495/
-----------------------------------------------------------
Review request for mesos and Vinod Kone.
Bugs: MESOS-6676
https://issues.apache.org/jira/browse/MESOS-6676
Repository: mesos
Description
-------
In the following scenario:
* the master sees a re-registration attempt from a PID-based scheduler,
* the scheduler was previously registered with the master, and
* the "force" flag is not set,
the master neglected to re-link with the scheduler. For example, this
might happen if:
* The master sees an ExitedEvent for the framework and marks it
disconnected.
* The master sends a FrameworkErrorMessage to the framework but this
message is dropped, e.g., due to a transient network failure.
* The scheduler attempts to re-register with the master, e.g., because
it detects (spuriously) that the current leading master has changed.
This is problematic, because it might leave the master -> scheduler
connection using an ephemeral socket.
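To make the scenario above concrete, here is a minimal toy model of the link bookkeeping. It is only a sketch: `ToyMaster` and all of its members are hypothetical names invented for illustration, not the Mesos or libprocess API, and the `alwaysRelink` switch merely contrasts the old behaviour with the behaviour this patch aims for.

```cpp
#include <cassert>
#include <set>
#include <string>

// Toy model only: these types are hypothetical and are NOT the Mesos or
// libprocess API. They track just enough state to show the problem.
struct ToyMaster {
  std::set<std::string> registered;  // Schedulers the master knows about.
  std::set<std::string> connected;   // Schedulers marked as connected.
  std::set<std::string> linked;      // Schedulers with a persistent link.

  // First registration: the master links with the scheduler.
  void registerScheduler(const std::string& pid) {
    registered.insert(pid);
    connected.insert(pid);
    linked.insert(pid);
  }

  // The link broke (models an ExitedEvent): mark the scheduler
  // disconnected and forget the link.
  void exited(const std::string& pid) {
    connected.erase(pid);
    linked.erase(pid);
  }

  // Re-registration of a previously registered scheduler without "force".
  // `alwaysRelink == false` models the old behaviour described above;
  // `alwaysRelink == true` models the intended fix.
  void reregisterScheduler(const std::string& pid, bool alwaysRelink) {
    assert(registered.count(pid) == 1);
    connected.insert(pid);
    if (alwaysRelink) {
      linked.insert(pid);
    }
  }

  // True if a connected scheduler has no persistent link, i.e. the
  // master would talk to it over an ephemeral socket.
  bool usesEphemeralSocket(const std::string& pid) const {
    return connected.count(pid) == 1 && linked.count(pid) == 0;
  }
};
```

In this model, registering a scheduler, calling `exited()` for it, and then calling `reregisterScheduler()` with `alwaysRelink == false` leaves `usesEphemeralSocket()` returning true; with `alwaysRelink == true` it returns false.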
Diffs
-----
src/master/master.cpp 67f32229470da4cf7953881d1c5dcb99393002de
Diff: https://reviews.apache.org/r/54495/diff/
Testing
-------
`make check`
Note that it would be _great_ to write a unit test for this situation (as well as a class of related failure conditions), but the current testing infrastructure doesn't make that easy.
Thanks,
Neil Conway
Re: Review Request 54495: Ensured master always relinks during
scheduler re-registration.
Posted by Neil Conway <ne...@gmail.com>.
> On Dec. 7, 2016, 9:53 p.m., Joseph Wu wrote:
> > src/master/master.cpp, lines 2841-2843
> > <https://reviews.apache.org/r/54495/diff/1/?file=1579042#file1579042line2841>
> >
> > Do you want to force a relink too?
> >
> > i.e. give this as the second argument: `process::RemoteConnection::RECONNECT`
Per discussion on Slack with Joseph, it seems we don't need to force a reconnect here, because the master will promptly send a (re-)registered message to the framework. If the socket is half-open, that send should eventually fail with a socket error. The failure will produce another `exited` event, at which point we'll correctly mark the framework as disconnected again and send it another `FrameworkErrorMessage`.
- Neil
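The reasoning in the reply above can be sketched as a tiny state machine. This is an illustration under stated assumptions, not libprocess code: `ToyLink`, `SocketState`, and the event strings are hypothetical, and the model reports the send failure immediately, whereas a real half-open socket would only surface the error eventually.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy model only; hypothetical names, not the libprocess API.
enum class SocketState { HEALTHY, HALF_OPEN };

struct ToyLink {
  SocketState socket;
  bool frameworkConnected;
  std::vector<std::string> events;  // What the master observes, in order.

  // Sending on a half-open socket fails. The failure surfaces as an
  // `exited` event, after which the master marks the framework
  // disconnected and sends it an error message.
  void send(const std::string& message) {
    if (socket == SocketState::HALF_OPEN) {
      events.push_back("exited");
      frameworkConnected = false;
      events.push_back("sent FrameworkErrorMessage");
      return;
    }
    events.push_back("delivered " + message);
  }
};
```

With a `HALF_OPEN` socket, a single `send()` leaves `frameworkConnected` false and records `exited` followed by `sent FrameworkErrorMessage`, which is the recovery path described above; with a `HEALTHY` socket the message is simply recorded as delivered.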
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54495/#review158408
-----------------------------------------------------------
Re: Review Request 54495: Ensured master always relinks during
scheduler re-registration.
Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54495/#review158408
-----------------------------------------------------------
src/master/master.cpp (lines 2841 - 2843)
<https://reviews.apache.org/r/54495/#comment229174>
Do you want to force a relink too?
i.e. give this as the second argument: `process::RemoteConnection::RECONNECT`
- Joseph Wu
Re: Review Request 54495: Ensured master always relinks during
scheduler re-registration.
Posted by Vinod Kone <vi...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54495/#review158457
-----------------------------------------------------------
Ship It!
- Vinod Kone