You are viewing a plain text version of this content. The canonical link for it is here.

Posted to reviews@mesos.apache.org by Neil Conway <ne...@gmail.com> on 2017/01/07 21:13:21 UTC

Review Request 55307: Improved handling of agents that restart but never re-register.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55307/
-----------------------------------------------------------

Review request for mesos and Vinod Kone.


Bugs: MESOS-6286
    https://issues.apache.org/jira/browse/MESOS-6286


Repository: mesos


Description
-------

The master expected that if an agent responds to pings, it will
(eventually) register or re-register. However, if the agent hangs during
recovery, that assumption does not hold: the agent will continue to
respond to pings but won't attempt to re-register until recovery
finishes.

To handle this case, the master now expects an agent to re-register
within `agent_reregister_timeout` if the master -> agent socket breaks;
if no re-registration is seen, the master will mark the agent
unreachable. This is a "backup" to handle the case where recovery hangs,
as explained above; more commonly, the agent will re-register (when it
receives a ping and notices the master believes it is disconnected) or
be marked unreachable because it fails to respond to pings.


Diffs
-----

  docs/configuration.md e4beb2d5a72f1c5f59b2e40f4984cc60b8437d9d 
  src/master/flags.cpp 737290a42c532f2349009d0a451ce271d6f107b9 
  src/master/master.hpp 57fc6e6f2995078df80f0aa52707727db802ede0 
  src/master/master.cpp 11c34a048586d30c6ac67be8638ed8fa81cc3f1f 
  src/slave/slave.cpp f8f2ccfadb9a00be17c0b552586aa5875b7cbb19 
  src/tests/master_tests.cpp 1cf4c92b2474e18771459f877b2f3c49077e8a01 
  src/tests/slave_tests.cpp d633a74d6b342239fbca0b44eec281eb315df5ff 

Diff: https://reviews.apache.org/r/55307/diff/


Testing
-------

`make check`

Ran new tests a few thousand times on OSX and Linux VM to check for flakiness.


Thanks,

Neil Conway

Re: Review Request 55307: Improved handling of agents that restart but never re-register.

Posted by Mesos ReviewBot <re...@mesos.apache.org>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55307/#review160823
-----------------------------------------------------------



Patch looks great!

Reviews applied: [55300, 55301, 55302, 55303, 55304, 55305, 55306, 55307]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh

- Mesos ReviewBot


On Jan. 7, 2017, 9:13 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55307/
> -----------------------------------------------------------
> 
> (Updated Jan. 7, 2017, 9:13 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6286
>     https://issues.apache.org/jira/browse/MESOS-6286
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The master expected that if an agent responds to pings, it will
> (eventually) register or re-register. However, if the agent hangs during
> recovery, that assumption does not hold: the agent will continue to
> respond to pings but won't attempt to re-register until recovery
> finishes.
> 
> To handle this case, the master now expects an agent to re-register
> within `agent_reregister_timeout` if the master -> agent socket breaks;
> if no re-registration is seen, the master will mark the agent
> unreachable. This is a "backup" to handle the case where recovery hangs,
> as explained above; more commonly, the agent will re-register (when it
> receives a ping and notices the master believes it is disconnected) or
> be marked unreachable because it fails to respond to pings.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md e4beb2d5a72f1c5f59b2e40f4984cc60b8437d9d 
>   src/master/flags.cpp 737290a42c532f2349009d0a451ce271d6f107b9 
>   src/master/master.hpp 57fc6e6f2995078df80f0aa52707727db802ede0 
>   src/master/master.cpp 11c34a048586d30c6ac67be8638ed8fa81cc3f1f 
>   src/slave/slave.cpp f8f2ccfadb9a00be17c0b552586aa5875b7cbb19 
>   src/tests/master_tests.cpp 1cf4c92b2474e18771459f877b2f3c49077e8a01 
>   src/tests/slave_tests.cpp d633a74d6b342239fbca0b44eec281eb315df5ff 
> 
> Diff: https://reviews.apache.org/r/55307/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Ran new tests a few thousand times on OSX and Linux VM to check for flakiness.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>

Re: Review Request 55307: Improved handling of agents that restart but never re-register.

Posted by Vinod Kone <vi...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55307/#review160955
-----------------------------------------------------------


Ship it!




Ship It!

- Vinod Kone


On Jan. 9, 2017, 7:44 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55307/
> -----------------------------------------------------------
> 
> (Updated Jan. 9, 2017, 7:44 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6286
>     https://issues.apache.org/jira/browse/MESOS-6286
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The master expected that if an agent responds to pings, it will
> (eventually) register or re-register. However, if the agent hangs during
> recovery, that assumption does not hold: the agent will continue to
> respond to pings but won't attempt to re-register until recovery
> finishes.
> 
> To handle this case, the master now expects an agent to re-register
> within `agent_reregister_timeout` if the master -> agent socket breaks;
> if no re-registration is seen, the master will mark the agent
> unreachable. This is a "backup" to handle the case where recovery hangs,
> as explained above; more commonly, the agent will re-register (when it
> receives a ping and notices the master believes it is disconnected) or
> be marked unreachable because it fails to respond to pings.
> 
> 
> Diffs
> -----
> 
>   CHANGELOG f9122fa67a5cb2094c8ab55624799bc0940fb0b0 
>   docs/configuration.md e4beb2d5a72f1c5f59b2e40f4984cc60b8437d9d 
>   src/master/flags.cpp 737290a42c532f2349009d0a451ce271d6f107b9 
>   src/master/master.hpp 57fc6e6f2995078df80f0aa52707727db802ede0 
>   src/master/master.cpp 11c34a048586d30c6ac67be8638ed8fa81cc3f1f 
>   src/slave/slave.cpp f8f2ccfadb9a00be17c0b552586aa5875b7cbb19 
>   src/tests/master_tests.cpp 1cf4c92b2474e18771459f877b2f3c49077e8a01 
>   src/tests/slave_tests.cpp d633a74d6b342239fbca0b44eec281eb315df5ff 
> 
> Diff: https://reviews.apache.org/r/55307/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Ran new tests a few thousand times on OSX and Linux VM to check for flakiness.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>

Re: Review Request 55307: Improved handling of agents that restart but never re-register.

Posted by Neil Conway <ne...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55307/
-----------------------------------------------------------

(Updated Jan. 9, 2017, 7:44 p.m.)


Review request for mesos and Vinod Kone.


Changes
-------

Address review comments.


Bugs: MESOS-6286
    https://issues.apache.org/jira/browse/MESOS-6286


Repository: mesos


Description
-------

The master expected that if an agent responds to pings, it will
(eventually) register or re-register. However, if the agent hangs during
recovery, that assumption does not hold: the agent will continue to
respond to pings but won't attempt to re-register until recovery
finishes.

To handle this case, the master now expects an agent to re-register
within `agent_reregister_timeout` if the master -> agent socket breaks;
if no re-registration is seen, the master will mark the agent
unreachable. This is a "backup" to handle the case where recovery hangs,
as explained above; more commonly, the agent will re-register (when it
receives a ping and notices the master believes it is disconnected) or
be marked unreachable because it fails to respond to pings.


Diffs (updated)
-----

  CHANGELOG f9122fa67a5cb2094c8ab55624799bc0940fb0b0 
  docs/configuration.md e4beb2d5a72f1c5f59b2e40f4984cc60b8437d9d 
  src/master/flags.cpp 737290a42c532f2349009d0a451ce271d6f107b9 
  src/master/master.hpp 57fc6e6f2995078df80f0aa52707727db802ede0 
  src/master/master.cpp 11c34a048586d30c6ac67be8638ed8fa81cc3f1f 
  src/slave/slave.cpp f8f2ccfadb9a00be17c0b552586aa5875b7cbb19 
  src/tests/master_tests.cpp 1cf4c92b2474e18771459f877b2f3c49077e8a01 
  src/tests/slave_tests.cpp d633a74d6b342239fbca0b44eec281eb315df5ff 

Diff: https://reviews.apache.org/r/55307/diff/


Testing
-------

`make check`

Ran new tests a few thousand times on OSX and Linux VM to check for flakiness.


Thanks,

Neil Conway

Re: Review Request 55307: Improved handling of agents that restart but never re-register.

Posted by Neil Conway <ne...@gmail.com>.


> On Jan. 9, 2017, 3:17 a.m., Vinod Kone wrote:
> > src/master/master.cpp, lines 1368-1370
> > <https://reviews.apache.org/r/55307/diff/1/?file=1599464#file1599464line1368>
> >
> >     just like "health check time out", can this be succinct? maybe "re-registration time out"?

I think being a bit more verbose is warranted here -- e.g., pointing out that we previously observed the agent disconnecting, which is why we expected it to re-register.


> On Jan. 9, 2017, 3:17 a.m., Vinod Kone wrote:
> > src/tests/master_tests.cpp, lines 6105-6117
> > <https://reviews.apache.org/r/55307/diff/1/?file=1599466#file1599466line6105>
> >
> >     Instead of advancing the clock for the recovery to finish, why not just not advance it. That way you don't have to do the registration drops either?
> >     
> >     also, you only do `detector.appoint` way below; does agent even send re-registration messages without that here? i guess it does if the master pid didn't change?

Per offline discussion, we can't not-advance the clock because we need to advance the clock below to cause `agent_reregister_timeout` to expire. We also need to advance the clock if we want to trigger a master -> agent ping, which is a useful thing to do (since we want to verify that the agent continues to receive and respond to pings without having finished recovery).


- Neil


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55307/#review160859
-----------------------------------------------------------


On Jan. 7, 2017, 9:13 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55307/
> -----------------------------------------------------------
> 
> (Updated Jan. 7, 2017, 9:13 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6286
>     https://issues.apache.org/jira/browse/MESOS-6286
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The master expected that if an agent responds to pings, it will
> (eventually) register or re-register. However, if the agent hangs during
> recovery, that assumption does not hold: the agent will continue to
> respond to pings but won't attempt to re-register until recovery
> finishes.
> 
> To handle this case, the master now expects an agent to re-register
> within `agent_reregister_timeout` if the master -> agent socket breaks;
> if no re-registration is seen, the master will mark the agent
> unreachable. This is a "backup" to handle the case where recovery hangs,
> as explained above; more commonly, the agent will re-register (when it
> receives a ping and notices the master believes it is disconnected) or
> be marked unreachable because it fails to respond to pings.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md e4beb2d5a72f1c5f59b2e40f4984cc60b8437d9d 
>   src/master/flags.cpp 737290a42c532f2349009d0a451ce271d6f107b9 
>   src/master/master.hpp 57fc6e6f2995078df80f0aa52707727db802ede0 
>   src/master/master.cpp 11c34a048586d30c6ac67be8638ed8fa81cc3f1f 
>   src/slave/slave.cpp f8f2ccfadb9a00be17c0b552586aa5875b7cbb19 
>   src/tests/master_tests.cpp 1cf4c92b2474e18771459f877b2f3c49077e8a01 
>   src/tests/slave_tests.cpp d633a74d6b342239fbca0b44eec281eb315df5ff 
> 
> Diff: https://reviews.apache.org/r/55307/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Ran new tests a few thousand times on OSX and Linux VM to check for flakiness.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>

Re: Review Request 55307: Improved handling of agents that restart but never re-register.

Posted by Vinod Kone <vi...@gmail.com>.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/55307/#review160859
-----------------------------------------------------------




src/master/flags.cpp (line 127)
<https://reviews.apache.org/r/55307/#comment232086>

    s/each/an/ ?



src/master/master.cpp (line 1333)
<https://reviews.apache.org/r/55307/#comment232088>

    s/limiter/limited/



src/master/master.cpp (line 1339)
<https://reviews.apache.org/r/55307/#comment232089>

    can you do "<< *slave" here?



src/master/master.cpp (lines 1368 - 1370)
<https://reviews.apache.org/r/55307/#comment232090>

    just like "health check time out", can this be succinct? maybe "re-registration time out"?



src/slave/slave.cpp (line 4282)
<https://reviews.apache.org/r/55307/#comment232091>

    I think this is worth calling out in the CHANGELOG because it is a significant change in the behavior.



src/tests/master_tests.cpp (lines 6105 - 6117)
<https://reviews.apache.org/r/55307/#comment232092>

    Instead of advancing the clock for the recovery to finish, why not just not advance it. That way you don't have to do the registration drops either?
    
    also, you only do `detector.appoint` way below; does agent even send re-registration messages without that here? i guess it does if the master pid didn't change?



src/tests/master_tests.cpp (lines 6270 - 6276)
<https://reviews.apache.org/r/55307/#comment232095>

    see above.


- Vinod Kone


On Jan. 7, 2017, 9:13 p.m., Neil Conway wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/55307/
> -----------------------------------------------------------
> 
> (Updated Jan. 7, 2017, 9:13 p.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-6286
>     https://issues.apache.org/jira/browse/MESOS-6286
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> The master expected that if an agent responds to pings, it will
> (eventually) register or re-register. However, if the agent hangs during
> recovery, that assumption does not hold: the agent will continue to
> respond to pings but won't attempt to re-register until recovery
> finishes.
> 
> To handle this case, the master now expects an agent to re-register
> within `agent_reregister_timeout` if the master -> agent socket breaks;
> if no re-registration is seen, the master will mark the agent
> unreachable. This is a "backup" to handle the case where recovery hangs,
> as explained above; more commonly, the agent will re-register (when it
> receives a ping and notices the master believes it is disconnected) or
> be marked unreachable because it fails to respond to pings.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md e4beb2d5a72f1c5f59b2e40f4984cc60b8437d9d 
>   src/master/flags.cpp 737290a42c532f2349009d0a451ce271d6f107b9 
>   src/master/master.hpp 57fc6e6f2995078df80f0aa52707727db802ede0 
>   src/master/master.cpp 11c34a048586d30c6ac67be8638ed8fa81cc3f1f 
>   src/slave/slave.cpp f8f2ccfadb9a00be17c0b552586aa5875b7cbb19 
>   src/tests/master_tests.cpp 1cf4c92b2474e18771459f877b2f3c49077e8a01 
>   src/tests/slave_tests.cpp d633a74d6b342239fbca0b44eec281eb315df5ff 
> 
> Diff: https://reviews.apache.org/r/55307/diff/
> 
> 
> Testing
> -------
> 
> `make check`
> 
> Ran new tests a few thousand times on OSX and Linux VM to check for flakiness.
> 
> 
> Thanks,
> 
> Neil Conway
> 
>