You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Alex Clemmer <cl...@gmail.com> on 2016/12/21 03:09:34 UTC

Re: Review Request 54909: Fixed spurious registration bug in framework and agent.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54909/
-----------------------------------------------------------

(Updated Dec. 21, 2016, 3:09 a.m.)


Review request for mesos, Andrew Schwartzmeyer, Daniel Pravat, John Kordich, and Joseph Wu.


Changes
-------

Let's fix the spurious framework registration bug while we're at it! :)


Summary (updated)
-----------------

Fixed spurious registration bug in framework and agent.


Bugs: MESOS-6803
    https://issues.apache.org/jira/browse/MESOS-6803


Repository: mesos


Description (updated)
-------

This issue appears when `HAS_AUTHENTICATION` is not defined. This commit
will resolve the issue, and fix tests that surfaced it (as well as some
associated errors that cause them to be flaky when this symbol is not
defined).

Currently when a new master is detected and no credential is provided,
the agent and framework (which have very similar registration code
paths) will attempt to (re)register after some random initial `delay`,
to avoid a "thundering herd" problem.  It is hence possible to have
spurious double-(re)registrations, since a new master could be detected
after we add the `delay`d registration, but before we execute it.

In a degenerate case, suppose a single agent has a registration delay
of one minute.  A master is brought up, to which, the agent successfully
registers.  Prior to this commit, the agent will still have a scheduled
registration loop (`doReliableRegistration`) with a backoff factor.
If the master goes down and a new master is brought up, the agent
will race against itself (two ongoing loops of `doReliableRegistration`)
to register with the new master.  If the first loop reaches the new
master first, authentication will fail and cause the agent to commit
suicide.

To resolve this problem, we store the value of the `Timer` returned by
`delay` in `doReliableRegistration` and cancel it when we have either
registered, or need to start a new cycle of registration. This solution
is implemented in both the framework code and the agent code.


Diffs (updated)
-----

  src/sched/sched.cpp 1eecb8b9f8d75c123d13e73d3dd25129e6cd93c4 
  src/slave/slave.hpp 03860b5d0242289034d4574bd36a85ab6fb87a79 
  src/slave/slave.cpp a7a3a394e5e4b7f40a051663cd70add3890bdf18 
  src/tests/master_tests.cpp b3b5630899082a69119d2cccb6460766d1352963 
  src/tests/reservation_tests.cpp ffbb50bdf16fdeb0ad0aa98afbe71c38c784cd71 

Diff: https://reviews.apache.org/r/54909/diff/


Testing
-------

`make check` and `mesos-tests --gtest_repeat=1000 --gtest_break_on_failure` to catch intermittent failures, which is how we caught the failing test in `reservation_tests.cpp`. Note that this bug was discovered when we added a `delay` to the call to `authenticate` in `slave::detected` (in order to get it to match the behavior of the non-authenticated call to `doReliableRegistration`.


Thanks,

Alex Clemmer


Re: Review Request 54909: Fixed spurious registration bug in framework and agent.

Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54909/#review159945
-----------------------------------------------------------


Fix it, then Ship it!




I'm going to add this comment block above each of the four `Clock::cancel`s.
```
// Cancel the pending registration timer to avoid spurious attempts
// at reregistration. `Clock::cancel` is idempotent, so this call
// is safe even if no timer is active or pending.
```


src/sched/sched.cpp (lines 335 - 339)
<https://reviews.apache.org/r/54909/#comment230998>

    We should move the `Clock::cancel` down into the `if (master.isSome())` block, as that is where the registration loop is started.



src/slave/slave.hpp (line 799)
<https://reviews.apache.org/r/54909/#comment230999>

    Nit: We standardized how we write `reregistered` in comments a while ago.  No hyphens anymore :)



src/slave/slave.cpp (lines 923 - 924)
<https://reviews.apache.org/r/54909/#comment230997>

    We also need to cancel the agent registration timer here (as mentioned in the commit description)


- Joseph Wu


On Dec. 20, 2016, 7:09 p.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54909/
> -----------------------------------------------------------
> 
> (Updated Dec. 20, 2016, 7:09 p.m.)
> 
> 
> Review request for mesos, Andrew Schwartzmeyer, Daniel Pravat, John Kordich, and Joseph Wu.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> This issue appears when `HAS_AUTHENTICATION` is not defined. This commit
> will resolve the issue, and fix tests that surfaced it (as well as some
> associated errors that cause them to be flaky when this symbol is not
> defined).
> 
> Currently when a new master is detected and no credential is provided,
> the agent and framework (which have very similar registration code
> paths) will attempt to (re)register after some random initial `delay`,
> to avoid a "thundering herd" problem.  It is hence possible to have
> spurious double-(re)registrations, since a new master could be detected
> after we add the `delay`d registration, but before we execute it.
> 
> In a degenerate case, suppose a single agent has a registration delay
> of one minute.  A master is brought up, to which, the agent successfully
> registers.  Prior to this commit, the agent will still have a scheduled
> registration loop (`doReliableRegistration`) with a backoff factor.
> If the master goes down and a new master is brought up, the agent
> will race against itself (two ongoing loops of `doReliableRegistration`)
> to register with the new master.  If the first loop reaches the new
> master first, authentication will fail and cause the agent to commit
> suicide.
> 
> To resolve this problem, we store the value of the `Timer` returned by
> `delay` in `doReliableRegistration` and cancel it when we have either
> registered, or need to start a new cycle of registration. This solution
> is implemented in both the framework code and the agent code.
> 
> 
> Diffs
> -----
> 
>   src/sched/sched.cpp 1eecb8b9f8d75c123d13e73d3dd25129e6cd93c4 
>   src/slave/slave.hpp 03860b5d0242289034d4574bd36a85ab6fb87a79 
>   src/slave/slave.cpp a7a3a394e5e4b7f40a051663cd70add3890bdf18 
>   src/tests/master_tests.cpp b3b5630899082a69119d2cccb6460766d1352963 
>   src/tests/reservation_tests.cpp ffbb50bdf16fdeb0ad0aa98afbe71c38c784cd71 
> 
> Diff: https://reviews.apache.org/r/54909/diff/
> 
> 
> Testing
> -------
> 
> `make check` and `mesos-tests --gtest_repeat=1000 --gtest_break_on_failure` to catch intermittent failures, which is how we caught the failing test in `reservation_tests.cpp`. Note that this bug was discovered when we added a `delay` to the call to `authenticate` in `slave::detected` (in order to get it to match the behavior of the non-authenticated call to `doReliableRegistration`.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>