You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Niklas Nielsen <ni...@qni.dk> on 2015/06/01 22:15:52 UTC

Re: Review Request 29507: Added Configurable Slave Ping Timeouts

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86039
-----------------------------------------------------------


LGTM - Would it make sense to have sane min/max values for the timeouts/counts?

I wonder it would make sense to have a test to exercise an upgrade path (the timeout being different in the slaves, than in the master). Maybe I missed that in your first test.


docs/configuration.md
<https://reviews.apache.org/r/29507/#comment137889>

    As this will be markdown, how about formatting max_slave_ping_timeouts in `` or as "--max_slave_ping_timeouts" ?



docs/upgrades.md
<https://reviews.apache.org/r/29507/#comment137891>

    You shouldn't need this line - look at line 9 :)



src/master/master.cpp
<https://reviews.apache.org/r/29507/#comment137899>

    You should be able to const this, right?
    
    Here and below



src/messages/messages.proto
<https://reviews.apache.org/r/29507/#comment137902>

    Why a zero default?



src/slave/slave.hpp
<https://reviews.apache.org/r/29507/#comment137903>

    We should Camel case here, right?



src/slave/slave.hpp
<https://reviews.apache.org/r/29507/#comment137904>

    Same here



src/slave/slave.cpp
<https://reviews.apache.org/r/29507/#comment137905>

    Camel case?



src/slave/slave.cpp
<https://reviews.apache.org/r/29507/#comment137906>

    If we removed the 0 default, we'd do:
    
    if (ping_timeout_seconds.isSome()) {
      // ...
    }
    
    Which seems a bit more logical to me :)



src/tests/partition_tests.cpp
<https://reviews.apache.org/r/29507/#comment137936>

    What do you think about commenting on why you chose these dividing factors (Why 3, Why 2? What will these do compared to the original timeouts?)



src/tests/partition_tests.cpp
<https://reviews.apache.org/r/29507/#comment137931>

    const?



src/tests/partition_tests.cpp
<https://reviews.apache.org/r/29507/#comment137938>

    newline?



src/tests/partition_tests.cpp
<https://reviews.apache.org/r/29507/#comment137937>

    Newline?


- Niklas Nielsen


On May 28, 2015, 4:13 p.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated May 28, 2015, 4:13 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md 5a41477 
>   docs/upgrades.md 355307a 
>   src/master/constants.hpp 57cf8fb 
>   src/master/constants.cpp 8c7174a 
>   src/master/flags.hpp 84fa238 
>   src/master/flags.cpp 49d953a 
>   src/master/master.cpp 1526f59 
>   src/messages/messages.proto 39dac72 
>   src/slave/constants.hpp 206d439 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp 0207eaf 
>   src/slave/slave.cpp b4d2029 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp acae497 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>


Re: Review Request 29507: Added Configurable Slave Ping Timeouts

Posted by Adam B <ad...@mesosphere.io>.

> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > LGTM - Would it make sense to have sane min/max values for the timeouts/counts?
> > 
> > I wonder it would make sense to have a test to exercise an upgrade path (the timeout being different in the slaves, than in the master). Maybe I missed that in your first test.

Added flag validation using the cool new lambda style from https://reviews.apache.org/r/34943

It's not so easy to test the upgrade path in a unit test, since as soon as a slave from this source base gets its SlaveRegistered ack from the master, it then has the same timeout as the master (as designed). And if the slave never receives its registered ack, it will retry registration after a /random/ backoff initially capped at 1min (--registration_backoff_factor). If the random backoff_factor ends up being longer than the slave's ping timeout, when the ping timeout expires, the slave will anyway re-detect the master and retry registration. As I revisit the tests (in a follow-up patch), I will brainstorm ways to test the upgrade path.


> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > src/messages/messages.proto, line 279
> > <https://reviews.apache.org/r/29507/diff/7/?file=973883#file973883line279>
> >
> >     Why a zero default?

Originally added as a way for the recipient of this message to know that the sender didn't set it, presumably because the sender is pre-0.23. Since we don't have a way for install handlers to translate optional fields into Options instead of just pulling in the default value, I couldn't think of a cleaner option. BenM, however, suggested pushing the ping timeout into a nested submessage, so I could just check has_ping_timeout_seconds() instead. Went with that instead.


> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > src/slave/slave.cpp, line 864
> > <https://reviews.apache.org/r/29507/diff/7/?file=973887#file973887line864>
> >
> >     If we removed the 0 default, we'd do:
> >     
> >     if (ping_timeout_seconds.isSome()) {
> >       // ...
> >     }
> >     
> >     Which seems a bit more logical to me :)

Unfortunately, we don't currently translate optional protobuf fields into Options in install handlers, so I went for an explicit default. BenM suggested using a nested message so I can check `if (connection.has_ping_timeout_seconds())` instead. Much more readable.


> On June 1, 2015, 1:15 p.m., Niklas Nielsen wrote:
> > docs/configuration.md, line 339
> > <https://reviews.apache.org/r/29507/diff/7/?file=973876#file973876line339>
> >
> >     As this will be markdown, how about formatting max_slave_ping_timeouts in `` or as "--max_slave_ping_timeouts" ?

Fixed.


- Adam


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/29507/#review86039
-----------------------------------------------------------


On May 28, 2015, 4:13 p.m., Adam B wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/29507/
> -----------------------------------------------------------
> 
> (Updated May 28, 2015, 4:13 p.m.)
> 
> 
> Review request for mesos, Ben Mahler and Niklas Nielsen.
> 
> 
> Bugs: MESOS-2110
>     https://issues.apache.org/jira/browse/MESOS-2110
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Added new --slave_ping_timeout and --max_slave_ping_timeouts flags
> to mesos-master to supplement the DEFAULT_SLAVE_PING_TIMEOUT (15secs)
> and DEFAULT_MAX_SLAVE_PING_TIMEOUTS (5).
>    
> These can be extended if slaves are expected/allowed to be down for
> longer than a minute or two.
> 
> Slave will receive master's ping timeout in SlaveRe[re]gisteredMessage.
>   
> Beware that this affects recovery from network timeouts as well as
> actual slave node/process failover.
> 
> Also fixed the log message in recoveredSlavesTimeout() to correctly
> reference flags.slave_reregister_timeout instead of the unrelated
> ping timeouts.
> 
> 
> Diffs
> -----
> 
>   docs/configuration.md 5a41477 
>   docs/upgrades.md 355307a 
>   src/master/constants.hpp 57cf8fb 
>   src/master/constants.cpp 8c7174a 
>   src/master/flags.hpp 84fa238 
>   src/master/flags.cpp 49d953a 
>   src/master/master.cpp 1526f59 
>   src/messages/messages.proto 39dac72 
>   src/slave/constants.hpp 206d439 
>   src/slave/constants.cpp d8d2f98 
>   src/slave/slave.hpp 0207eaf 
>   src/slave/slave.cpp b4d2029 
>   src/tests/partition_tests.cpp f7ee3ab 
>   src/tests/slave_recovery_tests.cpp c036e9c 
>   src/tests/slave_tests.cpp acae497 
> 
> Diff: https://reviews.apache.org/r/29507/diff/
> 
> 
> Testing
> -------
> 
> Manually tested slave failover/shutdown with master using different --slave_ping_timeout and --max_slave_ping_timeouts.
> Ran unit tests with shorter non-default values for ping timeouts.
> `make check` with new unit tests: ShortPingTimeoutUnreachableMaster and ShortPingTimeoutUnreachableSlave
> 
> 
> Thanks,
> 
> Adam B
> 
>