You are viewing a plain text version of this content. The canonical link for it is here.
Posted to reviews@mesos.apache.org by Alex Clemmer <cl...@gmail.com> on 2016/12/16 02:47:23 UTC

Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/
-----------------------------------------------------------

Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.


Bugs: MESOS-6803
    https://issues.apache.org/jira/browse/MESOS-6803


Repository: mesos


Description
-------

Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
use `delay` to schedule a random time in the future to register with the
Master, to avoid the thundering herd problem after a Master failover.
The authentication codepath, in contrast, schedules the registration
immediately.

In tests where we have `Clock::pause`'d when we are supposed to be
registering the slave, the authention codepath will succeeed, while
no-authentication codepath will hang forever.

A much more detailed analysis of this situation exists in MESOS-6803.

This commit will resolve this issue for `slave_tests.cpp` by changing
the tests to not use `Clock::pause` when we are waiting for Agent
registration.


Diffs
-----

  src/tests/slave_tests.cpp fc6b56c074c71b827a9ee522cd715c0d15ecc7e3 

Diff: https://reviews.apache.org/r/54803/diff/


Testing
-------

Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.


Thanks,

Alex Clemmer


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Alex Clemmer <cl...@gmail.com>.

> On Dec. 16, 2016, 7:04 p.m., Greg Mann wrote:
> > src/tests/slave_tests.cpp, lines 2733-2736
> > <https://reviews.apache.org/r/54803/diff/1/?file=1587155#file1587155line2733>
> >
> >     In some cases you explicitly advance the clock by the backoff factor, while here you resume the clock instead. What do you think about consistently using `Clock::advance(flags.registration_backoff_factor);` in order to establish a common pattern?

I personally think it's a really great idea. I didn't do it here because I wasn't sure how long to advance specifically, and in particular, it's not clear to me whether we will possibly wait a multiple of `registration_backoff_factor`.

What would you use in the call to `Clock::advance` here?


- Alex


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159462
-----------------------------------------------------------


On Dec. 16, 2016, 2:47 a.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 16, 2016, 2:47 a.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp fc6b56c074c71b827a9ee522cd715c0d15ecc7e3 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Greg Mann <gr...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159462
-----------------------------------------------------------




src/tests/slave_tests.cpp (lines 2733 - 2736)
<https://reviews.apache.org/r/54803/#comment230467>

    In some cases you explicitly advance the clock by the backoff factor, while here you resume the clock instead. What do you think about consistently using `Clock::advance(flags.registration_backoff_factor);` in order to establish a common pattern?


- Greg Mann


On Dec. 16, 2016, 2:47 a.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 16, 2016, 2:47 a.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp fc6b56c074c71b827a9ee522cd715c0d15ecc7e3 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159531
-----------------------------------------------------------



Patch looks great!

Reviews applied: [54803]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh

- Mesos ReviewBot


On Dec. 16, 2016, 7:23 p.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 16, 2016, 7:23 p.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp fc6b56c074c71b827a9ee522cd715c0d15ecc7e3 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159599
-----------------------------------------------------------



Patch looks great!

Reviews applied: [54803]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker_build.sh

- Mesos ReviewBot


On Dec. 17, 2016, 11:01 p.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 17, 2016, 11:01 p.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp d956a326ef29bf29837e0587a14bae457147cbca 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Alex Clemmer <cl...@gmail.com>.

> On Dec. 20, 2016, 12:01 a.m., Greg Mann wrote:
> > I was able to catch one flaky test by running the agent tests in repetition. For the other patches you're working on, I would recommend running the altered tests for a while with `--gtest_repeat=-1 --gtest_break_on_failure` to check for flakiness.

Thanks for the tip. This time I verified this solution with:

```
make mesos-tests -j4 && ./src/mesos-tests --gtest_repeat=1000 --gtest_break_on_failure --gtest_filter="SlaveTest.DuplicateTerminalUpdateBeforeAck:SlaveTest.MetricsSlaveLaunchErrors:SlaveTest.StateEndpoint:SlaveTest.PingTimeoutNoPings:SlaveTest.PingTimeoutSomePings:SlaveTest.ReregisterWithStatusUpdateTaskState"
```


- Alex


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159647
-----------------------------------------------------------


On Dec. 17, 2016, 11:01 p.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 17, 2016, 11:01 p.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp d956a326ef29bf29837e0587a14bae457147cbca 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Greg Mann <gr...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159647
-----------------------------------------------------------



I was able to catch one flaky test by running the agent tests in repetition. For the other patches you're working on, I would recommend running the altered tests for a while with `--gtest_repeat=-1 --gtest_break_on_failure` to check for flakiness.


src/tests/slave_tests.cpp (lines 203 - 204)
<https://reviews.apache.org/r/54803/#comment230669>

    When assigning, we usually break after the equal sign:
    
    ```
    Try<Owned<cluster::Slave>> slave =
      StartSlave(detector.get(), &containerizer, agentFlags);
    ```



src/tests/slave_tests.cpp (lines 3661 - 3662)
<https://reviews.apache.org/r/54803/#comment230684>

    Line break after equal sign



src/tests/slave_tests.cpp (lines 3691 - 3697)
<https://reviews.apache.org/r/54803/#comment230685>

    Instead of pausing the clock just after the status update arrives, let's pause the clock at the beginning of the test and advance it manually.
    
    The only thing depending on the clock above this point in the test is the offer being sent by the agent. So if you pause the clock at the beginning of the test (and create some `masterFlags` as you've done elsewhere), you could do:
    
    ```
    Future<Nothing> ___statusUpdate = FUTURE_DISPATCH(_, &Slave::___statusUpdate);
    
    driver.start();
    
    Clock::advance(masterFlags.allocation_interval);
    
    // Wait until TASK_RUNNING is sent to the master.
    AWAIT_READY(statusUpdateMessage);
    ```



src/tests/slave_tests.cpp (lines 3725 - 3726)
<https://reviews.apache.org/r/54803/#comment230668>

    This await appears to be flaky. After testing a bit, it looks like the agent sometimes fails to reregister because the clock gets advanced *before* `Slave::detected` calls `delay` to set a timer for the delayed reregistration.
    
    We can resolve this by running `Clock::settle()` here before we advance the clock.



src/tests/slave_tests.cpp (line 5626)
<https://reviews.apache.org/r/54803/#comment230686>

    s/batchallocation/batch allocation/


- Greg Mann


On Dec. 17, 2016, 11:01 p.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 17, 2016, 11:01 p.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp d956a326ef29bf29837e0587a14bae457147cbca 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Alex Clemmer <cl...@gmail.com>.

> On Dec. 20, 2016, 5:36 p.m., Greg Mann wrote:
> > LGTM, thanks Alex! Have you tested these on Windows as well? If so, it would be good to include in the "testing" section.

I have, but unfortunately, some of these tests have other issues that I need to deal with on Windows before they're ready to ship. I have another review in the chain that deals with this orthogonal concern.


- Alex


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159732
-----------------------------------------------------------


On Dec. 20, 2016, 10:37 a.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 20, 2016, 10:37 a.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp d956a326ef29bf29837e0587a14bae457147cbca 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Greg Mann <gr...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159732
-----------------------------------------------------------


Ship it!




LGTM, thanks Alex! Have you tested these on Windows as well? If so, it would be good to include in the "testing" section.

- Greg Mann


On Dec. 20, 2016, 10:37 a.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 20, 2016, 10:37 a.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp d956a326ef29bf29837e0587a14bae457147cbca 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Joseph Wu <jo...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159850
-----------------------------------------------------------


Ship it!




Confirmed that these changes will fix the tests when `HAS_AUTHENTICATION=false`.

```
cmake .. -DHAS_AUTHENTICATION=0
make check GTEST_FILTER=""
src/mesos-tests --gtest_filter="SlaveTest*"
```

- Joseph Wu


On Dec. 20, 2016, 2:32 p.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 20, 2016, 2:32 p.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp d956a326ef29bf29837e0587a14bae457147cbca 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Adam B <ad...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159950
-----------------------------------------------------------



Looks pretty reasonable to me, but I've got a few questions about your pausing/advancing.


src/tests/slave_tests.cpp (line 228)
<https://reviews.apache.org/r/54803/#comment231001>

    Do you really need to advance by both values, or would advancing by the larger of the two be sufficient?



src/tests/slave_tests.cpp (line 3648)
<https://reviews.apache.org/r/54803/#comment231000>

    Why did you need to move the pause all the way up here, instead of just before the first `advance`? Please add a comment about why we're pausing.


- Adam B


On Dec. 20, 2016, 2:32 p.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 20, 2016, 2:32 p.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp d956a326ef29bf29837e0587a14bae457147cbca 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Alex Clemmer <cl...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/
-----------------------------------------------------------

(Updated Dec. 20, 2016, 10:32 p.m.)


Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.


Changes
-------

Add comment explaining our use of `Clock::settle`.


Bugs: MESOS-6803
    https://issues.apache.org/jira/browse/MESOS-6803


Repository: mesos


Description
-------

Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
use `delay` to schedule a random time in the future to register with the
Master, to avoid the thundering herd problem after a Master failover.
The authentication codepath, in contrast, schedules the registration
immediately.

In tests where we have `Clock::pause`'d when we are supposed to be
registering the slave, the authention codepath will succeeed, while
no-authentication codepath will hang forever.

A much more detailed analysis of this situation exists in MESOS-6803.

This commit will resolve this issue for `slave_tests.cpp` by changing
the tests to not use `Clock::pause` when we are waiting for Agent
registration.


Diffs (updated)
-----

  src/tests/slave_tests.cpp d956a326ef29bf29837e0587a14bae457147cbca 

Diff: https://reviews.apache.org/r/54803/diff/


Testing
-------

Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.


Thanks,

Alex Clemmer


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Alex Clemmer <cl...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/
-----------------------------------------------------------

(Updated Dec. 20, 2016, 10:37 a.m.)


Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.


Changes
-------

Addressed Greg's comments.


Bugs: MESOS-6803
    https://issues.apache.org/jira/browse/MESOS-6803


Repository: mesos


Description
-------

Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
use `delay` to schedule a random time in the future to register with the
Master, to avoid the thundering herd problem after a Master failover.
The authentication codepath, in contrast, schedules the registration
immediately.

In tests where we have `Clock::pause`'d when we are supposed to be
registering the slave, the authention codepath will succeeed, while
no-authentication codepath will hang forever.

A much more detailed analysis of this situation exists in MESOS-6803.

This commit will resolve this issue for `slave_tests.cpp` by changing
the tests to not use `Clock::pause` when we are waiting for Agent
registration.


Diffs (updated)
-----

  src/tests/slave_tests.cpp d956a326ef29bf29837e0587a14bae457147cbca 

Diff: https://reviews.apache.org/r/54803/diff/


Testing
-------

Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.


Thanks,

Alex Clemmer


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Alex Clemmer <cl...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/
-----------------------------------------------------------

(Updated Dec. 17, 2016, 11:01 p.m.)


Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.


Changes
-------

Replace instances of `Clock::resume` with `Clock::advance`.


Bugs: MESOS-6803
    https://issues.apache.org/jira/browse/MESOS-6803


Repository: mesos


Description
-------

Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
use `delay` to schedule a random time in the future to register with the
Master, to avoid the thundering herd problem after a Master failover.
The authentication codepath, in contrast, schedules the registration
immediately.

In tests where we have `Clock::pause`'d when we are supposed to be
registering the slave, the authention codepath will succeeed, while
no-authentication codepath will hang forever.

A much more detailed analysis of this situation exists in MESOS-6803.

This commit will resolve this issue for `slave_tests.cpp` by changing
the tests to not use `Clock::pause` when we are waiting for Agent
registration.


Diffs (updated)
-----

  src/tests/slave_tests.cpp d956a326ef29bf29837e0587a14bae457147cbca 

Diff: https://reviews.apache.org/r/54803/diff/


Testing
-------

Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.


Thanks,

Alex Clemmer


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Greg Mann <gr...@mesosphere.io>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159484
-----------------------------------------------------------




src/tests/slave_tests.cpp (lines 2733 - 2736)
<https://reviews.apache.org/r/54803/#comment230487>

    It looks to me like
    ```
    Clock::advance(totalTimeout);
    Clock::advance(flags.registration_backoff_factor);
    ```
    here would be sufficient. First we advance by `totalTimeout`, which allows two ping timeout intervals to elapse, leading to the agent being removed. If we then advance by the backoff factor, we can be assured that the agent will reregister even if it delays the first registration attempt. Does that make sense?


- Greg Mann


On Dec. 16, 2016, 7:23 p.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 16, 2016, 7:23 p.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp fc6b56c074c71b827a9ee522cd715c0d15ecc7e3 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Alex Clemmer <cl...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/
-----------------------------------------------------------

(Updated Dec. 16, 2016, 7:23 p.m.)


Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.


Bugs: MESOS-6803
    https://issues.apache.org/jira/browse/MESOS-6803


Repository: mesos


Description
-------

Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
use `delay` to schedule a random time in the future to register with the
Master, to avoid the thundering herd problem after a Master failover.
The authentication codepath, in contrast, schedules the registration
immediately.

In tests where we have `Clock::pause`'d when we are supposed to be
registering the slave, the authention codepath will succeeed, while
no-authentication codepath will hang forever.

A much more detailed analysis of this situation exists in MESOS-6803.

This commit will resolve this issue for `slave_tests.cpp` by changing
the tests to not use `Clock::pause` when we are waiting for Agent
registration.


Diffs
-----

  src/tests/slave_tests.cpp fc6b56c074c71b827a9ee522cd715c0d15ecc7e3 

Diff: https://reviews.apache.org/r/54803/diff/


Testing
-------

Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.


Thanks,

Alex Clemmer


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Alex Clemmer <cl...@gmail.com>.

> On Dec. 16, 2016, 4:27 a.m., Mesos ReviewBot wrote:
> > Bad patch!
> > 
> > Reviews applied: [54803, 54677, 54618, 54611, 53706, 52624, 52625, 52972, 52544, 52364]
> > 
> > Failed command: ./support/apply-review.sh -n -r 52544
> > 
> > Error:
> > 2016-12-16 04:26:46 URL:https://reviews.apache.org/r/52544/diff/raw/ [15399/15399] -> "52544.patch" [1]
> > error: patch failed: 3rdparty/stout/include/stout/os.hpp:47
> > error: 3rdparty/stout/include/stout/os.hpp: patch does not apply
> > 
> > Full log: https://builds.apache.org/job/mesos-reviewbot/16468/console

This patch is bad because of dpravat's stale `os::user` review. Let's remove the dependency.


- Alex


-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159409
-----------------------------------------------------------


On Dec. 16, 2016, 2:47 a.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 16, 2016, 2:47 a.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp fc6b56c074c71b827a9ee522cd715c0d15ecc7e3 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>


Re: Review Request 54803: Fixed `SlaveTests` to pass when `HAS_AUTHENTICATION` is undefined.

Posted by Mesos ReviewBot <re...@mesos.apache.org>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54803/#review159409
-----------------------------------------------------------



Bad patch!

Reviews applied: [54803, 54677, 54618, 54611, 53706, 52624, 52625, 52972, 52544, 52364]

Failed command: ./support/apply-review.sh -n -r 52544

Error:
2016-12-16 04:26:46 URL:https://reviews.apache.org/r/52544/diff/raw/ [15399/15399] -> "52544.patch" [1]
error: patch failed: 3rdparty/stout/include/stout/os.hpp:47
error: 3rdparty/stout/include/stout/os.hpp: patch does not apply

Full log: https://builds.apache.org/job/mesos-reviewbot/16468/console

- Mesos ReviewBot


On Dec. 16, 2016, 2:47 a.m., Alex Clemmer wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/54803/
> -----------------------------------------------------------
> 
> (Updated Dec. 16, 2016, 2:47 a.m.)
> 
> 
> Review request for mesos, Adam B, Andrew Schwartzmeyer, Daniel Pravat, Greg Mann, John Kordich, Joseph Wu, and Vinod Kone.
> 
> 
> Bugs: MESOS-6803
>     https://issues.apache.org/jira/browse/MESOS-6803
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> Currently, when `HAS_AUTHENTICATION` is undefined, the Agent will
> use `delay` to schedule a random time in the future to register with the
> Master, to avoid the thundering herd problem after a Master failover.
> The authentication codepath, in contrast, schedules the registration
> immediately.
> 
> In tests where we have `Clock::pause`'d when we are supposed to be
> registering the slave, the authention codepath will succeeed, while
> no-authentication codepath will hang forever.
> 
> A much more detailed analysis of this situation exists in MESOS-6803.
> 
> This commit will resolve this issue for `slave_tests.cpp` by changing
> the tests to not use `Clock::pause` when we are waiting for Agent
> registration.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp fc6b56c074c71b827a9ee522cd715c0d15ecc7e3 
> 
> Diff: https://reviews.apache.org/r/54803/diff/
> 
> 
> Testing
> -------
> 
> Added `delay` to the call to `authenticate` in `Slave::detected`, ran tests to find failing tests in `SlaveTest.*`, then fixed, then ran again.
> 
> 
> Thanks,
> 
> Alex Clemmer
> 
>