Posted to reviews@mesos.apache.org by Vinod Kone <vi...@gmail.com> on 2015/09/01 01:55:04 UTC

Re: Review Request 37785: Fix Flaky SlaveTest.HTTPSchedulerSlaveRestart test

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/37785/#review97206
-----------------------------------------------------------



src/tests/slave_tests.cpp (lines 2624 - 2625)
<https://reviews.apache.org/r/37785/#comment153001>

    If you resume the clock here, the slave can still retry registration, right? You probably want to just pause the clock at the beginning of the test and not resume it at all inside the test.


- Vinod Kone


On Aug. 26, 2015, 3:07 a.m., Anand Mazumdar wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/37785/
> -----------------------------------------------------------
> 
> (Updated Aug. 26, 2015, 3:07 a.m.)
> 
> 
> Review request for mesos, Ben Mahler and Vinod Kone.
> 
> 
> Bugs: MESOS-3311
>     https://issues.apache.org/jira/browse/MESOS-3311
> 
> 
> Repository: mesos
> 
> 
> Description
> -------
> 
> I was not able to reproduce this with 300 gtest iterations in a loop on an Ubuntu 14.04 VM with clang + ssl, i.e. similar to the ASF setup.
> 
> The logs, though, made it fairly evident what was going on. The slave was sending a retried re-register message to the master, resulting in the master sending back another FrameworkUpdateMessage. The second one set the PID from None() back to the original pid(), so the message went directly to the scheduler instead of being routed through the master.
> 
> Log Lines:
> 
> I0825 22:07:39.085610 27642 slave.cpp:1209] Will retry registration in 6.014445ms if necessary
> I0825 22:07:39.092914 27640 master.cpp:3773] Re-registering slave 20150825-220736-234885548-51219-27610-S0 at slave(286)@172.17.0.14:51219 (09c6504e3a31)
> I0825 22:07:39.093181 27630 slave.cpp:1209] Will retry registration in 20.588077ms if necessary
> .... some lines and then
> I0825 22:07:39.094435 27640 master.cpp:3773] Re-registering slave 20150825-220736-234885548-51219-27610-S0 at slave(287)@172.17.0.14:51219 (09c6504e3a31)
> ... more lines
> I0825 22:07:39.096372 27635 slave.cpp:2131] Updating framework 20150825-220736-234885548-51219-27610-0000 pid to @0.0.0.0:0
> ... more lines
> I0825 22:07:39.097450 27635 slave.cpp:2131] Updating framework 20150825-220736-234885548-51219-27610-0000 pid to scheduler-6c5ddcdb-9dd1-4b38-b051-5f714d3c1c55@172.17.0.14:51219
> ... more lines
> I0825 22:07:39.098433 27635 slave.cpp:3043] Sending message for framework 20150825-220736-234885548-51219-27610-0000 to scheduler-6c5ddcdb-9dd1-4b38-b051-5f714d3c1c55@172.17.0.14:51219
> 
> 
> Paused the clock and then added settle/resume invocations to ensure the retry does not happen.
> 
> 
> Diffs
> -----
> 
>   src/tests/slave_tests.cpp d55e9dd4f4eb84a8fda85439e31a38e70890b377 
> 
> Diff: https://reviews.apache.org/r/37785/diff/
> 
> 
> Testing
> -------
> 
> Ran make check again with 300 iterations without failure.
> 
> 
> Thanks,
> 
> Anand Mazumdar
> 
>

Re: Review Request 37785: Fix Flaky SlaveTest.HTTPSchedulerSlaveRestart test

Posted by Anand Mazumdar <ma...@gmail.com>.


> On Aug. 31, 2015, 11:55 p.m., Vinod Kone wrote:
> > src/tests/slave_tests.cpp, lines 2624-2625
> > <https://reviews.apache.org/r/37785/diff/1/?file=1052991#file1052991line2624>
> >
> >     If you resume the clock here, the slave can still retry registration, right? You probably want to just pause the clock at the beginning of the test and not resume it at all inside the test.

Yes, it can still retry registration, but the retry would be a no-op and things would work as expected:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L1076

We wait for the clock to settle before resuming it. This means that all the pending processes will have executed, i.e. the slave will have sent the re-register message, and the master will have processed it and sent the slave a re-registered message.

The clock resumes after that. Even if a re-register retry is still sitting on the process message queue and gets triggered at this point, as you were suggesting, it won't do anything and will just return at L1076.
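
To illustrate the reasoning above, here is a minimal sketch in plain C++. It is not libprocess or Mesos code: ToyAgent, ToyQueue and the function names are hypothetical, and the guard on the RUNNING state is only an assumption about what the early return at L1076 checks. It compares a retry that fires after everything pending has settled with a retry that races the master's reply.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Hypothetical stand-in for the slave; not the real Mesos Slave class.
struct ToyAgent {
  enum State { DISCONNECTED, RUNNING };
  State state = DISCONNECTED;
  int messagesSent = 0;  // Re-register messages that reach the master.

  // Assumed guard: once the agent is registered, a late retry returns
  // without sending anything.
  void doReliableRegistration() {
    if (state == RUNNING) {
      return;  // No-op retry.
    }
    ++messagesSent;
  }

  // Handler for the master's "re-registered" reply.
  void reregistered() { state = RUNNING; }
};

// Hypothetical event queue. settle() drains everything already pending,
// which is the property the explanation relies on.
struct ToyQueue {
  std::deque<std::function<void()>> pending;

  void dispatch(std::function<void()> f) { pending.push_back(std::move(f)); }

  void settle() {
    while (!pending.empty()) {
      std::function<void()> f = std::move(pending.front());
      pending.pop_front();
      f();
    }
  }
};

// The retry fires only after the queue has settled: one message is sent.
int messagesWhenRetryFiresAfterSettle() {
  ToyAgent agent;
  ToyQueue queue;
  queue.dispatch([&agent] { agent.doReliableRegistration(); });  // First attempt.
  queue.dispatch([&agent] { agent.reregistered(); });            // Master's reply.
  queue.settle();
  agent.doReliableRegistration();  // Late retry after "resume": no-op.
  return agent.messagesSent;
}

// The retry fires before the reply is processed: two messages are sent,
// so the master would answer twice (the flaky case in the logs).
int messagesWhenRetryRacesReply() {
  ToyAgent agent;
  ToyQueue queue;
  queue.dispatch([&agent] { agent.doReliableRegistration(); });  // First attempt.
  queue.dispatch([&agent] { agent.doReliableRegistration(); });  // Early retry.
  queue.dispatch([&agent] { agent.reregistered(); });            // Master's reply.
  queue.settle();
  return agent.messagesSent;
}

// Usage: both expectations hold for this toy model.
void runExample() {
  assert(messagesWhenRetryFiresAfterSettle() == 1);
  assert(messagesWhenRetryRacesReply() == 2);
}
```

In this model the ordering is what matters: as long as the re-registered reply has been handled before the retry timer fires, the retry cannot produce a second message to the master.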

I initially tried just pausing the clock as you suggested, but it did not work for me, so I had to go down this route. The slave does not finish recovery when the clock is paused:
https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L4043


- Anand