Posted to commits@mesos.apache.org by cf...@apache.org on 2022/05/01 20:02:33 UTC

[mesos] branch master updated: Fixed random SlaveRecoveryTest.PingTimeoutDuringRecovery test failure.

This is an automated email from the ASF dual-hosted git repository.

cfnatali pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git


The following commit(s) were added to refs/heads/master by this push:
     new 57088b11a Fixed random SlaveRecoveryTest.PingTimeoutDuringRecovery test failure.
57088b11a is described below

commit 57088b11a328355b7f2a53a1b3fba9928a2fde73
Author: Charles-Francois Natali <cf...@gmail.com>
AuthorDate: Sun May 1 19:10:44 2022 +0100

    Fixed random SlaveRecoveryTest.PingTimeoutDuringRecovery test failure.
    
    This test would randomly fail with:
    ```
    18:16:59 3: F0501 17:16:59.192818 19175 slave.cpp:1445] Check failed:
       state == DISCONNECTED || state == RUNNING || state == TERMINATING
    RECOVERING
    ```
    
    The cause was that the test re-starts the slave with the same PID, which
    means that timers started by the previous slave process could fire while
    the new slave process was running.
    
    In this specific case, the previous slave's ping timer fired in the
    middle of the second slave instance's recovery, triggering the check
    failure shown above.
    
    Fixed by making sure to use `Clock::advance` and `Clock::settle` after
    terminating the first instance to ensure that there are no pending
    timers.
    
    Tested by running the test in a loop while running a CPU-intensive
    workload (`stress-ng --cpu $(nproc)0`) in parallel.
---
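The race described in the commit message can be illustrated with a small
standalone model. This is only a sketch under stated assumptions: `FakeClock`,
`Agent`, the PID-keyed registry and `timeoutsSeenWhileRecovering` are
hypothetical names invented for this example and are not the libprocess or
Mesos API, and the timeout values are illustrative. The model assumes that a
timer is addressed by PID alone, so it reaches whichever instance currently
owns that PID, and that it is dropped when no instance is registered.

```cpp
#include <cstddef>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy manually driven clock (hypothetical; NOT the libprocess Clock).
class FakeClock {
public:
  // Arm a timer that runs `callback` once `delay` ticks from now.
  void schedule(long delay, std::function<void()> callback) {
    timers_.emplace(now_ + delay, std::move(callback));
  }

  // Move time forward and run every timer that has become due,
  // earliest first. This plays the role of "advance, then settle".
  void advance(long ticks) {
    now_ += ticks;
    while (!timers_.empty() && timers_.begin()->first <= now_) {
      std::function<void()> callback = std::move(timers_.begin()->second);
      timers_.erase(timers_.begin());
      callback();
    }
  }

  std::size_t pending() const { return timers_.size(); }

private:
  long now_ = 0;
  std::multimap<long, std::function<void()>> timers_;
};

// Hypothetical agent instance; it only counts the timeouts it receives.
struct Agent {
  int timeouts = 0;
};

// Returns how many ping timeouts reach the second instance while it is
// still recovering. `drainBeforeRestart` models the fix in this commit.
int timeoutsSeenWhileRecovering(bool drainBeforeRestart) {
  FakeClock clock;
  std::map<std::string, Agent*> registry;  // PID -> live instance
  const std::string pid = "slave(1)";      // reused across restarts
  const long pingTimeout = 15;             // illustrative values only
  const long maxPingTimeouts = 5;

  // The first instance starts and arms a timer addressed to its PID.
  Agent first;
  registry[pid] = &first;
  clock.schedule(pingTimeout * maxPingTimeouts, [&registry, pid]() {
    auto it = registry.find(pid);
    if (it != registry.end()) {
      it->second->timeouts++;  // delivered to whoever owns the PID now
    }
    // Otherwise nobody owns the PID and the timeout is dropped.
  });

  // The first instance terminates; its timer is still pending.
  registry.erase(pid);

  if (drainBeforeRestart) {
    // Let the stale timer expire while no instance owns the PID.
    clock.advance(pingTimeout * maxPingTimeouts);
  }

  // The second instance starts with the same PID and begins recovery.
  Agent second;
  registry[pid] = &second;

  // Time passes while the second instance is still recovering.
  clock.advance(pingTimeout * maxPingTimeouts);
  return second.timeouts;
}
```

In this model, calling `timeoutsSeenWhileRecovering(false)` lets the stale
timer reach the second instance, whereas `timeoutsSeenWhileRecovering(true)`
drains it beforehand, so the second instance sees none.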
 src/tests/slave_recovery_tests.cpp | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/src/tests/slave_recovery_tests.cpp b/src/tests/slave_recovery_tests.cpp
index 398011437..41ecd3b6d 100644
--- a/src/tests/slave_recovery_tests.cpp
+++ b/src/tests/slave_recovery_tests.cpp
@@ -1010,6 +1010,15 @@ TYPED_TEST(SlaveRecoveryTest, PingTimeoutDuringRecovery)
 
   slave.get()->terminate();
 
+  // Make sure that timers started by the previous slave's process expire:
+  // because we reuse the same PID, we don't want them to fire while the
+  // new process is running.
+  Clock::pause();
+  Clock::advance(masterFlags.agent_ping_timeout *
+    masterFlags.max_agent_ping_timeouts);
+  Clock::settle();
+  Clock::resume();
+
   Future<ReregisterExecutorMessage> reregisterExecutor =
     FUTURE_PROTOBUF(ReregisterExecutorMessage(), _, _);