Posted to commits@mesos.apache.org by cf...@apache.org on 2022/08/07 10:29:26 UTC
[mesos] branch master updated: Fixed random SlaveRecoveryTest.PingTimeoutDuringRecovery test failure. (#436)
This is an automated email from the ASF dual-hosted git repository.
cfnatali pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/mesos.git
The following commit(s) were added to refs/heads/master by this push:
new fb3d05b15 Fixed random SlaveRecoveryTest.PingTimeoutDuringRecovery test failure. (#436)
fb3d05b15 is described below
commit fb3d05b152fa0e158231cc71b629bc92b62c0a3c
Author: cf-natali <cf...@gmail.com>
AuthorDate: Sun Aug 7 11:29:21 2022 +0100
Fixed random SlaveRecoveryTest.PingTimeoutDuringRecovery test failure. (#436)
This test would randomly fail with:
```
18:16:59 3: F0501 17:16:59.192818 19175 slave.cpp:1445] Check
failed:
state == DISCONNECTED || state == RUNNING || state == TERMINATING
RECOVERING
```
The cause was that the test restarts the slave with the same PID, which
means that timers started by the previous slave process could still fire
while the new slave process was running.
In this specific case, the previous slave's ping timer fired in the
middle of the second slave instance's recovery, triggering this
assertion failure.
Fixed by cancelling the `pingTimer` in the slave destructor.
Tested by running the test in a loop, in parallel with a CPU-intensive
workload (`stress-ng --cpu $(nproc)0`).
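The failure mode and the fix can be illustrated with a minimal, self-contained sketch. It does not use libprocess: `ToyClock`, `Agent`, `registry()` and `staleTimeouts()` are hypothetical stand-ins invented for this illustration, and the string address only approximates how a timer is delivered to whichever process currently owns a PID.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Toy stand-ins for illustration only; none of this is the libprocess API.
enum class State { RECOVERING, RUNNING };

class Agent;

// Maps an address (standing in for a PID) to the object currently using it.
std::map<std::string, Agent*>& registry() {
  static std::map<std::string, Agent*> instances;
  return instances;
}

class ToyClock {
public:
  using TimerId = int;

  // Registers a callback and returns a handle that can be cancelled later.
  static TimerId schedule(std::function<void()> callback) {
    static TimerId nextId = 0;
    TimerId id = nextId++;
    pending()[id] = std::move(callback);
    return id;
  }

  // Returns true if the timer was still pending and has been removed.
  static bool cancel(TimerId id) { return pending().erase(id) > 0; }

  // Fires every pending timer, as if enough time had elapsed.
  static void fireAll() {
    auto due = std::move(pending());
    pending().clear();
    for (auto& entry : due) entry.second();
  }

private:
  static std::map<TimerId, std::function<void()>>& pending() {
    static std::map<TimerId, std::function<void()>> timers;
    return timers;
  }
};

class Agent {
public:
  Agent(const std::string& address, bool cancelOnDestroy)
    : address_(address), cancelOnDestroy_(cancelOnDestroy) {
    registry()[address_] = this;
  }

  ~Agent() {
    // The fix being illustrated: drop our timer before going away.
    if (cancelOnDestroy_) ToyClock::cancel(pingTimer_);
    registry().erase(address_);
  }

  void startPingTimer() {
    std::string address = address_;
    // The callback targets the address, not this particular object.
    pingTimer_ = ToyClock::schedule([address]() {
      auto it = registry().find(address);
      if (it != registry().end()) it->second->pingTimeout();
    });
  }

  void pingTimeout() {
    // The real code would abort on a failed check here; we only count.
    if (state == State::RECOVERING) ++invalidTimeouts;
  }

  State state = State::RECOVERING;
  int invalidTimeouts = 0;

private:
  std::string address_;
  bool cancelOnDestroy_;
  ToyClock::TimerId pingTimer_ = -1;
};

// Restarts an agent on the same address and reports how many stale
// timeouts reached the new instance while it was still recovering.
int staleTimeouts(bool cancelOnDestroy) {
  {
    Agent first("slave(1)", cancelOnDestroy);
    first.state = State::RUNNING;
    first.startPingTimer();
  }  // The first instance is destroyed here.

  Agent second("slave(1)", cancelOnDestroy);  // Same address, RECOVERING.
  ToyClock::fireAll();
  return second.invalidTimeouts;
}
```

In this sketch, `staleTimeouts(false)` returns 1 because the old timer reaches the new instance during recovery, while `staleTimeouts(true)` returns 0 because the destructor cancelled it first.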
---
src/slave/slave.cpp | 2 ++
1 file changed, 2 insertions(+)
diff --git a/src/slave/slave.cpp b/src/slave/slave.cpp
index 3d53db49b..dd229fe9b 100644
--- a/src/slave/slave.cpp
+++ b/src/slave/slave.cpp
@@ -255,6 +255,8 @@ Slave::~Slave()
   // TODO(benh): Shut down executors? The executor should get an "exited"
   // event and initiate a shut down itself.
 
+  Clock::cancel(pingTimer);
+
   foreachvalue (Framework* framework, frameworks) {
     delete framework;
   }