Posted to commits@mesos.apache.org by ne...@apache.org on 2017/05/31 18:05:00 UTC

mesos git commit: Fixed flakiness in OneWayPartitionTest.MasterToSlave.

Repository: mesos
Updated Branches:
  refs/heads/master 6ea47691d -> 4dae1f8d9


Fixed flakiness in OneWayPartitionTest.MasterToSlave.

The test did not pause the clock. This allowed the following sequence of
events to occur, with low probability:

  (1) Agent sends register message M1 to master.
  (2) Agent register timer expires, sends register message M2 to master.
  (3) Master sees M1 and adds agent with ID A1.
  (4) Agent gets SlaveRegisteredMessage with ID A1.
  (5) Test case injects `exited` event for agent; master marks agent as
      disconnected.
  (6) Master sees M2; since the agent is currently disconnected, the
      master removes A1 and adds the agent with ID A2.
  (7) Agent gets SlaveRegisteredMessage with ID A2. Since this is
      unexpected, it exits ("Registered but got wrong id").

This commit fixes the test case to pause the clock; this prevents the
second registration attempt in step (2) above.

The scenario described above might occur in an actual Mesos deployment,
albeit with very low probability. This would result in a Mesos agent
shutting down immediately after initial registration. MESOS-7596 has
been created to track this issue.

Review: https://reviews.apache.org/r/59685


Project: http://git-wip-us.apache.org/repos/asf/mesos/repo
Commit: http://git-wip-us.apache.org/repos/asf/mesos/commit/4dae1f8d
Tree: http://git-wip-us.apache.org/repos/asf/mesos/tree/4dae1f8d
Diff: http://git-wip-us.apache.org/repos/asf/mesos/diff/4dae1f8d

Branch: refs/heads/master
Commit: 4dae1f8d95b776850c84fbcd8427fe9072b7bc14
Parents: 6ea4769
Author: Neil Conway <ne...@gmail.com>
Authored: Tue May 30 15:52:57 2017 -0700
Committer: Neil Conway <ne...@gmail.com>
Committed: Wed May 31 11:04:55 2017 -0700

----------------------------------------------------------------------
 src/tests/partition_tests.cpp | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/mesos/blob/4dae1f8d/src/tests/partition_tests.cpp
----------------------------------------------------------------------
diff --git a/src/tests/partition_tests.cpp b/src/tests/partition_tests.cpp
index 4ff4285..62a84f7 100644
--- a/src/tests/partition_tests.cpp
+++ b/src/tests/partition_tests.cpp
@@ -3416,6 +3416,10 @@ class OneWayPartitionTest : public MesosTest {};
 // will re-register with the master.
 TEST_F_TEMP_DISABLED_ON_WINDOWS(OneWayPartitionTest, MasterToSlave)
 {
+  // Pausing the clock ensures that the agent does not attempt to
+  // register multiple times (see MESOS-7596 for context).
+  Clock::pause();
+
   // Start a master.
   master::Flags masterFlags = CreateMasterFlags();
   Try<Owned<cluster::Master>> master = StartMaster(masterFlags);
@@ -3432,6 +3436,7 @@ TEST_F_TEMP_DISABLED_ON_WINDOWS(OneWayPartitionTest, MasterToSlave)
   Try<Owned<cluster::Slave>> slave = StartSlave(detector.get(), agentFlags);
   ASSERT_SOME(slave);
 
+  Clock::advance(agentFlags.registration_backoff_factor);
   AWAIT_READY(slaveRegisteredMessage);
 
   AWAIT_READY(ping);
@@ -3440,8 +3445,6 @@ TEST_F_TEMP_DISABLED_ON_WINDOWS(OneWayPartitionTest, MasterToSlave)
   Future<Nothing> deactivateSlave =
     FUTURE_DISPATCH(_, &MesosAllocatorProcess::deactivateSlave);
 
-  Clock::pause();
-
   // Inject a slave exited event at the master causing the master
   // to mark the slave as disconnected. The slave should not notice
   // until it receives the next ping message.