Posted to issues@mesos.apache.org by "Benno Evers (JIRA)" <ji...@apache.org> on 2018/02/21 14:05:02 UTC

[jira] [Commented] (MESOS-8336) MasterTest.RegistryUpdateAfterReconfiguration is flaky

    [ https://issues.apache.org/jira/browse/MESOS-8336?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16371439#comment-16371439 ] 

Benno Evers commented on MESOS-8336:
------------------------------------

The root cause here is a very familiar one that has already rendered countless other tests flaky. In particular, I'm talking about this line in `slave.cpp`:
{noformat}
    // Wait for a random amount of time before authentication or
    // registration.
    Duration duration =
      flags.registration_backoff_factor * ((double) os::random() / RAND_MAX);
{noformat}
Here, the agent sends the retried `RegisterSlaveMessage` after 9ms, *just* before shutting down, and the master notices that the network link is down before it gets around to processing the message.

As a result, the master assigns a second slave ID, then almost immediately removes that slave again because its network link is broken too, and the test finally sees the remnants of this second slave in the registry.
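
To make the timing issue concrete, here is a minimal sketch of the backoff expression quoted above. It is not the actual agent code: `backoffMs` is a hypothetical helper, `factorMs` stands in for `flags.registration_backoff_factor` expressed in milliseconds, and `randomValue` stands in for the result of `os::random()`, which I'm assuming lies in [0, RAND_MAX].

```cpp
#include <cstdlib>

// Hypothetical helper, not part of Mesos: mirrors the shape of the
// expression in slave.cpp by scaling the backoff factor with a random
// fraction in [0, 1].
//
// Assumptions: `factorMs` is the backoff factor in milliseconds and
// `randomValue` lies in [0, RAND_MAX].
double backoffMs(double factorMs, long randomValue)
{
  return factorMs * (static_cast<double>(randomValue) / RAND_MAX);
}
```

Under these assumptions the delay can fall anywhere between 0 and the full backoff factor, so a test cannot rely on the retry being sent either before or after some other event such as the agent shutting down.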

 

> MasterTest.RegistryUpdateAfterReconfiguration is flaky
> ------------------------------------------------------
>
>                 Key: MESOS-8336
>                 URL: https://issues.apache.org/jira/browse/MESOS-8336
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benno Evers
>            Priority: Major
>              Labels: flaky-test
>         Attachments: RegistryUpdateAfterReconfiguration-badrun.txt
>
>
> Observed here: https://jenkins.mesosphere.com/service/jenkins/job/mesos/job/Mesos_CI-build/2399/FLAG=CMake,label=mesos-ec2-debian-8/testReport/junit/mesos-ec2-debian-8-CMake.Mesos/MasterTest/RegistryUpdateAfterReconfiguration/
> The test here failed because the registry contained two slaves when it should have contained only one.
> Looking through the log, everything seems normal (in particular, only one slave ID appears throughout this test). The only thing out of the ordinary seems to be the agent sending two `RegisterSlaveMessage`s and two `ReregisterSlaveMessage`s, but judging by the code that generates the random backoff in the slave, that seems to be more or less normal and shouldn't break the test.


