You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Alexander Rukletsov (JIRA)" <ji...@apache.org> on 2017/10/25 20:39:00 UTC

[jira] [Created] (MESOS-8133) Investigate whether an agent should distinguish between registered and reregistered messages.

Alexander Rukletsov created MESOS-8133:
------------------------------------------

             Summary: Investigate whether an agent should distinguish between registered and reregistered messages.
                 Key: MESOS-8133
                 URL: https://issues.apache.org/jira/browse/MESOS-8133
             Project: Mesos
          Issue Type: Task
          Components: agent
            Reporter: Alexander Rukletsov


I have refined logs from good and bad runs of the {{MasterTest.IgnoreOldAgentReregistration}} test before {{MESOS-7726}} has been fixed. Full logs attached.
Good run:
{noformat}
19:31:28.542207 16548 slave.cpp:993] New master detected at master@172.16.10.126:60888
19:31:28.553344 16546 slave.cpp:1147] Successfully authenticated with master master@172.16.10.126:60888
19:31:28.553395 16546 slave.cpp:1626] Will retry registration in 1.49694ms if necessary
19:31:28.553436 16545 master.cpp:5801] Received register agent message from slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal)
19:31:28.553725 16545 master.cpp:5954] Registering agent at slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal) with id 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0
19:31:28.554425 16546 slave.cpp:1193] Registered with master master@172.16.10.126:60888; given agent ID 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0

19:31:28.555630 16542 slave.cpp:993] New master detected at master@172.16.10.126:60888
19:31:28.555724 16546 slave.cpp:1055] Authenticating with master master@172.16.10.126:60888
19:31:28.556710 16546 master.cpp:7966] Successfully authenticated principal 'test-principal' at slave(294)@172.16.10.126:60888
19:31:28.556766 16546 slave.cpp:1147] Successfully authenticated with master master@172.16.10.126:60888
19:31:28.556807 16546 slave.cpp:1626] Will retry registration in 9.658747ms if necessary
19:31:28.557025 16546 master.cpp:6119] Received re-register agent message from agent 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0 at slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal)
19:31:28.557148 16546 master.cpp:6222] Ignoring re-registration attempt from old agent at slave(294)@172.16.10.126:60888: agent version is 0.28.1-rc1, minimum supported agent version is 1.0.0
19:31:28.557402 16522 slave.cpp:869] Agent terminating
{noformat}
Bad run:
{noformat}
04:31:59.541496 15900 slave.cpp:993] New master detected at master@172.17.0.3:42919
04:31:59.545332 15909 slave.cpp:1147] Successfully authenticated with master master@172.17.0.3:42919
04:31:59.545543 15909 slave.cpp:1626] Will retry registration in 1.358313ms if necessary
04:31:59.545763 15906 master.cpp:5801] Received register agent message from slave(411)@172.17.0.3:42919 (37ad05bac4a7)
04:31:59.546460 15904 master.cpp:5954] Registering agent at slave(411)@172.17.0.3:42919 (37ad05bac4a7) with id 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
04:31:59.548152 15893 slave.cpp:1626] Will retry registration in 15.765509ms if necessary
04:31:59.548413 15912 master.cpp:6032] Registered agent 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0 at slave(411)@172.17.0.3:42919 (37ad05bac4a7) with [{"name":"cpus","scalar":{"value":2.0},"type":"SCALAR"},{"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"name":"disk","scalar":{"value":1024.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
04:31:59.548753 15902 slave.cpp:1193] Registered with master master@172.17.0.3:42919; given agent ID 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
04:31:59.548921 15912 master.cpp:5801] Received register agent message from slave(411)@172.17.0.3:42919 (37ad05bac4a7)

04:31:59.549670 15902 slave.cpp:993] New master detected at master@172.17.0.3:42919
04:31:59.549906 15902 slave.cpp:1193] Registered with master master@172.17.0.3:42919; given agent ID 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
/mesos/src/tests/master_tests.cpp:8041: Failure
Failed to wait 15secs for reregisterSlaveMessage
04:32:14.557504 15902 slave.cpp:869] Agent terminating
{noformat}
We can see that in the "bad" run, the agent sends {{RegisterSlaveMessage}} twice, likely because the reply from the master has not arrived in {{1.358313ms}} when the agent retries. Because the second {{SlaveRegisteredMessage}} arrives onto the agent during authentication and hence before it attempts to re-register, the {{ReregisterSlaveMessage}} message is never sent out. I suggest we introduce an extra {{REGISTERING}} state, which we enter after a (re)registration request is sent. This way we can drop all registered / reregistered confirmations coming from the previous master during the {{DISCONNECTED}} stage.

Additionally, looking at the code, it seems that after sending a re-registration request in {{ReregisterSlaveMessage}} to the master, the agent is satisfied (given it comes from the expected master), i.e., transitions from {{DISCONNECTED}} to {{RUNNING}}, with {{SlaveRegisteredMessage}}, losing any {{ReconcileTasksMessage}} that might be delivered in the coming and to be ignored {{SlaveReregisteredMessage}}. This seems incorrect to me, I believe the agent should ignore {{SlaveRegisteredMessage}} after sending {{ReregisterSlaveMessage}}.

In production environment, if the agent receives a {{registered}} when it supposes to receive {{reregistered}}, can reconciliation go wrong? Also it seems wrong to accept a {{(re)registered}} while authentication is in progress?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)