You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Qian Zhang (JIRA)" <ji...@apache.org> on 2017/10/29 14:06:00 UTC

[jira] [Commented] (MESOS-8133) Investigate whether an agent should distinguish between registered and reregistered messages.

    [ https://issues.apache.org/jira/browse/MESOS-8133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16224005#comment-16224005 ] 

Qian Zhang commented on MESOS-8133:
-----------------------------------

{quote}
I suggest we introduce an extra REGISTERING state, which we enter after a (re)registration request is sent. This way we can drop all registered / reregistered confirmations coming from the previous master during the DISCONNECTED stage.
{quote}
So you are suggesting that we can set slave's state to {{REGISTERING}} after {{RegisterSlaveMessage}} / {{ReregisterSlaveMessage}} is sent in {{Slave::doReliableRegistration()}}, and in {{Slave::registered()}} / {{Slave::reregistered()}}, we should ignore any registered / reregistered confirmations if the slave is in {{DISCONNECTED}} state but only handle them in {{REGISTERING}} state, right?

I agree it can fix the issue described in this ticket since the second {{SlaveRegisteredMessage}} arrives onto agent before {{ReregisterSlaveMessage}} is sent, i.e., the agent is still in {{DISCONNECTED}} state, so the second {{SlaveRegisteredMessage}} will be ignored. But what if the second {{SlaveRegisteredMessage}} arrives onto agent *after* {{ReregisterSlaveMessage}} is sent? As per your suggestion, we will set agent's state to {{REGISTERING}} when {{ReregisterSlaveMessage}} is sent, so when the second {{SlaveRegisteredMessage}} arrives, we will not ignore it and will set agent's state to {{RUNNING}}, and then when {{SlaveReregisteredMessage}} arrives, it will be ignored which seems incorrect.

And I think in a production environment, there must be multiple masters (e.g., 3 or 5) for HA purpose, in such environment, if {{Slave::detected()}} is invoked, the elected master must be different from the previous one, so when the second {{SlaveRegisteredMessage}} arrives onto agent, it will be ignored in {{Slave::registered()}} because it is not from the expected master (the expected master is already set to a different master in {{Slave::detected()}}). So it seems this issue will only happen when we try to simulate new master detected event with a single master in our test, but not likely happen in the real production environment.

> Investigate whether an agent should distinguish between registered and reregistered messages.
> ---------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8133
>                 URL: https://issues.apache.org/jira/browse/MESOS-8133
>             Project: Mesos
>          Issue Type: Task
>          Components: agent
>            Reporter: Alexander Rukletsov
>         Attachments: IgnoreOldAgentReregistration-badrun.txt, IgnoreOldAgentReregistration-goodrun.txt
>
>
> I have refined logs from good and bad runs of the {{MasterTest.IgnoreOldAgentReregistration}} test before {{MESOS-7726}} has been fixed. Full logs attached.
> Good run:
> {noformat}
> 19:31:28.542207 16548 slave.cpp:993] New master detected at master@172.16.10.126:60888
> 19:31:28.553344 16546 slave.cpp:1147] Successfully authenticated with master master@172.16.10.126:60888
> 19:31:28.553395 16546 slave.cpp:1626] Will retry registration in 1.49694ms if necessary
> 19:31:28.553436 16545 master.cpp:5801] Received register agent message from slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal)
> 19:31:28.553725 16545 master.cpp:5954] Registering agent at slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal) with id 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0
> 19:31:28.554425 16546 slave.cpp:1193] Registered with master master@172.16.10.126:60888; given agent ID 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0
> 19:31:28.555630 16542 slave.cpp:993] New master detected at master@172.16.10.126:60888
> 19:31:28.555724 16546 slave.cpp:1055] Authenticating with master master@172.16.10.126:60888
> 19:31:28.556710 16546 master.cpp:7966] Successfully authenticated principal 'test-principal' at slave(294)@172.16.10.126:60888
> 19:31:28.556766 16546 slave.cpp:1147] Successfully authenticated with master master@172.16.10.126:60888
> 19:31:28.556807 16546 slave.cpp:1626] Will retry registration in 9.658747ms if necessary
> 19:31:28.557025 16546 master.cpp:6119] Received re-register agent message from agent 43dd1579-4d4f-42c6-9a02-8ac196135ba1-S0 at slave(294)@172.16.10.126:60888 (ip-172-16-10-126.ec2.internal)
> 19:31:28.557148 16546 master.cpp:6222] Ignoring re-registration attempt from old agent at slave(294)@172.16.10.126:60888: agent version is 0.28.1-rc1, minimum supported agent version is 1.0.0
> 19:31:28.557402 16522 slave.cpp:869] Agent terminating
> {noformat}
> Bad run:
> {noformat}
> 04:31:59.541496 15900 slave.cpp:993] New master detected at master@172.17.0.3:42919
> 04:31:59.545332 15909 slave.cpp:1147] Successfully authenticated with master master@172.17.0.3:42919
> 04:31:59.545543 15909 slave.cpp:1626] Will retry registration in 1.358313ms if necessary
> 04:31:59.545763 15906 master.cpp:5801] Received register agent message from slave(411)@172.17.0.3:42919 (37ad05bac4a7)
> 04:31:59.546460 15904 master.cpp:5954] Registering agent at slave(411)@172.17.0.3:42919 (37ad05bac4a7) with id 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
> 04:31:59.548152 15893 slave.cpp:1626] Will retry registration in 15.765509ms if necessary
> 04:31:59.548413 15912 master.cpp:6032] Registered agent 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0 at slave(411)@172.17.0.3:42919 (37ad05bac4a7) with [{"name":"cpus","scalar":{"value":2.0},"type":"SCALAR"},{"name":"mem","scalar":{"value":1024.0},"type":"SCALAR"},{"name":"disk","scalar":{"value":1024.0},"type":"SCALAR"},{"name":"ports","ranges":{"range":[{"begin":31000,"end":32000}]},"type":"RANGES"}]
> 04:31:59.548753 15902 slave.cpp:1193] Registered with master master@172.17.0.3:42919; given agent ID 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
> 04:31:59.548921 15912 master.cpp:5801] Received register agent message from slave(411)@172.17.0.3:42919 (37ad05bac4a7)
> 04:31:59.549670 15902 slave.cpp:993] New master detected at master@172.17.0.3:42919
> 04:31:59.549906 15902 slave.cpp:1193] Registered with master master@172.17.0.3:42919; given agent ID 1069c7ab-2d90-4a12-996b-ecfe9a871217-S0
> /mesos/src/tests/master_tests.cpp:8041: Failure
> Failed to wait 15secs for reregisterSlaveMessage
> 04:32:14.557504 15902 slave.cpp:869] Agent terminating
> {noformat}
> We can see that in the "bad" run, the agent sends {{RegisterSlaveMessage}} twice, likely because the reply from the master has not arrived in {{1.358313ms}} when the agent retries. Because the second {{SlaveRegisteredMessage}} arrives onto the agent during authentication and hence before it attempts to re-register, the {{ReregisterSlaveMessage}} message is never sent out. I suggest we introduce an extra {{REGISTERING}} state, which we enter after a (re)registration request is sent. This way we can drop all registered / reregistered confirmations coming from the previous master during the {{DISCONNECTED}} stage.
> Additionally, looking at the code, it seems that after sending a re-registration request in {{ReregisterSlaveMessage}} to the master, the agent is satisfied (given it comes from the expected master), i.e., transitions from {{DISCONNECTED}} to {{RUNNING}}, with {{SlaveRegisteredMessage}}, losing any {{ReconcileTasksMessage}} that might be delivered in the coming and to be ignored {{SlaveReregisteredMessage}}. This seems incorrect to me, I believe the agent should ignore {{SlaveRegisteredMessage}} after sending {{ReregisterSlaveMessage}}.
> In production environment, if the agent receives a {{registered}} when it supposes to receive {{reregistered}}, can reconciliation go wrong? Also it seems wrong to accept a {{(re)registered}} while authentication is in progress?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)