You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Neil Conway (JIRA)" <ji...@apache.org> on 2016/10/19 21:05:58 UTC

[jira] [Created] (MESOS-6421) Agent enters re-registration loop after "--recovery=cleanup"

Neil Conway created MESOS-6421:
----------------------------------

             Summary: Agent enters re-registration loop after "--recovery=cleanup"
                 Key: MESOS-6421
                 URL: https://issues.apache.org/jira/browse/MESOS-6421
             Project: Mesos
          Issue Type: Bug
          Components: general
            Reporter: Neil Conway
            Assignee: Neil Conway


Repro:

1. Start a master
2. Start an agent. The agent should register with the master.
3. Ctrl-C agent.
4. Start agent with {{--recover=cleanup}}. Agent will cleanup {{work_dir}} and terminate.
5. Start agent without any recovery flag. Agent will then proceed to register and obtain a new slave ID. However, it will repeatedly disconnect and re-register with the master:

{noformat}
I1019 16:59:21.859954 3747840 slave.cpp:915] New master detected at master@10.0.9.176:5050
I1019 16:59:21.859997 3747840 slave.cpp:936] No credentials provided. Attempting to register without authentication
I1019 16:59:21.860038 3747840 slave.cpp:947] Detecting new master
I1019 16:59:22.712779 528384 slave.cpp:1115] Registered with master master@10.0.9.176:5050; given agent ID 3615d417-b3fd-4e00-b794-9396456c3f6a-S1
I1019 16:59:22.712908 2138112 status_update_manager.cpp:184] Resuming sending status updates
I1019 16:59:22.713953 528384 slave.cpp:1175] Forwarding total oversubscribed resources {}
I1019 16:59:25.082317 3747840 slave.cpp:4147] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration.
I1019 16:59:25.082756 3747840 slave.cpp:904] Re-detecting master
I1019 16:59:25.082767 2138112 status_update_manager.cpp:177] Pausing sending status updates
I1019 16:59:25.082801 3747840 slave.cpp:947] Detecting new master
I1019 16:59:25.083293 4284416 status_update_manager.cpp:177] Pausing sending status updates
I1019 16:59:25.083300 528384 slave.cpp:915] New master detected at master@10.0.9.176:5050
I1019 16:59:25.083349 528384 slave.cpp:936] No credentials provided. Attempting to register without authentication
I1019 16:59:25.083395 528384 slave.cpp:947] Detecting new master
I1019 16:59:25.869060 1064960 slave.cpp:1217] Re-registered with master master@10.0.9.176:5050
I1019 16:59:25.869246 4284416 status_update_manager.cpp:184] Resuming sending status updates
I1019 16:59:25.869246 1064960 slave.cpp:1253] Forwarding total oversubscribed resources {}
I1019 16:59:40.087697 4284416 slave.cpp:4147] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration.
I1019 16:59:40.088105 4284416 slave.cpp:904] Re-detecting master
I1019 16:59:40.088120 3747840 status_update_manager.cpp:177] Pausing sending status updates
I1019 16:59:40.088145 4284416 slave.cpp:947] Detecting new master
I1019 16:59:40.088599 2138112 status_update_manager.cpp:177] Pausing sending status updates
I1019 16:59:40.088623 1601536 slave.cpp:915] New master detected at master@10.0.9.176:5050
I1019 16:59:40.088656 1601536 slave.cpp:936] No credentials provided. Attempting to register without authentication
I1019 16:59:40.088716 1601536 slave.cpp:947] Detecting new master
I1019 16:59:41.006837 2138112 slave.cpp:1217] Re-registered with master master@10.0.9.176:5050
I1019 16:59:41.007017 2674688 status_update_manager.cpp:184] Resuming sending status updates
I1019 16:59:41.007035 2138112 slave.cpp:1253] Forwarding total oversubscribed resources {}
I1019 16:59:55.089197 2674688 slave.cpp:4147] Master marked the agent as disconnected but the agent considers itself registered! Forcing re-registration.
{noformat}

This continues on, seemingly indefinitely. Master logs during this period:

{noformat}
I1019 16:59:04.307692 528384 master.cpp:5579] Received update of agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S0 at slave(1)@10.0.9.176:5051 (10.0.9.176) with total oversubscribed resources {}
I1019 16:59:04.307929 528384 hierarchical.cpp:555] Agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S0 (10.0.9.176) updated with oversubscribed resources {} (total: cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000], allocated: {})
E1019 16:59:06.411664 4820992 process.cpp:2154] Failed to shutdown socket with fd 12: Socket is not connected
I1019 16:59:06.411717 2138112 master.cpp:1259] Agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S0 at slave(1)@10.0.9.176:5051 (10.0.9.176) disconnected
I1019 16:59:06.411767 2138112 master.cpp:2948] Disconnecting agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S0 at slave(1)@10.0.9.176:5051 (10.0.9.176)
E1019 16:59:06.411854 4820992 process.cpp:2154] Failed to shutdown socket with fd 9: Socket is not connected
I1019 16:59:06.411887 2138112 master.cpp:2967] Deactivating agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S0 at slave(1)@10.0.9.176:5051 (10.0.9.176)
I1019 16:59:06.412001 2674688 hierarchical.cpp:584] Agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S0 deactivated
E1019 16:59:10.078735 4820992 process.cpp:2154] Failed to shutdown socket with fd 9: Socket is not connected
I1019 16:59:22.705441 3211264 master.cpp:5070] Removing old disconnected agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S0 at slave(1)@10.0.9.176:5051 (10.0.9.176) because a registration attempt occurred
I1019 16:59:22.705507 3211264 master.cpp:5111] Registering agent at slave(1)@10.0.9.176:5051 (10.0.9.176) with id 3615d417-b3fd-4e00-b794-9396456c3f6a-S1
I1019 16:59:22.705821 3747840 registrar.cpp:461] Applied 1 operations in 90us; attempting to update the registry
I1019 16:59:22.706418 528384 coordinator.cpp:348] Coordinator attempting to write APPEND action at position 5
I1019 16:59:22.707296 2674688 replica.cpp:537] Replica received write request for position 5 from __req_res__(10)@10.0.9.176:5050
I1019 16:59:22.710041 2138112 replica.cpp:691] Replica received learned notice for position 5 from @0.0.0.0:0
I1019 16:59:22.711555 3211264 registrar.cpp:506] Successfully updated the registry in 5.64224ms
I1019 16:59:22.711786 3747840 coordinator.cpp:348] Coordinator attempting to write TRUNCATE action at position 6
I1019 16:59:22.712194 1601536 master.cpp:5182] Registered agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176) with cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000]
I1019 16:59:22.712391 2138112 hierarchical.cpp:485] Added agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 (10.0.9.176) with cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000] (allocated: {})
I1019 16:59:22.712443 3211264 replica.cpp:537] Replica received write request for position 6 from __req_res__(11)@10.0.9.176:5050
I1019 16:59:22.712982 1601536 replica.cpp:691] Replica received learned notice for position 6 from @0.0.0.0:0
I1019 16:59:22.714227 4284416 master.cpp:5579] Received update of agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176) with total oversubscribed resources {}
I1019 16:59:22.714383 3747840 hierarchical.cpp:555] Agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 (10.0.9.176) updated with oversubscribed resources {} (total: cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000], allocated: {})
I1019 16:59:25.868541 3747840 master.cpp:5259] Re-registering agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176)
I1019 16:59:25.868702 3747840 master.cpp:5518] Sending updated checkpointed resources {} to agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176)
I1019 16:59:25.869714 1064960 master.cpp:5579] Received update of agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176) with total oversubscribed resources {}
I1019 16:59:25.869993 3211264 hierarchical.cpp:555] Agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 (10.0.9.176) updated with oversubscribed resources {} (total: cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000], allocated: {})
I1019 16:59:41.006363 3211264 master.cpp:5259] Re-registering agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176)
I1019 16:59:41.006510 3211264 master.cpp:5518] Sending updated checkpointed resources {} to agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176)
I1019 16:59:41.007469 2674688 master.cpp:5579] Received update of agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176) with total oversubscribed resources {}
I1019 16:59:41.007802 2138112 hierarchical.cpp:555] Agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 (10.0.9.176) updated with oversubscribed resources {} (total: cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000], allocated: {})
I1019 16:59:55.429986 2674688 master.cpp:5259] Re-registering agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176)
I1019 16:59:55.430091 2674688 master.cpp:5518] Sending updated checkpointed resources {} to agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176)
I1019 16:59:55.430687 4284416 master.cpp:5579] Received update of agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176) with total oversubscribed resources {}
I1019 16:59:55.430851 2674688 hierarchical.cpp:555] Agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 (10.0.9.176) updated with oversubscribed resources {} (total: cpus(*):8; mem(*):15360; disk(*):470832; ports(*):[31000-32000], allocated: {})
I1019 17:00:10.371459 4284416 master.cpp:5259] Re-registering agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176)
I1019 17:00:10.371608 4284416 master.cpp:5518] Sending updated checkpointed resources {} to agent 3615d417-b3fd-4e00-b794-9396456c3f6a-S1 at slave(1)@10.0.9.176:5051 (10.0.9.176)
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)