Posted to issues@ozone.apache.org by "Stephen O'Donnell (Jira)" <ji...@apache.org> on 2020/02/28 12:46:00 UTC

[jira] [Created] (HDDS-3107) Pipelines may not be rack aware on cluster startup

Stephen O'Donnell created HDDS-3107:
---------------------------------------

             Summary: Pipelines may not be rack aware on cluster startup
                 Key: HDDS-3107
                 URL: https://issues.apache.org/jira/browse/HDDS-3107
             Project: Hadoop Distributed Data Store
          Issue Type: Sub-task
          Components: SCM
    Affects Versions: 0.6.0
            Reporter: Stephen O'Donnell
            Assignee: Stephen O'Donnell


Given a 6-node cluster with 2 racks (3 nodes per rack), it is possible for pipelines to be created in a non-rack-aware way on startup.

Using a robot test like the one in HDDS-3084, I can intermittently see that if all nodes from one rack register first, pipeline creation is triggered on them, resulting in a pipeline that is entirely on one rack. The next 3 nodes then register and, as no nodes remain available on the other rack, they too join a "one rack" pipeline.

This log snippet shows this happening. I will attach the full docker-compose log:

{code}
egrep "Sending CreatePipelineCommand|Registered Data node|Created pipe" docker-ozone-topology-ozone-topology-readdata-scm.log
scm_1         | 2020-02-28 12:27:57,826 [IPC Server handler 6 on 9861] INFO node.SCMNodeManager: Registered Data node : 74084fe6-60a9-45d6-b02c-a9fa7ed24e3a{ip: 10.5.0.6, host: ozone-topology_datanode_3_1.ozone-topology_net, networkLocation: /rack1, certSerialId: null}
scm_1         | 2020-02-28 12:27:57,840 [IPC Server handler 9 on 9861] INFO node.SCMNodeManager: Registered Data node : 32be7fa9-1ff6-4bb3-8bed-8648d276ae07{ip: 10.5.0.5, host: ozone-topology_datanode_2_1.ozone-topology_net, networkLocation: /rack1, certSerialId: null}
scm_1         | 2020-02-28 12:27:57,903 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=16806a56-8e35-46b2-aefd-cb5232d6f5f7 to datanode:32be7fa9-1ff6-4bb3-8bed-8648d276ae07
scm_1         | 2020-02-28 12:27:57,924 [RatisPipelineUtilsThread] INFO pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 16806a56-8e35-46b2-aefd-cb5232d6f5f7, Nodes: 32be7fa9-1ff6-4bb3-8bed-8648d276ae07{ip: 10.5.0.5, host: ozone-topology_datanode_2_1.ozone-topology_net, networkLocation: /rack1, certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-28T12:27:57.891553Z]
scm_1         | 2020-02-28 12:27:57,932 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=5a3edf1e-84f6-48ef-a333-6f3e924898a6 to datanode:74084fe6-60a9-45d6-b02c-a9fa7ed24e3a
scm_1         | 2020-02-28 12:27:57,933 [RatisPipelineUtilsThread] INFO pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 5a3edf1e-84f6-48ef-a333-6f3e924898a6, Nodes: 74084fe6-60a9-45d6-b02c-a9fa7ed24e3a{ip: 10.5.0.6, host: ozone-topology_datanode_3_1.ozone-topology_net, networkLocation: /rack1, certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-28T12:27:57.932422Z]
scm_1         | 2020-02-28 12:27:58,213 [IPC Server handler 8 on 9861] INFO node.SCMNodeManager: Registered Data node : 4ce489a3-e3da-4f2a-9ddc-b01b634a68b6{ip: 10.5.0.4, host: ozone-topology_datanode_1_1.ozone-topology_net, networkLocation: /rack1, certSerialId: null}
scm_1         | 2020-02-28 12:27:58,216 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=ba2034fc-cb11-482a-9843-435294862240 to datanode:4ce489a3-e3da-4f2a-9ddc-b01b634a68b6
scm_1         | 2020-02-28 12:27:58,216 [RatisPipelineUtilsThread] INFO pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: ba2034fc-cb11-482a-9843-435294862240, Nodes: 4ce489a3-e3da-4f2a-9ddc-b01b634a68b6{ip: 10.5.0.4, host: ozone-topology_datanode_1_1.ozone-topology_net, networkLocation: /rack1, certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-28T12:27:58.216275Z]
scm_1         | 2020-02-28 12:27:58,218 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=4f16913d-ec06-44b4-a577-6664a517e401 to datanode:4ce489a3-e3da-4f2a-9ddc-b01b634a68b6
scm_1         | 2020-02-28 12:27:58,219 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=4f16913d-ec06-44b4-a577-6664a517e401 to datanode:74084fe6-60a9-45d6-b02c-a9fa7ed24e3a
scm_1         | 2020-02-28 12:27:58,220 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=4f16913d-ec06-44b4-a577-6664a517e401 to datanode:32be7fa9-1ff6-4bb3-8bed-8648d276ae07
scm_1         | 2020-02-28 12:27:58,221 [RatisPipelineUtilsThread] INFO pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 4f16913d-ec06-44b4-a577-6664a517e401, Nodes: 4ce489a3-e3da-4f2a-9ddc-b01b634a68b6{ip: 10.5.0.4, host: ozone-topology_datanode_1_1.ozone-topology_net, networkLocation: /rack1, certSerialId: null}74084fe6-60a9-45d6-b02c-a9fa7ed24e3a{ip: 10.5.0.6, host: ozone-topology_datanode_3_1.ozone-topology_net, networkLocation: /rack1, certSerialId: null}32be7fa9-1ff6-4bb3-8bed-8648d276ae07{ip: 10.5.0.5, host: ozone-topology_datanode_2_1.ozone-topology_net, networkLocation: /rack1, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-28T12:27:58.218896Z]
scm_1         | 2020-02-28 12:27:58,645 [IPC Server handler 7 on 9861] INFO node.SCMNodeManager: Registered Data node : 66ec72b2-4be5-453f-ac44-cc9857bad5f0{ip: 10.5.0.8, host: ozone-topology_datanode_5_1.ozone-topology_net, networkLocation: /rack2, certSerialId: null}
scm_1         | 2020-02-28 12:27:58,645 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=4739840f-8bb3-4742-ac5e-ac519b51e0fd to datanode:66ec72b2-4be5-453f-ac44-cc9857bad5f0
scm_1         | 2020-02-28 12:27:58,647 [RatisPipelineUtilsThread] INFO pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 4739840f-8bb3-4742-ac5e-ac519b51e0fd, Nodes: 66ec72b2-4be5-453f-ac44-cc9857bad5f0{ip: 10.5.0.8, host: ozone-topology_datanode_5_1.ozone-topology_net, networkLocation: /rack2, certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-28T12:27:58.645455Z]
scm_1         | 2020-02-28 12:27:59,339 [IPC Server handler 7 on 9861] INFO node.SCMNodeManager: Registered Data node : 9be38eea-bacc-434a-876d-50b105d4daa2{ip: 10.5.0.9, host: ozone-topology_datanode_6_1.ozone-topology_net, networkLocation: /rack2, certSerialId: null}
scm_1         | 2020-02-28 12:27:59,340 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=555b9a1d-1c4a-4d9f-b198-492da7005ccd to datanode:9be38eea-bacc-434a-876d-50b105d4daa2
scm_1         | 2020-02-28 12:27:59,341 [RatisPipelineUtilsThread] INFO pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 555b9a1d-1c4a-4d9f-b198-492da7005ccd, Nodes: 9be38eea-bacc-434a-876d-50b105d4daa2{ip: 10.5.0.9, host: ozone-topology_datanode_6_1.ozone-topology_net, networkLocation: /rack2, certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-28T12:27:59.340193Z]
scm_1         | 2020-02-28 12:27:59,672 [IPC Server handler 6 on 9861] INFO node.SCMNodeManager: Registered Data node : cc1827a2-e4d2-47b4-a13a-1d990c6e36e1{ip: 10.5.0.7, host: ozone-topology_datanode_4_1.ozone-topology_net, networkLocation: /rack2, certSerialId: null}
scm_1         | 2020-02-28 12:27:59,673 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=a6d77ef7-52c0-4f6a-8c22-f0b405da08a1 to datanode:cc1827a2-e4d2-47b4-a13a-1d990c6e36e1
scm_1         | 2020-02-28 12:27:59,674 [RatisPipelineUtilsThread] INFO pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: a6d77ef7-52c0-4f6a-8c22-f0b405da08a1, Nodes: cc1827a2-e4d2-47b4-a13a-1d990c6e36e1{ip: 10.5.0.7, host: ozone-topology_datanode_4_1.ozone-topology_net, networkLocation: /rack2, certSerialId: null}, Type:RATIS, Factor:ONE, State:ALLOCATED, leaderId:null, CreationTimestamp2020-02-28T12:27:59.673585Z]
scm_1         | 2020-02-28 12:27:59,683 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=70cfd35d-b778-42df-bcba-3ba14bd8ead0 to datanode:9be38eea-bacc-434a-876d-50b105d4daa2
scm_1         | 2020-02-28 12:27:59,683 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=70cfd35d-b778-42df-bcba-3ba14bd8ead0 to datanode:66ec72b2-4be5-453f-ac44-cc9857bad5f0
scm_1         | 2020-02-28 12:27:59,683 [RatisPipelineUtilsThread] INFO pipeline.RatisPipelineProvider: Sending CreatePipelineCommand for pipeline:PipelineID=70cfd35d-b778-42df-bcba-3ba14bd8ead0 to datanode:cc1827a2-e4d2-47b4-a13a-1d990c6e36e1
scm_1         | 2020-02-28 12:27:59,684 [RatisPipelineUtilsThread] INFO pipeline.PipelineStateManager: Created pipeline Pipeline[ Id: 70cfd35d-b778-42df-bcba-3ba14bd8ead0, Nodes: 9be38eea-bacc-434a-876d-50b105d4daa2{ip: 10.5.0.9, host: ozone-topology_datanode_6_1.ozone-topology_net, networkLocation: /rack2, certSerialId: null}66ec72b2-4be5-453f-ac44-cc9857bad
{code}
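
To make the ordering problem concrete, here is a toy model of what the log appears to show. It is not Ozone code: the class and method names are made up for illustration, and it assumes a three-node pipeline is formed as soon as three unassigned nodes have registered, without waiting for other racks to appear.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Toy model of eager pipeline creation on startup; not Ozone code. */
final class EagerPipelineSimulation {

  /**
   * Takes the rack of each datanode in registration order and forms a
   * three-node pipeline as soon as three unassigned nodes exist
   * (an assumption based on the log above).
   * Returns the set of racks covered by each pipeline created.
   */
  static List<Set<String>> simulate(List<String> racksInRegistrationOrder) {
    List<String> unassigned = new ArrayList<>();
    List<Set<String>> pipelineRacks = new ArrayList<>();
    for (String rack : racksInRegistrationOrder) {
      unassigned.add(rack);
      if (unassigned.size() == 3) {
        // No check that the three nodes span more than one rack.
        pipelineRacks.add(new TreeSet<>(unassigned));
        unassigned.clear();
      }
    }
    return pipelineRacks;
  }

  public static void main(String[] args) {
    // Registration order seen in the log: all of rack1, then all of rack2.
    List<String> order = List.of(
        "/rack1", "/rack1", "/rack1", "/rack2", "/rack2", "/rack2");
    for (Set<String> racks : simulate(order)) {
      System.out.println("pipeline racks: " + racks);
    }
  }
}
```

With the registration order from the log, both pipelines in this model cover a single rack. If registrations from the two racks happen to interleave, the same logic produces pipelines that span both racks, which would explain why the problem is intermittent.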

I believe there are a few things to consider here:

1) Do we need a better way to see if rack awareness is enabled? Currently we check the network topology for a count of rack nodes, but these are only created as the nodes register. Should we use the cluster map to determine the intended number of racks on the cluster?

2) Should we fall back to non-rack-aware placement so easily? Pipelines are long lived, and if they are created non-rack-aware, they will stay that way, potentially forever. Maybe we need to delay pipeline creation on startup until the node count settles?

3) If a pipeline or new container is being placed in a non-rack-aware way on a rack-aware cluster, should we complain loudly in the logs, in JMX and in Recon?

4) Do we need something that checks for non-rack-aware pipelines and fixes them where possible? E.g. if we have 2 racks and stop 1 rack, then we must create a non-rack-aware pipeline to keep on writing, but when the other rack is restarted, that pipeline should be destroyed and a new rack-aware one created.
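
As a rough sketch of points 1 and 4, the checks could look something like the following. Everything here is hypothetical: the class, the method names and the host-to-rack map are illustrative stand-ins rather than existing SCM APIs, and it assumes a pipeline counts as rack aware when its nodes span at least two racks.

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Illustrative checks only; none of these names exist in Ozone. */
final class RackAwarenessCheck {

  /** Racks a pipeline must span to count as rack aware (assumed to be 2). */
  static final int REQUIRED_RACKS = 2;

  /**
   * Point 1: decide whether rack awareness is intended from the configured
   * host-to-rack map, rather than from the nodes registered so far.
   */
  static boolean rackAwarenessExpected(Map<String, String> hostToRack) {
    return new HashSet<>(hostToRack.values()).size() >= REQUIRED_RACKS;
  }

  /** True if the racks of a pipeline's nodes cover at least REQUIRED_RACKS. */
  static boolean isRackAware(Collection<String> pipelineNodeRacks) {
    return new HashSet<>(pipelineNodeRacks).size() >= REQUIRED_RACKS;
  }

  /**
   * Point 4: a pipeline is worth replacing only if it is not rack aware
   * and enough racks currently have healthy nodes to do better.
   */
  static boolean shouldRecreate(Collection<String> pipelineNodeRacks,
      Set<String> racksWithHealthyNodes) {
    return !isRackAware(pipelineNodeRacks)
        && racksWithHealthyNodes.size() >= REQUIRED_RACKS;
  }

  public static void main(String[] args) {
    Map<String, String> topology = Map.of("dn1", "/rack1", "dn4", "/rack2");
    List<String> startupPipeline = List.of("/rack1", "/rack1", "/rack1");
    System.out.println(rackAwarenessExpected(topology));                 // true
    // Only rack1 is up: keep the pipeline so writes can continue.
    System.out.println(shouldRecreate(startupPipeline, Set.of("/rack1"))); // false
    // Both racks are up: the single-rack pipeline can be replaced.
    System.out.println(
        shouldRecreate(startupPipeline, Set.of("/rack1", "/rack2")));    // true
  }
}
```

A check like shouldRecreate could run periodically, so that a pipeline created while one rack was down gets replaced once the rack returns. The same result could also feed the logging or metrics asked about in point 3.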



--
This message was sent by Atlassian Jira
(v8.3.4#803005)