You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "Ethan Rose (Jira)" <ji...@apache.org> on 2021/06/11 12:54:00 UTC
[jira] [Updated] (HDDS-5336) Fix datanode capacity related race
condition
[ https://issues.apache.org/jira/browse/HDDS-5336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ethan Rose updated HDDS-5336:
-----------------------------
Description:
After merging master into the upgrade branch in HDDS-5321, an intermittent failure was noticed in TestSCMNodeManager#testLayoutOnHeartbeat: https://github.com/apache/ozone/runs/2787582345
The issue occurs in SCMNodeManager#register, where the node is added to the nodeStateManager firing the NEW_NODE event, before the node report containing storage information for the new node is processed. The event triggers a one shot run on the background pipeline creator which will read the node's storage information to determine if it can hold a pipeline. If the storage report has not yet been processed when this happens, no pipeline will be created to use the new node when it is registered, because the node still appears to have no free space.
Relevant log lines from the test failure:
{code}
2021-06-09 21:04:44,087 [Listener at 0.0.0.0/34005] INFO net.NetworkTopologyImpl (NetworkTopologyImpl.java:add(112)) - Added a new node: /default-rack/b06583c0-2c53-452b-83e4-398ff0104f72
2021-06-09 21:04:44,087 [RatisPipelineUtilsThread - 0] WARN pipeline.PipelinePlacementPolicy (PipelinePlacementPolicy.java:filterViableNodes(151)) - Pipeline creation failed due to no sufficient healthy datanodes. Required 3. Found 2.
2021-06-09 21:04:44,088 [EventQueue-NewNodeForNewNodeHandler] INFO pipeline.BackgroundPipelineCreator (BackgroundPipelineCreatorV2.java:notifyEventTriggered(282)) - trigger a one-shot run on RatisPipelineUtilsThread.
2021-06-09 21:04:44,088 [RatisPipelineUtilsThread - 0] INFO pipeline.RatisPipelineProvider (RatisPipelineProvider.java:lambda$create$0(170)) - Sending CreatePipelineCommand for pipeline:PipelineID=8bfba789-d337-4fed-9eb6-b1debd3d19e8 to datanode:b06583c0-2c53-452b-83e4-398ff0104f72
2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO pipeline.PipelineStateManager (PipelineStateManagerV2Impl.java:addPipeline(101)) - Created pipeline Pipeline[ Id: 8bfba789-d337-4fed-9eb6-b1debd3d19e8, Nodes: b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host: localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/ONE, State:ALLOCATED, leaderId:, CreationTimestamp2021-06-09T21:04:44.088Z].
2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO ha.SCMHAInvocationHandler (SCMHAInvocationHandler.java:invokeRatis(113)) - Invoking method public abstract void org.apache.hadoop.hdds.scm.pipeline.StateManager.addPipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$Pipeline) throws java.io.IOException on target org.apache.hadoop.hdds.scm.ha.MockSCMHAManager$MockRatisServer@5bf60155, cost 655.117us
2021-06-09 21:04:44,091 [RatisPipelineUtilsThread - 0] WARN pipeline.PipelinePlacementPolicy (PipelinePlacementPolicy.java:filterViableNodes(170)) - Pipeline creation failed due to no sufficient healthy datanodes with enough space for even a single container. Required 3. Found 2. Container size 5368709120.
2021-06-09 21:04:44,092 [Listener at 0.0.0.0/34005] INFO node.SCMNodeManager (SCMNodeManager.java:register(386)) - Registered Data node : b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host: localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 3 pipelines of type RATIS and factor ONE.
2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:45,094 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:46,094 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:47,095 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:48,096 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:49,096 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:50,097 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:51,097 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:52,098 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:53,098 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:54,099 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
{code}
was:
After merging master into the upgrade branch in HDDS-5321, an intermittent failure was noticed in TestSCMNodeManager#testLayoutOnHeartbeat: https://github.com/apache/ozone/runs/2787582345
The issue occurs in SCMNodeManager#register, where the node is added to the nodeStateManager firing the NEW_NODE event, before the node report containing storage information for the new node is processed. The event triggers a one shot run on the background pipeline creator which will read the node's storage information to determine if it can hold a pipeline. If the storage report has not yet been processed when this happens, no pipeline will be created to use the new node when it is registered, because the node still appears to have no free space.
Relevant log lines from the test failure:
{code:log}
2021-06-09 21:04:44,087 [Listener at 0.0.0.0/34005] INFO net.NetworkTopologyImpl (NetworkTopologyImpl.java:add(112)) - Added a new node: /default-rack/b06583c0-2c53-452b-83e4-398ff0104f72
2021-06-09 21:04:44,087 [RatisPipelineUtilsThread - 0] WARN pipeline.PipelinePlacementPolicy (PipelinePlacementPolicy.java:filterViableNodes(151)) - Pipeline creation failed due to no sufficient healthy datanodes. Required 3. Found 2.
2021-06-09 21:04:44,088 [EventQueue-NewNodeForNewNodeHandler] INFO pipeline.BackgroundPipelineCreator (BackgroundPipelineCreatorV2.java:notifyEventTriggered(282)) - trigger a one-shot run on RatisPipelineUtilsThread.
2021-06-09 21:04:44,088 [RatisPipelineUtilsThread - 0] INFO pipeline.RatisPipelineProvider (RatisPipelineProvider.java:lambda$create$0(170)) - Sending CreatePipelineCommand for pipeline:PipelineID=8bfba789-d337-4fed-9eb6-b1debd3d19e8 to datanode:b06583c0-2c53-452b-83e4-398ff0104f72
2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO pipeline.PipelineStateManager (PipelineStateManagerV2Impl.java:addPipeline(101)) - Created pipeline Pipeline[ Id: 8bfba789-d337-4fed-9eb6-b1debd3d19e8, Nodes: b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host: localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/ONE, State:ALLOCATED, leaderId:, CreationTimestamp2021-06-09T21:04:44.088Z].
2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO ha.SCMHAInvocationHandler (SCMHAInvocationHandler.java:invokeRatis(113)) - Invoking method public abstract void org.apache.hadoop.hdds.scm.pipeline.StateManager.addPipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$Pipeline) throws java.io.IOException on target org.apache.hadoop.hdds.scm.ha.MockSCMHAManager$MockRatisServer@5bf60155, cost 655.117us
2021-06-09 21:04:44,091 [RatisPipelineUtilsThread - 0] WARN pipeline.PipelinePlacementPolicy (PipelinePlacementPolicy.java:filterViableNodes(170)) - Pipeline creation failed due to no sufficient healthy datanodes with enough space for even a single container. Required 3. Found 2. Container size 5368709120.
2021-06-09 21:04:44,092 [Listener at 0.0.0.0/34005] INFO node.SCMNodeManager (SCMNodeManager.java:register(386)) - Registered Data node : b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host: localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 3 pipelines of type RATIS and factor ONE.
2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:45,094 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:46,094 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:47,095 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:48,096 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:49,096 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:50,097 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:51,097 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:52,098 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:53,098 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
2021-06-09 21:04:54,099 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
{code}
> Fix datanode capacity related race condition
> --------------------------------------------
>
> Key: HDDS-5336
> URL: https://issues.apache.org/jira/browse/HDDS-5336
> Project: Apache Ozone
> Issue Type: Sub-task
> Reporter: Ethan Rose
> Assignee: Ethan Rose
> Priority: Major
>
> After merging master into the upgrade branch in HDDS-5321, an intermittent failure was noticed in TestSCMNodeManager#testLayoutOnHeartbeat: https://github.com/apache/ozone/runs/2787582345
> The issue occurs in SCMNodeManager#register, where the node is added to the nodeStateManager firing the NEW_NODE event, before the node report containing storage information for the new node is processed. The event triggers a one shot run on the background pipeline creator which will read the node's storage information to determine if it can hold a pipeline. If the storage report has not yet been processed when this happens, no pipeline will be created to use the new node when it is registered, because the node still appears to have no free space.
> Relevant log lines from the test failure:
> {code}
> 2021-06-09 21:04:44,087 [Listener at 0.0.0.0/34005] INFO net.NetworkTopologyImpl (NetworkTopologyImpl.java:add(112)) - Added a new node: /default-rack/b06583c0-2c53-452b-83e4-398ff0104f72
> 2021-06-09 21:04:44,087 [RatisPipelineUtilsThread - 0] WARN pipeline.PipelinePlacementPolicy (PipelinePlacementPolicy.java:filterViableNodes(151)) - Pipeline creation failed due to no sufficient healthy datanodes. Required 3. Found 2.
> 2021-06-09 21:04:44,088 [EventQueue-NewNodeForNewNodeHandler] INFO pipeline.BackgroundPipelineCreator (BackgroundPipelineCreatorV2.java:notifyEventTriggered(282)) - trigger a one-shot run on RatisPipelineUtilsThread.
> 2021-06-09 21:04:44,088 [RatisPipelineUtilsThread - 0] INFO pipeline.RatisPipelineProvider (RatisPipelineProvider.java:lambda$create$0(170)) - Sending CreatePipelineCommand for pipeline:PipelineID=8bfba789-d337-4fed-9eb6-b1debd3d19e8 to datanode:b06583c0-2c53-452b-83e4-398ff0104f72
> 2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO pipeline.PipelineStateManager (PipelineStateManagerV2Impl.java:addPipeline(101)) - Created pipeline Pipeline[ Id: 8bfba789-d337-4fed-9eb6-b1debd3d19e8, Nodes: b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host: localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}, ReplicationConfig: RATIS/ONE, State:ALLOCATED, leaderId:, CreationTimestamp2021-06-09T21:04:44.088Z].
> 2021-06-09 21:04:44,089 [RatisPipelineUtilsThread - 0] INFO ha.SCMHAInvocationHandler (SCMHAInvocationHandler.java:invokeRatis(113)) - Invoking method public abstract void org.apache.hadoop.hdds.scm.pipeline.StateManager.addPipeline(org.apache.hadoop.hdds.protocol.proto.HddsProtos$Pipeline) throws java.io.IOException on target org.apache.hadoop.hdds.scm.ha.MockSCMHAManager$MockRatisServer@5bf60155, cost 655.117us
> 2021-06-09 21:04:44,091 [RatisPipelineUtilsThread - 0] WARN pipeline.PipelinePlacementPolicy (PipelinePlacementPolicy.java:filterViableNodes(170)) - Pipeline creation failed due to no sufficient healthy datanodes with enough space for even a single container. Required 3. Found 2. Container size 5368709120.
> 2021-06-09 21:04:44,092 [Listener at 0.0.0.0/34005] INFO node.SCMNodeManager (SCMNodeManager.java:register(386)) - Registered Data node : b06583c0-2c53-452b-83e4-398ff0104f72{ip: 187.106.219.59, host: localhost-187.106.219.59, ports: [STANDALONE=0, RATIS=0, REST=0, REPLICATION=0, RATIS_ADMIN=0, RATIS_SERVER=0], networkLocation: /default-rack, certSerialId: null, persistedOpState: IN_SERVICE, persistedOpStateExpiryEpochSec: 0}
> 2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 3 pipelines of type RATIS and factor ONE.
> 2021-06-09 21:04:44,093 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> 2021-06-09 21:04:45,094 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> 2021-06-09 21:04:46,094 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> 2021-06-09 21:04:47,095 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> 2021-06-09 21:04:48,096 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> 2021-06-09 21:04:49,096 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> 2021-06-09 21:04:50,097 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> 2021-06-09 21:04:51,097 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> 2021-06-09 21:04:52,098 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> 2021-06-09 21:04:53,098 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> 2021-06-09 21:04:54,099 [Listener at 0.0.0.0/34005] INFO node.TestSCMNodeManager (TestSCMNodeManager.java:lambda$assertPipelines$10(463)) - Found 0 pipelines of type RATIS and factor THREE.
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org