Posted to issues@ozone.apache.org by "Li Cheng (Jira)" <ji...@apache.org> on 2020/02/23 16:37:00 UTC

[jira] [Comment Edited] (HDDS-3004) OM HA stability issues

    [ https://issues.apache.org/jira/browse/HDDS-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17042978#comment-17042978 ] 

Li Cheng edited comment on HDDS-3004 at 2/23/20 4:36 PM:
---------------------------------------------------------

After a while, both old pipelines have leader datanodes:

[root@VM_50_210_centos ~]# ./ozone-0.5.0-SNAPSHOT/bin/ozone scmcli datanode list
 Datanode: 316c3dd3-470b-4ebc-a139-766e2f1b8593 (/default-rack/9.134.51.25/9.134.51.25/2 pipelines) 
 Related pipelines: 
 {color:#ff0000}4461d34e-c509-4175-944f-83fbe8ae1095{color}/THREE/RATIS/ALLOCATED/{color:#ff0000}Leader{color}
 e2b8a389-bfd2-4b34-b43d-361cbc02c7f9/THREE/RATIS/ALLOCATED/Follower

Datanode: 5ab5305d-f733-44e9-9dcd-ace391b5a9dc (/default-rack/9.134.51.232/9.134.51.232/2 pipelines) 
 Related pipelines: 
 4461d34e-c509-4175-944f-83fbe8ae1095/THREE/RATIS/ALLOCATED/Follower
 e2b8a389-bfd2-4b34-b43d-361cbc02c7f9/THREE/RATIS/ALLOCATED/Follower

Datanode: 6da6b84b-3d8e-4309-ab28-7cc72b4e7293 (/default-rack/9.134.51.215/ozone.s3/2 pipelines) 
 Related pipelines: 
 4461d34e-c509-4175-944f-83fbe8ae1095/THREE/RATIS/ALLOCATED/Follower
 {color:#ff0000}e2b8a389-bfd2-4b34-b43d-361cbc02c7f9{color}/THREE/RATIS/ALLOCATED/{color:#ff0000}Leader{color}

 

{color:#172b4d}Meanwhile, SCM now sees the same pipelines, but still cannot move them to the OPEN state.{color}

{color:#172b4d}[root@VM_50_210_centos ~]# ./ozone-0.5.0-SNAPSHOT/bin/ozone scmcli pipeline list
 Pipeline[ Id: 4461d34e-c509-4175-944f-83fbe8ae1095,{color} Nodes: 5ab5305d-f733-44e9-9dcd-ace391b5a9dc\{ip: 9.134.51.232, host: 9.134.51.232, networkLocation: /default-rack, certSerialId: null}316c3dd3-470b-4ebc-a139-766e2f1b8593\{ip: 9.134.51.25, host: 9.134.51.25, networkLocation: /default-rack, certSerialId: null}6da6b84b-3d8e-4309-ab28-7cc72b4e7293\{ip: 9.134.51.215, host: ozone.s3, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:316c3dd3-470b-4ebc-a139-766e2f1b8593, CreationTimestamp2020-02-23T16:28:52.131Z]
 Pipeline[ Id: {color:#de350b}e2b8a389-bfd2-4b34-b43d-361cbc02c7f9{color}, Nodes: 316c3dd3-470b-4ebc-a139-766e2f1b8593\{ip: 9.134.51.25, host: 9.134.51.25, networkLocation: /default-rack, certSerialId: null}5ab5305d-f733-44e9-9dcd-ace391b5a9dc\{ip: 9.134.51.232, host: 9.134.51.232, networkLocation: /default-rack, certSerialId: null}6da6b84b-3d8e-4309-ab28-7cc72b4e7293\{ip: 9.134.51.215, host: ozone.s3, networkLocation: /default-rack, certSerialId: null}, Type:RATIS, Factor:THREE, State:ALLOCATED, leaderId:6da6b84b-3d8e-4309-ab28-7cc72b4e7293, CreationTimestamp2020-02-23T16:28:52.130Z]
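Both pipelines above stay in ALLOCATED even though each has elected a leader. As a rough mental model (a simplified sketch, not Ozone's actual SCM code; the class and method names here are invented), SCM would only move a pipeline to OPEN once every member datanode has reported it, so a single missing report keeps it stuck:

```python
# Simplified, hypothetical model of the pipeline state transition:
# a RATIS/THREE pipeline stays ALLOCATED until a pipeline report has
# arrived from every member datanode, then it may move to OPEN.
# Illustrative only - this is not Ozone's SCM implementation.

class Pipeline:
    def __init__(self, pipeline_id, members):
        self.pipeline_id = pipeline_id
        self.members = set(members)   # datanode UUIDs in the pipeline
        self.reported = set()         # members that sent a pipeline report
        self.state = "ALLOCATED"

    def on_pipeline_report(self, datanode_id):
        if datanode_id not in self.members:
            return  # ignore reports from non-members
        self.reported.add(datanode_id)
        # Open only once every replica has reported in.
        if self.state == "ALLOCATED" and self.reported == self.members:
            self.state = "OPEN"

p = Pipeline("4461d34e", ["dn1", "dn2", "dn3"])
p.on_pipeline_report("dn1")
p.on_pipeline_report("dn2")
assert p.state == "ALLOCATED"  # stuck while one report is missing
p.on_pipeline_report("dn3")
assert p.state == "OPEN"
```

Under this model, the symptom above would mean at least one datanode's pipeline report never reaches SCM (or is rejected) after the restart.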

 

SCM logs keep showing:

2020-02-24 00:36:18,028 [EventQueue-ContainerReportForContainerReportHandler] WARN org.apache.hadoop.hdds.scm.container.ContainerReportHandler: Container #12 is in OPEN state, but the datanode 316c3dd3-470b-4ebc-a139-766e2f1b8593\{ip: 9.134.51.25, host: 9.134.51.25, networkLocation: /default-rack, certSerialId: null} reports an QUASI_CLOSED replica.
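The warning indicates that SCM's container view and the datanode's replica report disagree: the container is OPEN on the SCM side while the replica is QUASI_CLOSED. A minimal sketch of such a consistency check (illustrative only; the function and its parameters are hypothetical, not Ozone's ContainerReportHandler API):

```python
# Hypothetical sketch of the consistency check behind the warning:
# if SCM believes the container is OPEN but a datanode reports a
# replica in a different state, the mismatch is logged, not repaired.
import logging

log = logging.getLogger("ContainerReportHandler")

def check_replica(container_id, container_state, datanode, replica_state):
    """Return True when the reported replica state matches SCM's view."""
    if container_state == "OPEN" and replica_state != "OPEN":
        log.warning(
            "Container #%s is in OPEN state, but the datanode %s "
            "reports a %s replica.",
            container_id, datanode, replica_state)
        return False
    return True

# Reproduces the shape of the mismatch seen in the SCM log above.
ok = check_replica(12, "OPEN", "316c3dd3", "QUASI_CLOSED")
assert not ok
```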



> OM HA stability issues
> ----------------------
>
>                 Key: HDDS-3004
>                 URL: https://issues.apache.org/jira/browse/HDDS-3004
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>          Components: om
>    Affects Versions: 0.4.0
>            Reporter: Li Cheng
>            Assignee: Bharat Viswanadham
>            Priority: Blocker
>
> To summarize, the _+{color:#ff0000}major issues{color}+_ I found are:
>  # When I run a long-running s3g write workload against a cluster with OM HA and stop the OM leader to force a re-election, writing stops and never recovers.
> --updates 2020-02-20:
> https://issues.apache.org/jira/browse/HDDS-3031 {color:#FF0000}fixes{color} this issue.
>  
> 2. If I force an OM re-election and restart SCM afterwards, the cluster cannot see any leader datanode and no datanodes can send pipeline reports, which makes the cluster unavailable as well. I consider this a multi-failover case where the leader OM and SCM sit on the same node and that node suffers a short outage.
>  
> --updates 2020-02-20:
>  When you do a jar swap to a new version of Ozone and enable OM HA while keeping the same ozone-site.xml as before, and you have already written data into the previous Ozone cluster (so existing version files and metadata exist for OM and SCM), SCM cannot come up after the jar swap.
> {color:#FF0000}Error logs{color}: PipelineID=aae4f728-82ef-4bbb-a0a5-7b3f2af030cc not found appears in the SCM out logs when the SCM process fails to start.
>  
> Original posting:
> Use the S3 gateway to keep writing data to a specific S3 gateway endpoint. After the writer starts, I kill the OM process on the OM leader host. From then on, the S3 gateway never accepts writes and keeps reporting InternalError for every new incoming key.
> Process Process-488:
>  S3UploadFailedError: Failed to upload ./20191204/file1056.dat to ozone-test-reproduce-123/./20191204/file1056.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
>  Process Process-489:
>  S3UploadFailedError: Failed to upload ./20191204/file9631.dat to ozone-test-reproduce-123/./20191204/file9631.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
>  Process Process-490:
>  S3UploadFailedError: Failed to upload ./20191204/file7520.dat to ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
>  Process Process-491:
>  S3UploadFailedError: Failed to upload ./20191204/file4220.dat to ozone-test-reproduce-123/./20191204/file4220.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
>  Process Process-492:
>  S3UploadFailedError: Failed to upload ./20191204/file5523.dat to ozone-test-reproduce-123/./20191204/file5523.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
>  Process Process-493:
>  S3UploadFailedError: Failed to upload ./20191204/file7520.dat to ozone-test-reproduce-123/./20191204/file7520.dat: An error occurred (500) when calling the PutObject operation (reached max retries: 4): Internal Server Error
> That is a partial list; note that all keys are different. I also tried re-enabling the OM process on the previous leader, but it does not help since leadership has changed. Also attaching partial OM logs:
>  2020-02-12 14:57:11,128 [IPC Server handler 72 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 72 on 9862, call Call#4859 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
>  org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
>  at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
>  at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  2020-02-12 14:57:11,918 [IPC Server handler 159 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 159 on 9862, call Call#4864 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
>  org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
>  at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
>  at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  2020-02-12 14:57:15,395 [IPC Server handler 23 on 9862] INFO org.apache.hadoop.ipc.Server: IPC Server handler 23 on 9862, call Call#4869 Retry#0 org.apache.hadoop.ozone.om.protocol.OzoneManagerProtocol.submitRequest from 9.134.50.210:36561
>  org.apache.hadoop.ozone.om.exceptions.OMNotLeaderException: OM:om1 is not the leader. Suggested leader is OM:om2.
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.createNotLeaderException(OzoneManagerProtocolServerSideTranslatorPB.java:183)
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:171)
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:107)
>  at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:72)
>  at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:97)
>  at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>  at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
>  at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:876)
>  at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:822)
>  at java.security.AccessController.doPrivileged(Native Method)
>  at javax.security.auth.Subject.doAs(Subject.java:422)
>  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>  at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2682)
>  
>  
> Also attaching the ozone-site.xml config used to enable OM HA:
> <property>
>  <name>ozone.om.service.ids</name>
>  <value>OMHA</value>
>  </property>
>  <property>
>  <name>ozone.om.nodes.OMHA</name>
>  <value>om1,om2,om3</value>
>  </property>
>  <property>
>  <name>ozone.om.node.id</name>
>  <value>om1</value>
>  </property>
>  <property>
>  <name>ozone.om.address.OMHA.om1</name>
>  <value>9.134.50.210:9862</value>
>  </property>
>  <property>
>  <name>ozone.om.address.OMHA.om2</name>
>  <value>9.134.51.215:9862</value>
>  </property>
>  <property>
>  <name>ozone.om.address.OMHA.om3</name>
>  <value>9.134.51.25:9862</value>
>  </property>
>  <property>
>  <name>ozone.om.ratis.enable</name>
>  <value>true</value>
>  </property>
>  <property>
>  <name>ozone.enabled</name>
>  <value>true</value>
>  <tag>OZONE, REQUIRED</tag>
>  <description>
>  Status of the Ozone Object Storage service is enabled.
>  Set to true to enable Ozone.
>  Set to false to disable Ozone.
>  Unless this value is set to true, Ozone services will not be started in
>  the cluster.
> Please note: By default ozone is disabled on a hadoop cluster.
>  </description>
>  </property>
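The repeated OMNotLeaderException entries in the OM logs each name a suggested leader, which a client-side proxy is expected to follow on retry. A minimal, hypothetical sketch of that failover loop (the exception class and `submit_with_failover` are stand-ins for illustration, not Hadoop's real API):

```python
# Hedged sketch of client-side OM failover: on a not-leader error,
# retry the request against the suggested leader from the exception.
# OMNotLeaderException here is a stand-in, not the real Hadoop class.

class OMNotLeaderException(Exception):
    def __init__(self, suggested_leader):
        super().__init__(f"not the leader; suggested leader is {suggested_leader}")
        self.suggested_leader = suggested_leader

def submit_with_failover(oms, request, start, max_retries=4):
    """oms maps node id -> callable taking a request; follows leader hints."""
    current = start
    for _ in range(max_retries):
        try:
            return oms[current](request)
        except OMNotLeaderException as e:
            current = e.suggested_leader  # follow the hint, as in the logs
    raise RuntimeError("no OM leader found after retries")
```

If the s3g client never follows the hint (or the new leader never commits), every request keeps landing on the old leader and fails, which would match the permanent InternalError responses reported above.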



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: ozone-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: ozone-issues-help@hadoop.apache.org