Posted to issues@ozone.apache.org by "Bharat Viswanadham (Jira)" <ji...@apache.org> on 2021/04/02 02:02:00 UTC

[jira] [Resolved] (HDDS-5058) Make getScmInfo retry for a duration

     [ https://issues.apache.org/jira/browse/HDDS-5058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bharat Viswanadham resolved HDDS-5058.
--------------------------------------
    Fix Version/s: 1.2.0
       Resolution: Fixed

> Make getScmInfo retry for a duration
> ------------------------------------
>
>                 Key: HDDS-5058
>                 URL: https://issues.apache.org/jira/browse/HDDS-5058
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Bharat Viswanadham
>            Assignee: Bharat Viswanadham
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.2.0
>
>
> Previously, during OM init, getScmInfo was retried with RetryForEverWithFixedSleep, but this was removed as part of SCM HA.
> This Jira proposes retrying getScmInfo for a certain duration, instead of retrying forever with a fixed sleep.
> We have seen this issue in a few docker test CI runs: OM init failed after 15 retries because SCM started later.
> {code:java}
> om1_1       | 2021-03-31 17:03:48,184 [main] WARN server.ServerUtils: ozone.om.db.dirs is not configured. We recommend adding this setting. Falling back to ozone.metadata.dirs instead.
> om1_1       | 2021-03-31 17:03:52,453 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm2:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 1 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:03:54,455 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm3:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 2 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:03:56,457 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm1:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 3 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:03:58,466 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm2:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 4 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:00,498 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm3:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 5 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:02,522 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm1:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 6 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:04,533 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm2:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 7 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:06,535 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm3:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 8 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:08,537 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm1:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 9 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:10,541 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm2:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 10 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:12,543 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm3:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 11 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:14,546 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm1:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm1,nodeAddress=scm1/172.20.0.8:9863 after 12 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:16,550 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm2:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm2,nodeAddress=scm2/172.20.0.6:9863 after 13 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:18,553 [main] INFO retry.RetryInvocationHandler: com.google.protobuf.ServiceException: java.net.ConnectException: Call From om1/172.20.0.4 to scm3:9863 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused, while invoking $Proxy31.send over nodeId=scm3,nodeAddress=scm3/172.20.0.7:9863 after 14 failover attempts. Trying to failover after sleeping for 2000ms.
> om1_1       | 2021-03-31 17:04:20,795 [main] ERROR om.OzoneManager: Could not initialize OM version file
> om1_1       | org.apache.hadoop.ipc.RemoteException(org.apache.ratis.protocol.exceptions.NotLeaderException): Server 9cb7a7ae-4c40-401c-b1c6-55728c1f0907@group-C35E1BD0DE21 is not the leader
> om1_1       | 	at org.apache.hadoop.hdds.scm.ha.SCMRatisServerImpl.triggerNotLeaderException(SCMRatisServerImpl.java:245)
> om1_1       | 	at org.apache.hadoop.hdds.scm.protocol.ScmBlockLocationProtocolServerSideTranslatorPB.send(ScmBlockLocationProtocolServerSideTranslatorPB.java:108)
> om1_1       | 	at org.apache.hadoop.hdds.protocol.proto.ScmBlockLocationProtocolProtos$ScmBlockLocationProtocolService$2.callBlockingMethod(ScmBlockLocationProtocolProtos.java:13874)
> om1_1       | 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> om1_1       | 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1086)
> om1_1       | 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1029)
> om1_1       | 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:957)
> om1_1       | 	at java.base/java.security.AccessController.doPrivileged(Native Method)
> om1_1       | 	at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> om1_1       | 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1762)
> om1_1       | 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2957)
> om1_1       | 
> {code}
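
For illustration only, here is a minimal plain-Java sketch of "retry for a duration with a fixed sleep", the behaviour proposed in the description above. It is not the patch attached to this issue. The class name BoundedRetry and the method retryForDuration are hypothetical, and the sketch uses only the JDK. Hadoop's RetryPolicies class has a comparable time-bounded policy (retryUpToMaximumTimeWithFixedSleep), but this message does not say whether the fix uses it.

```java
import java.time.Duration;
import java.util.concurrent.Callable;

/** Hypothetical helper: retry a call with a fixed sleep until a deadline passes. */
public final class BoundedRetry {

  private BoundedRetry() {
  }

  /**
   * Invokes {@code call} until it succeeds or {@code maxWait} has elapsed.
   * Sleeps {@code sleep} between attempts (shortened near the deadline).
   * If the deadline passes, the exception from the last attempt is rethrown.
   */
  public static <T> T retryForDuration(Callable<T> call, Duration maxWait,
      Duration sleep) throws Exception {
    final long deadline = System.nanoTime() + maxWait.toNanos();
    while (true) {
      try {
        return call.call();
      } catch (Exception e) {
        long remainingNanos = deadline - System.nanoTime();
        if (remainingNanos <= 0) {
          // Out of time: surface the last failure to the caller.
          throw e;
        }
        // Never sleep past the deadline; sleep at least 1 ms to avoid spinning.
        long remainingMillis = Math.max(1L, remainingNanos / 1_000_000L);
        Thread.sleep(Math.min(sleep.toMillis(), remainingMillis));
      }
    }
  }
}
```

With a helper like this, the caller chooses how long OM init should wait for SCM (for instance a few minutes with the 2000 ms sleep seen in the log) rather than relying on a fixed number of failover attempts.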



