Posted to issues@ozone.apache.org by "Tsz-wo Sze (Jira)" <ji...@apache.org> on 2023/05/02 15:44:00 UTC

[jira] [Assigned] (HDDS-8366) OzoneManager hangs when submitRequestToRatis

     [ https://issues.apache.org/jira/browse/HDDS-8366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz-wo Sze reassigned HDDS-8366:
--------------------------------

    Assignee: Sumit Agrawal  (was: Tsz-wo Sze)

> OzoneManager hangs when submitRequestToRatis
> --------------------------------------------
>
>                 Key: HDDS-8366
>                 URL: https://issues.apache.org/jira/browse/HDDS-8366
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM, Ozone Manager
>    Affects Versions: 1.3.0
>            Reporter: Hongbing Wang
>            Assignee: Sumit Agrawal
>            Priority: Major
>         Attachments: om.abnormal.jstack, om.normal.jstack, om_rpc_callqueue_ accumulation.png
>
>
> All OM RPC handlers hang when calling `OzoneManagerRatisServer#submitRequestToRatis`. The key stack is as follows:
> {noformat}
> "IPC Server handler 99 on 9862" #187 daemon prio=5 os_prio=0 tid=0x00007f1897b4c000 nid=0x10fa63 waiting on condition [0x00007f05a5b48000]
>    java.lang.Thread.State: WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for  <0x00007f08a185e050> (a java.util.concurrent.CompletableFuture$Signaller)
> 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 	at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1693)
> 	at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
> 	at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1729)
> 	at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
> 	at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
> 	at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
> 	at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
> 	at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$696/251832800.apply(Unknown Source)
> 	at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> 	at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
> 	at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> 	at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:524)
> 	at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1025)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:886)
> 	at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:828)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:422)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1903)
> 	at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2716)
>    Locked ownable synchronizers:
> 	- None
> {noformat}
> For the complete abnormal stack, see [^om.abnormal.jstack] (also available at this [web link|https://github.com/whbing/issue_logs/blob/main/ozone/omrpc20230323/om.abnormal.jstack]).
> For comparison, the normal stack is in [^om.normal.jstack] (also available at this [web link|https://github.com/whbing/issue_logs/blob/main/ozone/omrpc20230323/om.normal.jstack]).
> The IPC debug log is as follows:
> {noformat}
> 2023-03-22 13:17:56,135 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Successfully authorized userInfo {
>   effectiveUser: "xxx"
> }
> protocol: "org.apache.hadoop.hdds.protocol.GenericRefreshProtocol"
> 2023-03-22 13:17:56,135 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server:  got #0
> 2023-03-22 13:17:57,143 [IPC Server idle connection scanner for port 9862] DEBUG org.apache.hadoop.ipc.Server: IPC Server idle connection scanner for port 9862: task running
> 2023-03-22 13:17:57,946 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server:  got #-4
> 2023-03-22 13:17:57,946 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Received ping message
> 2023-03-22 13:18:07,143 [IPC Server idle connection scanner for port 9862] DEBUG org.apache.hadoop.ipc.Server: IPC Server idle connection scanner for port 9862: task running
> 2023-03-22 13:18:13,536 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server:  got #-4
> 2023-03-22 13:18:13,536 [Socket Reader #1 for port 9862] DEBUG org.apache.hadoop.ipc.Server: Received ping message
> {noformat}
> RPCs are backlogged in the callQueue:
>  !om_rpc_callqueue_ accumulation.png! 
>  
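The stack in the report shows each IPC handler parked in the no-argument CompletableFuture.get() inside submitRequestToRatis, i.e. waiting with no timeout. The sketch below is a hypothetical, self-contained illustration (not Ozone or Ratis code) of why that pattern turns one reply that never arrives into a callQueue backlog: a fixed ThreadPoolExecutor stands in for the IPC handler pool, its LinkedBlockingQueue stands in for the callQueue, and the names BlockedHandlerDemo, submitUnbounded, submitBounded and backlog are invented for this example. The bounded variant only shows what a timed wait would do; it is not a proposed fix.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical illustration only; not taken from the Ozone code base.
final class BlockedHandlerDemo {

  // Pattern seen in the jstack: the handler parks in CompletableFuture.get()
  // with no timeout until the reply future completes.
  static String submitUnbounded(CompletableFuture<String> reply) throws Exception {
    return reply.get();
  }

  // Hypothetical bounded variant: the handler is released after timeoutMs
  // even if the reply never arrives.
  static String submitBounded(CompletableFuture<String> reply, long timeoutMs)
      throws Exception {
    try {
      return reply.get(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (TimeoutException e) {
      return "TIMEOUT";
    }
  }

  // Submits `requests` calls to a pool of `handlers` threads; every call waits
  // on the same never-completed future. Returns how many calls are still
  // sitting in the queue, i.e. the backlog.
  static int backlog(int handlers, int requests) throws InterruptedException {
    LinkedBlockingQueue<Runnable> callQueue = new LinkedBlockingQueue<>();
    ThreadPoolExecutor pool = new ThreadPoolExecutor(
        handlers, handlers, 0L, TimeUnit.MILLISECONDS, callQueue);
    CompletableFuture<String> stuckReply = new CompletableFuture<>();
    for (int i = 0; i < requests; i++) {
      pool.execute(() -> {
        try {
          submitUnbounded(stuckReply);
        } catch (Exception ignored) {
          // Ignored in this demo; the future completes normally below.
        }
      });
    }
    // Each of the first `handlers` calls started a thread that is now blocked,
    // so every later call stays in the queue.
    int queued = callQueue.size();
    stuckReply.complete("done"); // unblock everything so the pool can exit
    pool.shutdown();
    pool.awaitTermination(5, TimeUnit.SECONDS);
    return queued;
  }

  public static void main(String[] args) throws Exception {
    System.out.println("queued calls: " + backlog(4, 10));
    System.out.println("bounded wait: "
        + submitBounded(new CompletableFuture<>(), 100));
  }
}
```

With 4 handlers and 10 calls, backlog returns 6: every handler is occupied by a call that cannot finish, so every later call stays queued, which matches the accumulation shown in the callQueue graph. A timed get() would free the handler thread, but it would not explain why the Ratis reply never completes, which is the underlying question in this issue.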



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
