Posted to issues@ozone.apache.org by "Tsz-wo Sze (Jira)" <ji...@apache.org> on 2023/05/04 03:48:00 UTC

[jira] [Resolved] (HDDS-7925) Deadlocks among all OMs in OM HA

     [ https://issues.apache.org/jira/browse/HDDS-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tsz-wo Sze resolved HDDS-7925.
------------------------------
    Resolution: Duplicate

> Deadlocks among all OMs in OM HA
> --------------------------------
>
>                 Key: HDDS-7925
>                 URL: https://issues.apache.org/jira/browse/HDDS-7925
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM HA
>    Affects Versions: 1.3.0
>         Environment: Configuration: FSO enabled, OM HA, SCM HA
>            Reporter: Kohei Sugihara
>            Assignee: Duong
>            Priority: Major
>         Attachments: HDDS-7925-logs-2023-01-27.tar.gz, om-audit.2.tar.gz, om-logs.2.tar.gz
>
>
> h2. Overview
>  * All S3 requests timed out and received no response.
>  * No failover to a follower OM occurred; the deadlock left all OMs stuck.
>  * Because of the deadlock, none of the OM APIs could serve any responses.
>  * Switching to non-HA did not reproduce the deadlock.
> h2. Environment
>  * OM x3 with HA: nodeA, nodeB, nodeC
>  * SCM x3 with HA: nodeA, nodeB, nodeC
>  * DN x36: node[0:35]
>  * S3G x36: node[0:35]
> h2. Problem Details
> All requests in the IPC Server handlers were stuck in a wait state, either waiting to acquire a bucket lock to read or write keys, or waiting for a response from OM Ratis. However, it is unclear which component holds the lock, because the waiting threads are spread across all OMs.
> The attached archive includes stack traces of each OM taken while the deadlock was occurring. The stack traces were captured ten times at one-second intervals (the one-liner used is attached in a comment below). The OM nodes had the following roles:
>  * Leader: nodeB
>  * Follower: nodeA, nodeC
> In the log file (nodeB.jstack.log.2.0), the jstack output from nodeB (the OM leader) shows two kinds of waiting threads. We inspected some of the threads waiting for a bucket lock with the jdb debugger; all of them are waiting for the lock on the same bucket to be released, but it is unclear which thread acquired that lock.
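> As a hedged diagnostic idea, assuming the bucket lock is ultimately backed by a java.util.concurrent.locks.ReentrantReadWriteLock (an assumption, not verified against the Ozone sources), a small wrapper like the hypothetical sketch below can report the write-lock owner whenever a read-lock attempt times out; the class and method names here are ours, not part of Ozone:
> {code:java}
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> // Hypothetical diagnostic wrapper, not part of Ozone: it exposes the
> // protected getOwner() of ReentrantReadWriteLock so a stalled reader can
> // report which thread currently holds the write lock.
> public class DiagnosticRwLock extends ReentrantReadWriteLock {
>
>   /** Try the read lock; on timeout, print the current write-lock owner. */
>   public boolean acquireReadLockOrReport(long timeout, TimeUnit unit)
>       throws InterruptedException {
>     if (readLock().tryLock(timeout, unit)) {
>       return true;                      // caller must unlock in a finally block
>     }
>     Thread owner = getOwner();          // null when no thread holds the write lock
>     System.err.println("bucket read lock timed out; write lock owner: "
>         + (owner == null ? "<none or read-locked>" : owner.getName()));
>     return false;
>   }
> }
> {code}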
> h3. Case #1: Waiting to acquire a bucket lock
> {quote}"IPC Server handler 33 on default port 9862" #119 daemon prio=5 os_prio=0 cpu=62093.42ms elapsed=13319.52s allocated=5558M defined_classes=60 tid=0x00007f615f680890 nid=0x3d3b2e waiting on condition  [0x00007f60d2214000]
> ...
>     at org.apache.hadoop.ozone.om.lock.OzoneManagerLock.acquireReadLock(OzoneManagerLock.java:146)
>     at org.apache.hadoop.ozone.om.BucketManagerImpl.getBucketInfo(BucketManagerImpl.java:64)
>     at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4272)
>     at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4290)
>     at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4237)
>     at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4208)
>     at org.apache.hadoop.ozone.om.OzoneManager.lambda$1(OzoneManager.java:2819)
>     at org.apache.hadoop.ozone.om.OzoneManager$$Lambda$1096/0x00000008013f2c10.get(Unknown Source)
>     at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:36)
>     at org.apache.hadoop.ozone.om.OzoneManager.lookupKey(OzoneManager.java:2817)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lookupKey(OzoneManagerRequestHandler.java:502)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:191)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:226)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:175)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
> ...
> {quote}
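> The trace above shows a handler parked while trying to take the bucket read lock. The minimal sketch below (plain JDK code, not the actual OzoneManagerLock implementation) reproduces the same picture: once a write lock is taken and never released, for example on an error path that skips unlock(), every later reader parks indefinitely in the "waiting on condition" state seen above:
> {code:java}
> import java.util.concurrent.locks.ReentrantReadWriteLock;
>
> // Minimal reproduction sketch, assuming the bucket lock behaves like a
> // ReentrantReadWriteLock. A writer takes the lock and never releases it;
> // every later reader then parks inside readLock().lock().
> public class LeakedWriteLockDemo {
>   public static void main(String[] args) throws InterruptedException {
>     ReentrantReadWriteLock bucketLock = new ReentrantReadWriteLock();
>
>     // Faulty writer: acquires the write lock and never unlocks it.
>     bucketLock.writeLock().lock();
>
>     // Simulated IPC handler thread: blocks forever on the read lock.
>     Thread handler = new Thread(() -> {
>       bucketLock.readLock().lock();     // never returns
>       try {
>         System.out.println("got bucket read lock");
>       } finally {
>         bucketLock.readLock().unlock();
>       }
>     }, "IPC Server handler 33");
>     handler.setDaemon(true);            // let the demo JVM exit at the end
>     handler.start();
>     handler.join(2000);
>     System.out.println("handler state after 2s: " + handler.getState()); // WAITING
>   }
> }
> {code}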
> h3. Case #2: Waiting for Ratis
> {quote}"IPC Server handler 35 on default port 9862" #121 daemon prio=5 os_prio=0 cpu=59113.87ms elapsed=13319.52s allocated=5395M defined_classes=7 tid=0x00007f615f6839f0 nid=0x3d3b30 waiting on condition  [0x00007f60d2014000]
> ...
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> ...
> {quote}
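> Here the handler is blocked in the Ratis submission path waiting for the replication result. The sketch below only illustrates that pattern with a plain CompletableFuture (it does not use the real OzoneManagerRatisServer or Ratis client APIs): an unbounded wait on the reply future keeps the handler parked as long as the consensus round is stuck, whereas a bounded wait surfaces the problem as a timeout:
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.ExecutionException;
> import java.util.concurrent.TimeUnit;
> import java.util.concurrent.TimeoutException;
>
> // Illustration only; this is not the OzoneManagerRatisServer or Ratis API.
> // The write path hands the request to the consensus layer as a future:
> // an unbounded get() keeps the handler parked while the round is stuck,
> // while a bounded get() converts the hang into a visible timeout.
> public class RatisWaitSketch {
>
>   static <T> T awaitReply(CompletableFuture<T> reply, long timeoutSeconds)
>       throws InterruptedException, ExecutionException, TimeoutException {
>     return reply.get(timeoutSeconds, TimeUnit.SECONDS);
>   }
>
>   public static void main(String[] args) throws Exception {
>     // A reply that never completes stands in for a stuck Ratis round.
>     CompletableFuture<String> stuckReply = new CompletableFuture<>();
>     try {
>       awaitReply(stuckReply, 2);
>     } catch (TimeoutException e) {
>       System.out.println("request failed fast instead of hanging the handler");
>     }
>   }
> }
> {code}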
> h2. Workaround
> We found two quick fixes:
>  * Reboot all OMs in an appropriate order: when the deadlock occurs, rebooting all OMs, starting with the old OM leader, recovers from the timeouts. However, the deadlock recurs within a few days.
>  * Switch to non-HA: after switching to non-HA, we have not hit the timeout issue again.


