Posted to issues@ozone.apache.org by "Duong (Jira)" <ji...@apache.org> on 2023/02/18 20:37:00 UTC

[jira] [Comment Edited] (HDDS-7925) Deadlocks among all OMs in OM HA

    [ https://issues.apache.org/jira/browse/HDDS-7925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690816#comment-17690816 ] 

Duong edited comment on HDDS-7925 at 2/18/23 8:36 PM:
------------------------------------------------------

Thanks for being patient in helping us investigate this, [~ksugihara].

From the fork info and commit hash, this is likely the problem that was already detected and fixed upstream in HDDS-7755. It originates from [this line|https://github.com/pfnet/ozone/blob/9c61a8aa497ab96c014ad3bb7b1ee4f731ebfaf8/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/KeyManagerImpl.java#L362-L362], copied below.
{code:java}
metadataManager.getLock().acquireReadLock(BUCKET_LOCK, volumeName,
    bucketName);

BucketLayout bucketLayout =
    getBucketLayout(metadataManager, args.getVolumeName(),
        args.getBucketName());
keyName = OMClientRequest
    .validateAndNormalizeKey(enableFileSystemPaths, keyName,
        bucketLayout);

OmKeyInfo value = null;
try {
  ....
} finally {
  metadataManager.getLock().releaseReadLock(BUCKET_LOCK, volumeName,
      bucketName);
} {code}
This is indeed consistent with the stack traces in your audit log.
{code:java}
$ cat om-audit.2/nodeB.om-audit.2.log | grep '2023-01-27 13:' -A1 | grep ERROR -A1 | grep -v 2023- | sort | uniq -c
   2756         at org.apache.hadoop.ozone.om.KeyManagerImpl.getOzoneFileStatusFSO(KeyManagerImpl.java:1379)
   2208         at org.apache.hadoop.ozone.om.KeyManagerImpl.readKeyInfo(KeyManagerImpl.java:391)
      1         at org.apache.hadoop.ozone.om.request.OMClientRequest.isValidKeyPath(OMClientRequest.java:564)
   4964 --{code}
OMClientRequest.isValidKeyPath is called (transitively) by OMClientRequest.validateAndNormalizeKey, which runs between the bucket lock acquisition and release but is not covered by the try-finally block that releases the lock.
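
For reference, the exception-safe shape is to enter the try block immediately after acquiring the lock, so that everything that can throw, including the key validation, is covered by the finally that releases it. A rough sketch of that pattern, reusing the identifiers from the snippet above (illustrative only, not the actual HDDS-7755 patch):
{code:java}
metadataManager.getLock().acquireReadLock(BUCKET_LOCK, volumeName,
    bucketName);
OmKeyInfo value = null;
try {
  // Anything that can throw, including key validation, now runs under
  // the finally that releases the bucket lock.
  BucketLayout bucketLayout =
      getBucketLayout(metadataManager, args.getVolumeName(),
          args.getBucketName());
  keyName = OMClientRequest
      .validateAndNormalizeKey(enableFileSystemPaths, keyName,
          bucketLayout);
  ....
} finally {
  metadataManager.getLock().releaseReadLock(BUCKET_LOCK, volumeName,
      bucketName);
} {code}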

Coincidentally, a read request with an invalid key path triggers an error that the OM fails to handle gracefully, so the bucket lock is never released by the handling thread.
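
To see why a single leaked read lock stalls every handler on that bucket: once a write request queues up behind the leaked read lock, later read requests are not allowed to jump ahead of the queued writer, so both reads and writes on the bucket block indefinitely. A minimal, self-contained illustration using a plain java.util.concurrent ReentrantReadWriteLock as a stand-in for the OM bucket lock (the class and "handler" names are hypothetical):
{code:java}
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class LeakedBucketLockDemo {
  public static void main(String[] args) throws Exception {
    ReentrantReadWriteLock bucketLock = new ReentrantReadWriteLock();

    // "Handler 1": takes the bucket read lock, then hits an error before
    // the release (the validateAndNormalizeKey failure in this issue).
    bucketLock.readLock().lock();
    // ... exception thrown here; readLock().unlock() never runs ...

    // "Handler 2": a write request on the same bucket queues up behind
    // the leaked read lock and blocks forever.
    Thread writer = new Thread(() -> bucketLock.writeLock().lock());
    writer.setDaemon(true);
    writer.start();
    TimeUnit.MILLISECONDS.sleep(200);

    // "Handler 3": even a new read request now blocks, because readers
    // do not barge past a queued writer.
    Thread reader = new Thread(() -> {
      try {
        boolean acquired =
            bucketLock.readLock().tryLock(1, TimeUnit.SECONDS);
        System.out.println("second reader acquired lock: " + acquired);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
    });
    reader.start();
    reader.join();   // prints "second reader acquired lock: false"
  }
} {code}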

Cherry-picking HDDS-7755 would fix the issue. Can you give it a try?


> Deadlocks among all OMs in OM HA
> --------------------------------
>
>                 Key: HDDS-7925
>                 URL: https://issues.apache.org/jira/browse/HDDS-7925
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: OM HA
>    Affects Versions: 1.3.0
>         Environment: Configuration: FSO enabled, OM HA, SCM HA
>            Reporter: Kohei Sugihara
>            Assignee: Duong
>            Priority: Critical
>         Attachments: HDDS-7925-logs-2023-01-27.tar.gz, om-audit.2.tar.gz, om-logs.2.tar.gz
>
>
> h2. Overview
>  * All S3 requests timed out and received no responses.
>  * No failover to the follower OMs occurred; the deadlock stuck all OMs.
>  * Because of the deadlock, no OM API could serve any response.
>  * After switching to non-HA, the deadlock does not recur.
> h2. Environment
>  * OM x3 with HA: nodeA, nodeB, nodeC
>  * SCM x3 with HA: nodeA, nodeB, nodeC
>  * DN x36: node[0:35]
>  * S3G x36: node[0:35]
> h2. Problem Details
> All requests in the IPC Server handlers were stuck in a wait state, waiting either to acquire a bucket lock for reading/writing keys or for a response from OM Ratis. However, which component holds the lock is unclear because the lock waits are spread across all OMs.
> The attached archive includes stack traces for each OM when the deadlock occurred. Stack traces were taken ten times for each OM, at one-second intervals (the one-liner used is attached in a comment below). The OM nodes are assigned the following roles:
>  * Leader: nodeB
>  * Follower: nodeA, nodeC
> In the log file (nodeB.jstack.log.2.0), the jstack output from nodeB (the OM leader) shows two kinds of waiting threads. We inspected some of the threads waiting for a bucket lock with the jdb debugger, and all of them are waiting for the lock on the same bucket to be freed. However, which thread acquired that lock is unclear.
> h3. Case #1: Waiting to acquire a bucket lock
> {quote}"IPC Server handler 33 on default port 9862" #119 daemon prio=5 os_prio=0 cpu=62093.42ms elapsed=13319.52s allocated=5558M defined_classes=60 tid=0x00007f615f680890 nid=0x3d3b2e waiting on condition  [0x00007f60d2214000]
> ...
>     at org.apache.hadoop.ozone.om.lock.OzoneManagerLock.acquireReadLock(OzoneManagerLock.java:146)
>     at org.apache.hadoop.ozone.om.BucketManagerImpl.getBucketInfo(BucketManagerImpl.java:64)
>     at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4272)
>     at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4290)
>     at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4237)
>     at org.apache.hadoop.ozone.om.OzoneManager.resolveBucketLink(OzoneManager.java:4208)
>     at org.apache.hadoop.ozone.om.OzoneManager.lambda$1(OzoneManager.java:2819)
>     at org.apache.hadoop.ozone.om.OzoneManager$$Lambda$1096/0x00000008013f2c10.get(Unknown Source)
>     at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:36)
>     at org.apache.hadoop.ozone.om.OzoneManager.lookupKey(OzoneManager.java:2817)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.lookupKey(OzoneManagerRequestHandler.java:502)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:191)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitReadRequestToOM(OzoneManagerProtocolServerSideTranslatorPB.java:226)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:175)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:147)
> ...
> {quote}
> h3. Case #2: Waiting for Ratis
> {quote}"IPC Server handler 35 on default port 9862" #121 daemon prio=5 os_prio=0 cpu=59113.87ms elapsed=13319.52s allocated=5395M defined_classes=7 tid=0x00007f615f6839f0 nid=0x3d3b30 waiting on condition  [0x00007f60d2014000]
> ...
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:285)
>     at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:247)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:217)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:198)
>     at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$690/0x0000000801311278.apply(Unknown Source)
>     at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> ...
> {quote}
> h2. Workaround
> We found two quick workarounds:
>  * Reboot all OMs in an appropriate order: when stuck, rebooting all OMs starting with the old OM leader recovers from the timeouts, but the deadlock recurs within a few days.
>  * Switch to non-HA: after switching to non-HA, we have not encountered the timeout issue.


