Posted to issues@zookeeper.apache.org by "Mohammad Arshad (Jira)" <ji...@apache.org> on 2022/04/05 08:16:00 UTC

[jira] [Updated] (ZOOKEEPER-4504) ZKUtil#deleteRecursive causing deadlock in HDFS HA functionality

     [ https://issues.apache.org/jira/browse/ZOOKEEPER-4504?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mohammad Arshad updated ZOOKEEPER-4504:
---------------------------------------
    Description: 
*Problem and Analysis:*
After integrating ZooKeeper 3.6.3, we observed a deadlock in the HDFS HA functionality, as shown in the thread dumps below.
{code:java}
"main-EventThread" #33 daemon prio=5 os_prio=0 tid=0x00007f9c017f1000 nid=0x101b waiting for monitor entry [0x00007f9bda8a6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.ha.ActiveStandbyElector.processWatchEvent(ActiveStandbyElector.java:603)
	- waiting to lock <0x00000000c17986c0> (a org.apache.hadoop.ha.ActiveStandbyElector)
	at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.process(ActiveStandbyElector.java:1193)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:626)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:582)
{code}
{code:java}
"main" #1 prio=5 os_prio=0 tid=0x00007f9c00060000 nid=0xea3 waiting on condition [0x00007f9c06404000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000c1b383c8> (a java.util.concurrent.Semaphore$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:999)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1306)
	at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
	at org.apache.zookeeper.ZKUtil.deleteInBatch(ZKUtil.java:122)
	at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:64)
	at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:76)
	at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:386)
	at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:383)
	at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1103)
	at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1095)
	at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:383)
	- locked <0x00000000c17986c0> (a org.apache.hadoop.ha.ActiveStandbyElector)
	at org.apache.hadoop.ha.ZKFailoverController.formatZK(ZKFailoverController.java:290)
	at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:227)
	at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:66)
	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:186)
	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:182)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:360)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1741)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:498)
	at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:182)
	at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:220)
{code}
org.apache.hadoop.ha.ActiveStandbyElector#clearParentZNode is synchronized on the ActiveStandbyElector instance and calls ZKUtil.deleteRecursive(zk, pathRoot).

ZKUtil.deleteRecursive now makes an async delete API call, with a MultiCallback as its callback.
Since processWatchEvent is being invoked, a watcher must have been set on pathRoot or one of its child paths, producing a delete notification.

When the delete API is called, the watch notification reaches the client before the actual delete response.
Both the notification and the delete response are processed through callbacks, one by one, via the common waitingEvents queue.

The notification is processed first, but it cannot complete: it cannot acquire the lock guarding processWatchEvent(), because that lock is already held by the thread executing clearParentZNode().
Since the delete notification cannot be processed, the MultiCallback is never taken from the queue for processing; it stays in the queue forever, and clearParentZNode() waits forever on the semaphore.
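
For illustration, here is a minimal, self-contained sketch of the ordering problem. The class, field and thread names are made up for this example (they are stand-ins, not the actual Hadoop/ZooKeeper code): a single event thread drains events one by one, so the blocked watch notification starves the callback that would release the semaphore.
{code:java}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.Semaphore;

public class DeadlockSketch {

    // Stand-in for the client's waitingEvents queue drained by main-EventThread.
    private final BlockingQueue<Runnable> waitingEvents = new LinkedBlockingQueue<>();
    private final Semaphore pendingOps = new Semaphore(0);

    // Stand-in for ActiveStandbyElector#clearParentZNode: holds the instance lock
    // while waiting for the async delete callback to release the semaphore.
    public synchronized void clearParentZNode() throws InterruptedException {
        waitingEvents.put(this::processWatchEvent); // watch notification queued first
        waitingEvents.put(pendingOps::release);     // "MultiCallback" queued behind it
        pendingOps.acquire();                       // blocks forever: callback never runs
    }

    // Stand-in for ActiveStandbyElector#processWatchEvent: needs the same lock.
    public synchronized void processWatchEvent() {
        // never entered while clearParentZNode() holds the lock
    }

    public static void main(String[] args) throws InterruptedException {
        DeadlockSketch elector = new DeadlockSketch();
        Thread eventThread = new Thread(() -> {
            try {
                while (true) {
                    elector.waitingEvents.take().run(); // events processed one by one
                }
            } catch (InterruptedException ignored) {
            }
        }, "main-EventThread");
        eventThread.setDaemon(true);
        eventThread.start();
        elector.clearParentZNode(); // the calling thread hangs here
    }
}
{code}
Running this hangs exactly like the reported thread dumps: the event thread is blocked on the instance monitor, and the caller is parked on the semaphore.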

 

*Why did this problem not happen with earlier versions (3.5.x)?*

In earlier ZK versions, ZKUtil.deleteRecursive used the sync delete API, so the delete response was processed directly, not through the callback.
As a result, clearParentZNode and processWatchEvent completed independently.
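
For reference, the old code path had roughly the shape of a plain synchronous depth-first delete. The sketch below only illustrates that shape (it is an assumption for illustration, not the exact 3.5.x ZKUtil code):
{code:java}
import java.util.List;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public class SyncDeleteSketch {
    // Deletes the children depth-first, then the node itself, using only sync calls,
    // so nothing gets queued behind pending watch notifications on the event thread.
    static void deleteRecursiveSync(ZooKeeper zk, String path)
            throws KeeperException, InterruptedException {
        List<String> children = zk.getChildren(path, false);
        for (String child : children) {
            deleteRecursiveSync(zk, path + "/" + child);
        }
        zk.delete(path, -1); // -1 means "any version"
    }
}
{code}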


*Proposed Fix:*
There are two approaches to fix this problem.
1. Fix the problem in HDFS: modify the HDFS code to avoid the deadlock. But we may hit similar bugs in other projects.
2. Fix the problem in ZK: make the existing API behave as before (use the sync API to delete the ZK nodes) and provide a new overloaded API with the new behavior (use the async API to delete the ZK nodes).

I propose to fix the problem with the 2nd approach.
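
As a rough illustration of the 2nd approach, the existing signature could go back to synchronous deletes while a new overload exposes the batched/async behavior. The signatures below are only a sketch of the idea, not the final API:
{code:java}
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooKeeper;

public final class ZKUtilProposalSketch {

    // Existing signature: restore the pre-3.6 behavior and delete synchronously,
    // so it stays safe to call while the caller holds its own locks.
    public static void deleteRecursive(ZooKeeper zk, String pathRoot)
            throws InterruptedException, KeeperException {
        // ... synchronous, depth-first deletes as in 3.5.x ...
    }

    // New overload: callers that want the faster batched/async behavior opt in
    // explicitly, knowing the callback is delivered on the client event thread.
    public static void deleteRecursive(ZooKeeper zk, String pathRoot, int batchSize)
            throws InterruptedException, KeeperException {
        // ... async multi()-based deletes in batches of batchSize ...
    }
}
{code}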

  was:
*Problem and Analysis:*
After integrating ZooKeeper 3.6.3, we observed a deadlock in the HDFS HA functionality, as shown in the thread dumps below.
{code:java}
"main-EventThread" #33 daemon prio=5 os_prio=0 tid=0x00007f9c017f1000 nid=0x101b waiting for monitor entry [0x00007f9bda8a6000]
   java.lang.Thread.State: BLOCKED (on object monitor)
	at org.apache.hadoop.ha.ActiveStandbyElector.processWatchEvent(ActiveStandbyElector.java:603)
	- waiting to lock <0x00000000c17986c0> (a org.apache.hadoop.ha.ActiveStandbyElector)
	at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.process(ActiveStandbyElector.java:1193)
	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:626)
	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:582)
{code}
{code:java}
"main" #1 prio=5 os_prio=0 tid=0x00007f9c00060000 nid=0xea3 waiting on condition [0x00007f9c06404000]
   java.lang.Thread.State: WAITING (parking)
	at sun.misc.Unsafe.park(Native Method)
	- parking to wait for  <0x00000000c1b383c8> (a java.util.concurrent.Semaphore$NonfairSync)
	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:999)
	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1306)
	at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
	at org.apache.zookeeper.ZKUtil.deleteInBatch(ZKUtil.java:122)
	at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:64)
	at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:76)
	at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:386)
	at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:383)
	at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1103)
	at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1095)
	at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:383)
	- locked <0x00000000c17986c0> (a org.apache.hadoop.ha.ActiveStandbyElector)
	at org.apache.hadoop.ha.ZKFailoverController.formatZK(ZKFailoverController.java:290)
	at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:227)
	at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:66)
	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:186)
	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:182)
	at java.security.AccessController.doPrivileged(Native Method)
	at javax.security.auth.Subject.doAs(Subject.java:360)
	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1741)
	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:498)
	at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:182)
	at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:220)
{code}
org.apache.hadoop.ha.ActiveStandbyElector#clearParentZNode is synchronized on the ActiveStandbyElector instance and calls ZKUtil.deleteRecursive(zk, pathRoot).

ZKUtil.deleteRecursive is an async API call, and in its callback it ends up invoking ActiveStandbyElector#processWatchEvent, which is synchronized on the ActiveStandbyElector instance.

So there is a deadlock: clearParentZNode() is waiting for processWatchEvent() to complete, and processWatchEvent() is waiting for clearParentZNode() to complete.

 

*Why did this problem not happen with earlier versions (3.5.x)?*

In earlier ZK versions, ZKUtil.deleteRecursive used the sync ZK API internally, so no callback (processWatchEvent) came into the picture.


*Proposed Fix:*
There are two approaches to fix this problem.
1. Fix the problem in HDFS: modify the HDFS code to avoid the deadlock. But we may hit similar bugs in other projects.
2. Fix the problem in ZK: make the existing API behave as before (use the sync API to delete the ZK nodes) and provide a new overloaded API with the new behavior (use the async API to delete the ZK nodes).

I propose to fix the problem with the 2nd approach.


> ZKUtil#deleteRecursive causing deadlock in HDFS HA functionality
> ----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-4504
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-4504
>             Project: ZooKeeper
>          Issue Type: Bug
>            Reporter: Mohammad Arshad
>            Assignee: Mohammad Arshad
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 3.7.1, 3.6.4, 3.9.0, 3.8.1
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> *Problem and Analysis:*
> After integrating ZooKeeper 3.6.3, we observed a deadlock in the HDFS HA functionality, as shown in the thread dumps below.
> {code:java}
> "main-EventThread" #33 daemon prio=5 os_prio=0 tid=0x00007f9c017f1000 nid=0x101b waiting for monitor entry [0x00007f9bda8a6000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
> 	at org.apache.hadoop.ha.ActiveStandbyElector.processWatchEvent(ActiveStandbyElector.java:603)
> 	- waiting to lock <0x00000000c17986c0> (a org.apache.hadoop.ha.ActiveStandbyElector)
> 	at org.apache.hadoop.ha.ActiveStandbyElector$WatcherWithClientRef.process(ActiveStandbyElector.java:1193)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:626)
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:582)
> {code}
> {code:java}
> "main" #1 prio=5 os_prio=0 tid=0x00007f9c00060000 nid=0xea3 waiting on condition [0x00007f9c06404000]
>    java.lang.Thread.State: WAITING (parking)
> 	at sun.misc.Unsafe.park(Native Method)
> 	- parking to wait for  <0x00000000c1b383c8> (a java.util.concurrent.Semaphore$NonfairSync)
> 	at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:838)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:999)
> 	at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1306)
> 	at java.util.concurrent.Semaphore.acquire(Semaphore.java:467)
> 	at org.apache.zookeeper.ZKUtil.deleteInBatch(ZKUtil.java:122)
> 	at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:64)
> 	at org.apache.zookeeper.ZKUtil.deleteRecursive(ZKUtil.java:76)
> 	at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:386)
> 	at org.apache.hadoop.ha.ActiveStandbyElector$1.run(ActiveStandbyElector.java:383)
> 	at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1103)
> 	at org.apache.hadoop.ha.ActiveStandbyElector.zkDoWithRetries(ActiveStandbyElector.java:1095)
> 	at org.apache.hadoop.ha.ActiveStandbyElector.clearParentZNode(ActiveStandbyElector.java:383)
> 	- locked <0x00000000c17986c0> (a org.apache.hadoop.ha.ActiveStandbyElector)
> 	at org.apache.hadoop.ha.ZKFailoverController.formatZK(ZKFailoverController.java:290)
> 	at org.apache.hadoop.ha.ZKFailoverController.doRun(ZKFailoverController.java:227)
> 	at org.apache.hadoop.ha.ZKFailoverController.access$000(ZKFailoverController.java:66)
> 	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:186)
> 	at org.apache.hadoop.ha.ZKFailoverController$1.run(ZKFailoverController.java:182)
> 	at java.security.AccessController.doPrivileged(Native Method)
> 	at javax.security.auth.Subject.doAs(Subject.java:360)
> 	at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1741)
> 	at org.apache.hadoop.security.SecurityUtil.doAsLoginUserOrFatal(SecurityUtil.java:498)
> 	at org.apache.hadoop.ha.ZKFailoverController.run(ZKFailoverController.java:182)
> 	at org.apache.hadoop.hdfs.tools.DFSZKFailoverController.main(DFSZKFailoverController.java:220)
> {code}
> org.apache.hadoop.ha.ActiveStandbyElector#clearParentZNode is synchronized on the ActiveStandbyElector instance and calls ZKUtil.deleteRecursive(zk, pathRoot).
> ZKUtil.deleteRecursive now makes an async delete API call, with a MultiCallback as its callback.
> Since processWatchEvent is being invoked, a watcher must have been set on pathRoot or one of its child paths, producing a delete notification.
> When the delete API is called, the watch notification reaches the client before the actual delete response.
> Both the notification and the delete response are processed through callbacks, one by one, via the common waitingEvents queue.
> The notification is processed first, but it cannot complete: it cannot acquire the lock guarding processWatchEvent(), because that lock is already held by the thread executing clearParentZNode().
> Since the delete notification cannot be processed, the MultiCallback is never taken from the queue for processing; it stays in the queue forever, and clearParentZNode() waits forever on the semaphore.
>  
> *Why did this problem not happen with earlier versions (3.5.x)?*
> In earlier ZK versions, ZKUtil.deleteRecursive used the sync delete API, so the delete response was processed directly, not through the callback.
> As a result, clearParentZNode and processWatchEvent completed independently.
> *Proposed Fix:*
> There are two approaches to fix this problem.
> 1. Fix the problem in HDFS: modify the HDFS code to avoid the deadlock. But we may hit similar bugs in other projects.
> 2. Fix the problem in ZK: make the existing API behave as before (use the sync API to delete the ZK nodes) and provide a new overloaded API with the new behavior (use the async API to delete the ZK nodes).
> I propose to fix the problem with the 2nd approach.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)