You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by GitBox <gi...@apache.org> on 2022/06/21 12:33:25 UTC

[GitHub] [ozone] siddhantsangwan opened a new pull request, #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

siddhantsangwan opened a new pull request, #3535:
URL: https://github.com/apache/ozone/pull/3535

   ## What changes were proposed in this pull request?
   
   ContainerBalancer has `balancingThread.join()` being called in ContainerBalancer#stopBalancingThread. Callers of this method acquire but don't release the only lock in this class when calling this method. If at this time another thread is trying to acquire the lock, we have a deadlock.
   
   For example, SCMClientProtocolServer#stopContainerBalancer() will lead to the calling thread wait for the balancing thread to join in ContainerBalancer#stopBalancingThread. If the balancing thread now checks for `isBalancerRunning()` in ContainerBalancer#balance, the two threads will get into a deadlock. The balancing thread is disabled and waiting to acquire the lock, while the other thread is waiting for balancing thread to finish.
   
   Changes: Release lock in callers of ContainerBalancer#stopBalancingThread before this method is called. Remove locking in `isBalancerRunning()`.
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-6928
   
   ## How was this patch tested?
   
   A basic UT that starts and then immediately stops balancer. In the existing code, this leads to a deadlock.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on PR #3535:
URL: https://github.com/apache/ozone/pull/3535#issuecomment-1161698181

   @lokeshj1703 @JacksonYao287 @symious @rakeshadr Can you please review?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on PR #3535:
URL: https://github.com/apache/ozone/pull/3535#issuecomment-1169853874

   @lokeshj1703 Thanks for the review. I've merged this PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] lokeshj1703 commented on a diff in pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on code in PR #3535:
URL: https://github.com/apache/ozone/pull/3535#discussion_r909377859


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -1066,6 +1062,12 @@ public void stop() {
             balancingThread.getId() != Thread.currentThread().getId()) {
       balancingThread.interrupt();
       try {
+        if (lock.isHeldByCurrentThread()) {
+          LOG.warn("Waiting for balancing thread \"{}\" to join while current" +
+                  " thread holds lock. This could result in a deadlock if" +
+                  " balancing thread is also waiting to acquire the lock.",
+              balancingThread.getName());

Review Comment:
   Also print the current thread name.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] lokeshj1703 commented on a diff in pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on code in PR #3535:
URL: https://github.com/apache/ozone/pull/3535#discussion_r908222198


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -1032,16 +1048,30 @@ private void validateState(boolean expectedRunning)
    */
   @Override
   public void stop() {
+    Thread balancingThread;
     lock.lock();
     try {
       if (!isBalancerRunning()) {
         LOG.warn("Cannot stop Container Balancer because it's not running");
         return;
       }
-      stopBalancingThread();
+      balancingThread = currentBalancingThread;
+      currentBalancingThread = null;
     } finally {
       lock.unlock();
     }
+
+    // wait for balancingThread to die
+    if (balancingThread != null &&

Review Comment:
   Can we add a preconditions check here that lock is not held by the current thread? There is an api for it.



##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -839,20 +850,34 @@ private void resetState() {
   @Override
   public void notifyStatusChanged() {
     if (!scmContext.isLeader() || scmContext.isInSafeMode()) {
-      if (isBalancerRunning()) {
+      boolean shouldStop = false;
+
+      // lock here to ensure no change is made to the balancing thread while
+      // we're reading it
+      lock.lock();
+      try {
+        shouldStop = isBalancerRunning();

Review Comment:
   I would suggest merging the two lock sections and afterwards stop can be called.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on code in PR #3535:
URL: https://github.com/apache/ozone/pull/3535#discussion_r909434267


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -1066,6 +1062,12 @@ public void stop() {
             balancingThread.getId() != Thread.currentThread().getId()) {
       balancingThread.interrupt();
       try {
+        if (lock.isHeldByCurrentThread()) {
+          LOG.warn("Waiting for balancing thread \"{}\" to join while current" +
+                  " thread holds lock. This could result in a deadlock if" +
+                  " balancing thread is also waiting to acquire the lock.",
+              balancingThread.getName());

Review Comment:
   Done!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on PR #3535:
URL: https://github.com/apache/ozone/pull/3535#issuecomment-1165286957

   > @siddhantsangwan We are calling balancingThread.interrupt right now in stopBalancingThread. I think we should use lockInterruptibly api here.
   
   @lokeshj1703 Do you mean using `lockInterruptibly` in addition to the current changes?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] lokeshj1703 commented on pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on PR #3535:
URL: https://github.com/apache/ozone/pull/3535#issuecomment-1165288286

   @siddhantsangwan I think using lockInterruptibly alone should solve the deadlock issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on PR #3535:
URL: https://github.com/apache/ozone/pull/3535#issuecomment-1165332059

   > @siddhantsangwan I think using lockInterruptibly alone should solve the deadlock issue.
   
   We should probably take the `thread.join` call out of the lock anyway? If someone in the future uses `lock.lock()` instead of interruptibly, we could have a deadlock again.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on PR #3535:
URL: https://github.com/apache/ozone/pull/3535#issuecomment-1169592680

   @lokeshj1703 I've updated the PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] lokeshj1703 commented on a diff in pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on code in PR #3535:
URL: https://github.com/apache/ozone/pull/3535#discussion_r909300104


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -1066,6 +1062,12 @@ public void stop() {
             balancingThread.getId() != Thread.currentThread().getId()) {
       balancingThread.interrupt();
       try {
+        if (lock.isHeldByCurrentThread()) {
+          LOG.warn("Waiting for balancing thread \"{}\" to join while current" +
+                  " thread holds lock. This could result in a deadlock if" +
+                  " balancing thread is also waiting to acquire the lock.",
+              balancingThread.getName());

Review Comment:
   Can we also print the stacktrace here?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on code in PR #3535:
URL: https://github.com/apache/ozone/pull/3535#discussion_r909440236


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -1066,6 +1062,12 @@ public void stop() {
             balancingThread.getId() != Thread.currentThread().getId()) {
       balancingThread.interrupt();
       try {
+        if (lock.isHeldByCurrentThread()) {
+          LOG.warn("Waiting for balancing thread \"{}\" to join while current" +
+                  " thread holds lock. This could result in a deadlock if" +
+                  " balancing thread is also waiting to acquire the lock.",
+              balancingThread.getName());

Review Comment:
   This is what the log looks like on introducing a lock just for testing:
   ```
   2022-06-29 15:25:49,096 [main] WARN  balancer.ContainerBalancer (ContainerBalancer.java:stop(1072)) - Waiting for balancing thread "ContainerBalancer" to join while current thread "main" holds lock. This could result in a deadlock if balancing thread is also waiting to acquire the lock. 
   [java.lang.Thread.getStackTrace(Thread.java:1559)
    org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancer.stop(ContainerBalancer.java:1070)
    org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancer.stopBalancer(ContainerBalancer.java:1103)
    org.apache.hadoop.hdds.scm.container.balancer.TestContainerBalancer.stopBalancer(TestContainerBalancer.java:942)
    org.apache.hadoop.hdds.scm.container.balancer.TestContainerBalancer.testStartAndImmediateStopForDeadlock(TestContainerBalancer.java:747)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan merged pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
siddhantsangwan merged PR #3535:
URL: https://github.com/apache/ozone/pull/3535


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] lokeshj1703 commented on pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
lokeshj1703 commented on PR #3535:
URL: https://github.com/apache/ozone/pull/3535#issuecomment-1165380123

   > We should probably take the `thread.join` call out of the lock anyway? If someone in the future uses `lock.lock()` instead of interruptibly, we could have a deadlock again.
   
   Yeah. Makes sense. Would consistency still be guaranteed for functions and callers of notifyStatusChanged and isBalancerRunning?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on PR #3535:
URL: https://github.com/apache/ozone/pull/3535#issuecomment-1167244527

   @lokeshj1703 thanks for reviewing! I've added locking in `notifyStatusChanged` after considering the following scenario:
   
   The current SCM is the leader, and `startBalancer` is called. The lock will be acquired, `startBalancer` will validate state, persist configuration and call `startBalancingThread`. 
   
   ```
     public void startBalancer(ContainerBalancerConfiguration configuration)
         throws IllegalContainerBalancerStateException,
         InvalidContainerBalancerConfigurationException, IOException {
       lock.lock();
       try {
         // validates state, config, and then saves config
         setBalancerConfigOnStartBalancer(configuration);
         startBalancingThread();
       } finally {
         lock.unlock();
       }
     }
   
   ```
   
   Suppose at this instant there's a leader change and `notifyStatusChange` is called from another thread. Container Balancer cannot run in this SCM since it's a follower now.
   
   ```
    public void notifyStatusChanged() {
       if (!scmContext.isLeader() || scmContext.isInSafeMode()) {
         if (isBalancerRunning()) {
           LOG.info("Stopping ContainerBalancer in this scm on status change");
           stop();
         }
       } else {
         if (shouldRun()) {
           try {
             LOG.info("Starting ContainerBalancer in this scm on status change");
             start();
           } catch (IllegalContainerBalancerStateException |
                   InvalidContainerBalancerConfigurationException e) {
             LOG.warn("Could not start ContainerBalancer on raft/safe-mode " +
                     "status change.", e);
           }
         }
       }
     }
   ```
   
   The first `if` condition `if (!scmContext.isLeader() || scmContext.isInSafeMode())` will evaluate to `true`, but `if (isBalancerRunning())` will be false because the former thread has still not started the balancing thread. So, `notifyStatusChanged` will not take any action. Now, `startBalancingThread` will end up starting balancer in this SCM, which is incorrect. The latest change will prevent this scenario from happening.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


[GitHub] [ozone] siddhantsangwan commented on a diff in pull request #3535: HDDS-6928. ozone container balancer CLI went in hung state due to deadlock

Posted by GitBox <gi...@apache.org>.
siddhantsangwan commented on code in PR #3535:
URL: https://github.com/apache/ozone/pull/3535#discussion_r909440236


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/container/balancer/ContainerBalancer.java:
##########
@@ -1066,6 +1062,12 @@ public void stop() {
             balancingThread.getId() != Thread.currentThread().getId()) {
       balancingThread.interrupt();
       try {
+        if (lock.isHeldByCurrentThread()) {
+          LOG.warn("Waiting for balancing thread \"{}\" to join while current" +
+                  " thread holds lock. This could result in a deadlock if" +
+                  " balancing thread is also waiting to acquire the lock.",
+              balancingThread.getName());

Review Comment:
   Just to test what the logs would look like, I introduced a lock in `stop` (and then removed it). The warning looks like this:
   ```
   2022-06-29 15:25:49,096 [main] WARN  balancer.ContainerBalancer (ContainerBalancer.java:stop(1072)) - Waiting for balancing thread "ContainerBalancer" to join while current thread "main" holds lock. This could result in a deadlock if balancing thread is also waiting to acquire the lock. 
   [java.lang.Thread.getStackTrace(Thread.java:1559)
    org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancer.stop(ContainerBalancer.java:1070)
    org.apache.hadoop.hdds.scm.container.balancer.ContainerBalancer.stopBalancer(ContainerBalancer.java:1103)
    org.apache.hadoop.hdds.scm.container.balancer.TestContainerBalancer.stopBalancer(TestContainerBalancer.java:942)
    org.apache.hadoop.hdds.scm.container.balancer.TestContainerBalancer.testStartAndImmediateStopForDeadlock(TestContainerBalancer.java:747)
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org