You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "aswinshakil (via GitHub)" <gi...@apache.org> on 2023/12/08 20:53:55 UTC

[PR] HDDS-9871. [Snapshot] Fix Deadlock in SnapshotCache. [ozone]

aswinshakil opened a new pull request, #5751:
URL: https://github.com/apache/ozone/pull/5751

   ## What changes were proposed in this pull request?
   When accessing the snapshot cache, Multiple threads fall into a deadlock between ConcurrentHashMap(`dbMap`) and SynchronizedSet(`pendingEvictionList`).
   
   ### Scenario: 
   
   1. `dbMap.compute()` holds the ConcurrentHashMap lock while trying to hold the lock for `pendingEvictionList` Set.
   2. `cleanup()` method holds the lock for `pendingEvictionList` Set while trying to hold the lock for the ConcurrentHashMap (`dbMap`).
   
   ```
   "snapshot-diff-job-thread-id-8":
     waiting to lock monitor 0x00007fa7964788d8 (object 0x00000006d658eed0, a java.util.Collections$SynchronizedSet),
     which is held by "KeyDeletingService#0"
   "KeyDeletingService#0":
     waiting to lock monitor 0x0000000000e99e68 (object 0x00000006db6bcba8, a java.util.concurrent.ConcurrentHashMap$Node),
     which is held by "SstFilteringService#0"
   "SstFilteringService#0":
     waiting to lock monitor 0x00007fa7964788d8 (object 0x00000006d658eed0, a java.util.Collections$SynchronizedSet),
     which is held by "KeyDeletingService#0"
   ```
   One way to avoid this is to follow lock ordering. `dbMap.compute()` is removed so that we never get lock in the 1st order. 
   
   ## What is the link to the Apache JIRA
   
   https://issues.apache.org/jira/browse/HDDS-9871
   
   ## How was this patch tested?
   
   Existing Test. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9871. [Snapshot] Fix Deadlock in SnapshotCache. [ozone]

Posted by "smengcl (via GitHub)" <gi...@apache.org>.
smengcl commented on code in PR #5751:
URL: https://github.com/apache/ozone/pull/5751#discussion_r1425776306


##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java:
##########
@@ -230,21 +231,19 @@ public ReferenceCounted<IOmMetadataReader, SnapshotCache> get(String key,
    * @param key snapshot table key
    */
   public void release(String key) {
-    dbMap.compute(key, (k, v) -> {
-      if (v == null) {
-        throw new IllegalArgumentException(
-            "Key '" + key + "' does not exist in cache");
-      }
+    ReferenceCounted<IOmMetadataReader, SnapshotCache>
+        rcOmSnapshot = dbMap.get(key);
+    if (rcOmSnapshot == null) {
+      throw new IllegalArgumentException(
+          "Key '" + key + "' does not exist in cache");
+    }
 
-      if (v.decrementRefCount() == 0L) {
-        synchronized (pendingEvictionList) {
-          // v is eligible to be evicted and closed
-          pendingEvictionList.add(v);
-        }
+    if (rcOmSnapshot.decrementRefCount() == 0L) {
+      synchronized (pendingEvictionList) {
+        // v is eligible to be evicted and closed
+        pendingEvictionList.add(rcOmSnapshot);
       }
-
-      return v;
-    });
+    }

Review Comment:
   @aswinshakil Indeed `SnapshotCache#release` is not called directly right now and is only kept to conform to Guava cache interface (and just in case we would need it). Ref count is currently decremented by `ReferenceCounted#close` because we encourage using try-with-resource on snapshot instances.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9871. [Snapshot] Fix Deadlock in SnapshotCache. [ozone]

Posted by "smengcl (via GitHub)" <gi...@apache.org>.
smengcl commented on code in PR #5751:
URL: https://github.com/apache/ozone/pull/5751#discussion_r1425776306


##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java:
##########
@@ -230,21 +231,19 @@ public ReferenceCounted<IOmMetadataReader, SnapshotCache> get(String key,
    * @param key snapshot table key
    */
   public void release(String key) {
-    dbMap.compute(key, (k, v) -> {
-      if (v == null) {
-        throw new IllegalArgumentException(
-            "Key '" + key + "' does not exist in cache");
-      }
+    ReferenceCounted<IOmMetadataReader, SnapshotCache>
+        rcOmSnapshot = dbMap.get(key);
+    if (rcOmSnapshot == null) {
+      throw new IllegalArgumentException(
+          "Key '" + key + "' does not exist in cache");
+    }
 
-      if (v.decrementRefCount() == 0L) {
-        synchronized (pendingEvictionList) {
-          // v is eligible to be evicted and closed
-          pendingEvictionList.add(v);
-        }
+    if (rcOmSnapshot.decrementRefCount() == 0L) {
+      synchronized (pendingEvictionList) {
+        // v is eligible to be evicted and closed
+        pendingEvictionList.add(rcOmSnapshot);
       }
-
-      return v;
-    });
+    }

Review Comment:
   @aswinshakil Indeed `SnapshotCache#release` is not called directly right now and is only kept to conform to Guava cache convention (and just in case we would need it). Ref count is currently decremented by `ReferenceCounted#close` because we encourage using try-with-resource on snapshot instances.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9871. [Snapshot] Fix Deadlock in SnapshotCache. [ozone]

Posted by "smengcl (via GitHub)" <gi...@apache.org>.
smengcl commented on code in PR #5751:
URL: https://github.com/apache/ozone/pull/5751#discussion_r1423765648


##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java:
##########
@@ -230,21 +231,19 @@ public ReferenceCounted<IOmMetadataReader, SnapshotCache> get(String key,
    * @param key snapshot table key
    */
   public void release(String key) {
-    dbMap.compute(key, (k, v) -> {
-      if (v == null) {
-        throw new IllegalArgumentException(
-            "Key '" + key + "' does not exist in cache");
-      }
+    ReferenceCounted<IOmMetadataReader, SnapshotCache>
+        rcOmSnapshot = dbMap.get(key);
+    if (rcOmSnapshot == null) {
+      throw new IllegalArgumentException(
+          "Key '" + key + "' does not exist in cache");
+    }
 
-      if (v.decrementRefCount() == 0L) {
-        synchronized (pendingEvictionList) {
-          // v is eligible to be evicted and closed
-          pendingEvictionList.add(v);
-        }
+    if (rcOmSnapshot.decrementRefCount() == 0L) {
+      synchronized (pendingEvictionList) {
+        // v is eligible to be evicted and closed
+        pendingEvictionList.add(rcOmSnapshot);
       }
-
-      return v;
-    });
+    }

Review Comment:
   I figure I would outline two possible race condition scenarios without the `dbMap` lock here. Though I believe both are already protected by my existing implementation.
   
   
   Case 1:
   
   Since `pendingEvictionList` is a `Set`, it should be fine if multiple threads are calling `release(key)` as the same time while some of those threads observed that `rcOmSnapshot`'s ref count reached `0` at the same time. So the logic here would still be correct in this case.
   
   
   Case 2: Multiple threads interleaving `get()` and `release()`
   
   1. Thread 1 executes `get(key)` right past `rcOmSnapshot.incrementRefCount()` (after the change in this PR). ref count is 1 at this point.
   2. Thread 2 executes `release(key)` past `rcOmSnapshot.decrementRefCount()`. ref count is 0 at this point. (*)
   3. Thread 2 continues to execute `pendingEvictionList.add(rcOmSnapshot)`, pendingEvictionList now has the snapshot, if `cacheSizeLimit` is exceeded, `cleanup()` will pick up the snapshot immediately.
   4. Thread 1 executes `pendingEvictionList.remove(rcOmSnapshot)`. snapshot is (supposed to be) removed from eviction list while the snapshot is already removed from `dbMap` (cleaned up) by `cleanup()`.
   
   (*) While actually Step 2 in Case 2 above would be prevented because `decrementRefCount()` throws if a thread tries to `release()` a snapshot it had not `get()` before, as a safety mechanism. And this safety, together with atomic ref count should also prevent Case 1 from happening in the first place (for which multiple threads would observe that ref count reached `0` at the same time).
   
   
   `cleanup()`, as of this time, is only called at the end of each `get()` and `release()`. As it doesn't run in a seperate thread it should not pose extra correctness issue at least for now.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9871. [Snapshot] Fix Deadlock in SnapshotCache. [ozone]

Posted by "aswinshakil (via GitHub)" <gi...@apache.org>.
aswinshakil commented on PR #5751:
URL: https://github.com/apache/ozone/pull/5751#issuecomment-1849167615

   In `cleanup()`, We iterate over `pendingEvictionList` to remove entries from `dbMap` with `dbMap.remove`(Take a lock). We cannot change that order. I don't think we need that level of atomicity to increase and decrease reference count while holding `dbMap` lock. Please let me know what are your thoughts on this? Thanks! @smengcl 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9871. [Snapshot] Fix Deadlock in SnapshotCache. [ozone]

Posted by "smengcl (via GitHub)" <gi...@apache.org>.
smengcl commented on code in PR #5751:
URL: https://github.com/apache/ozone/pull/5751#discussion_r1425776306


##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java:
##########
@@ -230,21 +231,19 @@ public ReferenceCounted<IOmMetadataReader, SnapshotCache> get(String key,
    * @param key snapshot table key
    */
   public void release(String key) {
-    dbMap.compute(key, (k, v) -> {
-      if (v == null) {
-        throw new IllegalArgumentException(
-            "Key '" + key + "' does not exist in cache");
-      }
+    ReferenceCounted<IOmMetadataReader, SnapshotCache>
+        rcOmSnapshot = dbMap.get(key);
+    if (rcOmSnapshot == null) {
+      throw new IllegalArgumentException(
+          "Key '" + key + "' does not exist in cache");
+    }
 
-      if (v.decrementRefCount() == 0L) {
-        synchronized (pendingEvictionList) {
-          // v is eligible to be evicted and closed
-          pendingEvictionList.add(v);
-        }
+    if (rcOmSnapshot.decrementRefCount() == 0L) {
+      synchronized (pendingEvictionList) {
+        // v is eligible to be evicted and closed
+        pendingEvictionList.add(rcOmSnapshot);
       }
-
-      return v;
-    });
+    }

Review Comment:
   @aswinshakil Indeed `SnapshotCache#release` is not called directly right now and is only kept to conform to Guava cache convention (and just in case we would need it). Ref count is currently decremented by `ReferenceCounted#close`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9871. [Snapshot] Fix Deadlock in SnapshotCache. [ozone]

Posted by "aswinshakil (via GitHub)" <gi...@apache.org>.
aswinshakil commented on code in PR #5751:
URL: https://github.com/apache/ozone/pull/5751#discussion_r1424334482


##########
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/snapshot/SnapshotCache.java:
##########
@@ -230,21 +231,19 @@ public ReferenceCounted<IOmMetadataReader, SnapshotCache> get(String key,
    * @param key snapshot table key
    */
   public void release(String key) {
-    dbMap.compute(key, (k, v) -> {
-      if (v == null) {
-        throw new IllegalArgumentException(
-            "Key '" + key + "' does not exist in cache");
-      }
+    ReferenceCounted<IOmMetadataReader, SnapshotCache>
+        rcOmSnapshot = dbMap.get(key);
+    if (rcOmSnapshot == null) {
+      throw new IllegalArgumentException(
+          "Key '" + key + "' does not exist in cache");
+    }
 
-      if (v.decrementRefCount() == 0L) {
-        synchronized (pendingEvictionList) {
-          // v is eligible to be evicted and closed
-          pendingEvictionList.add(v);
-        }
+    if (rcOmSnapshot.decrementRefCount() == 0L) {
+      synchronized (pendingEvictionList) {
+        // v is eligible to be evicted and closed
+        pendingEvictionList.add(rcOmSnapshot);
       }
-
-      return v;
-    });
+    }

Review Comment:
   @smengcl From the code I don't see `release()` being used anywhere. How do we decrement reference count if we don't use `release()`. The only other place we decrement reference count is in `get()`



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9871. [Snapshot] Fix Deadlock in SnapshotCache. [ozone]

Posted by "smengcl (via GitHub)" <gi...@apache.org>.
smengcl merged PR #5751:
URL: https://github.com/apache/ozone/pull/5751


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9871. [Snapshot] Fix Deadlock in SnapshotCache. [ozone]

Posted by "smengcl (via GitHub)" <gi...@apache.org>.
smengcl commented on PR #5751:
URL: https://github.com/apache/ozone/pull/5751#issuecomment-1848248175

   Good catch @aswinshakil .
   
   Is it possible to fix this by adjusting the lock order in `cleanup()`? So that `dbMap` object lock is acquired first?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org


Re: [PR] HDDS-9871. [Snapshot] Fix Deadlock in SnapshotCache. [ozone]

Posted by "smengcl (via GitHub)" <gi...@apache.org>.
smengcl commented on PR #5751:
URL: https://github.com/apache/ozone/pull/5751#issuecomment-1851729040

   Thanks @aswinshakil for the patch. Thanks @hemantk-12 for the review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org