You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "Aswin Shakil (Jira)" <ji...@apache.org> on 2023/12/08 20:30:00 UTC

[jira] [Commented] (HDDS-9871) [snapshot] Snapshot diff request failing when setting ozone.om.snapshot.db.max.open.files=-1

    [ https://issues.apache.org/jira/browse/HDDS-9871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17794851#comment-17794851 ] 

Aswin Shakil commented on HDDS-9871:
------------------------------------

I looked into the LEADER OM node. There is a deadlock between {{SSTFilteringService}} and {{KeyDeletingService}}. This can also happen between any part of the code that accesses the snapshot cache. 
{code:java|title=KeyDeletingService}
java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1117)
- waiting to lock <0x00000006db6bcba8> (a java.util.concurrent.ConcurrentHashMap$Node)
at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1097)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:312)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:286)
- locked <0x00000006d658eed0> (a java.util.Collections$SynchronizedSet)
- locked <0x00000006d6fabf38> (a org.apache.hadoop.ozone.om.snapshot.SnapshotCache)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:219)
at org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getLatestActiveSnapshot(OmMetadataManagerImpl.java:1691)
at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getPendingDeletionKeys(OmMetadataManagerImpl.java:1544)
at org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:607)
at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:183)
{code}
{code:java|title=SSTFilteringService}
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.callback(SnapshotCache.java:266)
- waiting to lock <0x00000006d658eed0> (a java.util.Collections$SynchronizedSet)
at org.apache.hadoop.ozone.om.snapshot.ReferenceCounted.incrementRefCount(ReferenceCounted.java:102)
- locked <0x00000006d6b36580> (a java.lang.Object)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:185)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache$$Lambda$957/821305613.apply(Unknown Source)
at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1892)
- locked <0x00000006db6bcba8> (a java.util.concurrent.ConcurrentHashMap$Node)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:165)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:151)
at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185)
{code}
{code:java|title=Snapshot Diff Threads}
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.callback(SnapshotCache.java:266)
- waiting to lock <0x00000006d658eed0> (a java.util.Collections$SynchronizedSet)
at org.apache.hadoop.ozone.om.snapshot.ReferenceCounted.incrementRefCount(ReferenceCounted.java:102)
- locked <0x00000006db699550> (a java.lang.Object)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:185)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache$$Lambda$957/821305613.apply(Unknown Source)
at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
- locked <0x00000006db6995d8> (a java.util.concurrent.ConcurrentHashMap$Node)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:165)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:151)

java.lang.Thread.State: BLOCKED (on object monitor)
at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1868)
- waiting to lock <0x00000006db41c1f8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:165)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:151)
at org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.generateSnapshotDiffReport(SnapshotDiffManager.java:840)

java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.callback(SnapshotCache.java:266)
- waiting to lock <0x00000006d658eed0> (a java.util.Collections$SynchronizedSet)
at org.apache.hadoop.ozone.om.snapshot.ReferenceCounted.incrementRefCount(ReferenceCounted.java:102)
- locked <0x00000006db699568> (a java.lang.Object)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:185)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache$$Lambda$957/821305613.apply(Unknown Source)
at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1877)
- locked <0x00000006db699638> (a java.util.concurrent.ConcurrentHashMap$Node)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:165)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:151)
at org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.generateSnapshotDiffReport(SnapshotDiffManager.java:840)

stackTrace:
java.lang.Thread.State: BLOCKED (on object monitor)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.callback(SnapshotCache.java:266)
- waiting to lock <0x00000006d658eed0> (a java.util.Collections$SynchronizedSet)
at org.apache.hadoop.ozone.om.snapshot.ReferenceCounted.incrementRefCount(ReferenceCounted.java:102)
- locked <0x00000006db41e440> (a java.lang.Object)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$1(SnapshotCache.java:185)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache$$Lambda$957/821305613.apply(Unknown Source)
at java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1853)
- locked <0x00000006db41c1f8> (a java.util.concurrent.ConcurrentHashMap$ReservationNode)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:165)
at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:151)
at org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.generateSnapshotDiffReport(SnapshotDiffManager.java:840)
{code}
All these Diff threads are blocked because of this. I believe this is the root cause of this issue. The solution is not to hold locks from both ConcurrentHashMap and Synchronized Set or have an ordering of locks. 

> [snapshot] Snapshot diff request failing when setting ozone.om.snapshot.db.max.open.files=-1
> --------------------------------------------------------------------------------------------
>
>                 Key: HDDS-9871
>                 URL: https://issues.apache.org/jira/browse/HDDS-9871
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Snapshot
>            Reporter: Jyotirmoy Sinha
>            Assignee: Aswin Shakil
>            Priority: Major
>              Labels: ozone-snapshot
>
> Snapshot diff request failing when setting ozone.om.snapshot.db.max.open.files=-1, due to double flush buffer issue.
> Snapshot creation request:
> 2023-11-27 00:40:23,345 INFO [OM StateMachine ApplyTransaction Thread - 0]-org.apache.hadoop.ozone.om.request.snapshot.OMSnapshotCreateRequest: Created snapshot: 'snap-ay36z' with snapshotId: 'bf0c6141-4185-4361-b15f-c4aa71c5c6d8' under path 'vol-2xd36/buck-id806'
> Double Buffer flush logs:
> ...
> 2023-11-27 00:10:23,826 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager: Created checkpoint in rocksDB at /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-b2e9acb3-fee2-4190-8272-0649edca8d93 in 30 milliseconds
> 2023-11-27 00:10:23,827 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Waited for 1 milliseconds for checkpoint directory /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-b2e9acb3-fee2-4190-8272-0649edca8d93 availability.
> 2023-11-27 00:10:23,828 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager: Created checkpoint : /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-b2e9acb3-fee2-4190-8272-0649edca8d93 for snapshot snap-mswq9
> 2023-11-27 00:10:39,586 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager: Created checkpoint in rocksDB at /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3369ac3a-61e1-4eca-b3cf-eb2de0b2d688 in 30 milliseconds
> 2023-11-27 00:10:39,586 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Waited for 0 milliseconds for checkpoint directory /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3369ac3a-61e1-4eca-b3cf-eb2de0b2d688 availability.
> 2023-11-27 00:10:39,587 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager: Created checkpoint : /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3369ac3a-61e1-4eca-b3cf-eb2de0b2d688 for snapshot snap-f5u3t
> 2023-11-27 00:10:55,949 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager: Created checkpoint in rocksDB at /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3a690c8f-f3ef-415d-b25c-3aaf763c9507 in 22 milliseconds
> 2023-11-27 00:10:55,950 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Waited for 1 milliseconds for checkpoint directory /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3a690c8f-f3ef-415d-b25c-3aaf763c9507 availability.
> 2023-11-27 00:10:55,950 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager: Created checkpoint : /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-3a690c8f-f3ef-415d-b25c-3aaf763c9507 for snapshot snap-jfktn
> 2023-11-29 08:52:24,698 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager: Created checkpoint in rocksDB at /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-c3ba17ef-d947-454e-9c4f-b9063ae65650 in 15 milliseconds
> 2023-11-29 08:52:24,715 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Waited for 16 milliseconds for checkpoint directory /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-c3ba17ef-d947-454e-9c4f-b9063ae65650 availability.
> 2023-11-29 08:52:24,717 WARN [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager: Took 614733 ns to find endKey. Caller is deleteKeysFromDelKeyTableInSnapshotScope
> 2023-11-29 08:52:24,718 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager: Created checkpoint : /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-c3ba17ef-d947-454e-9c4f-b9063ae65650 for snapshot snap-ay36z
> 2023-11-29 08:52:24,745 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointManager: Created checkpoint in rocksDB at /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-bf0c6141-4185-4361-b15f-c4aa71c5c6d8 in 12 milliseconds
> 2023-11-29 08:52:24,746 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.hdds.utils.db.RDBCheckpointUtils: Waited for 0 milliseconds for checkpoint directory /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-bf0c6141-4185-4361-b15f-c4aa71c5c6d8 availability.
> 2023-11-29 08:52:24,747 INFO [OMDoubleBufferFlushThread]-org.apache.hadoop.ozone.om.OmSnapshotManager: Created checkpoint : /var/lib/hadoop-ozone/om/data035525/db.snapshots/checkpointState/om.db-bf0c6141-4185-4361-b15f-c4aa71c5c6d8 for snapshot snap-ay36z
> ...



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org