You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ozone.apache.org by "Wei-Chiu Chuang (Jira)" <ji...@apache.org> on 2024/01/22 17:08:00 UTC

[jira] [Resolved] (HDDS-9965) [Snapshot] SnapshotCache is in inconsistent state

     [ https://issues.apache.org/jira/browse/HDDS-9965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wei-Chiu Chuang resolved HDDS-9965.
-----------------------------------
    Resolution: Duplicate

> [Snapshot] SnapshotCache is in inconsistent state
> -------------------------------------------------
>
>                 Key: HDDS-9965
>                 URL: https://issues.apache.org/jira/browse/HDDS-9965
>             Project: Apache Ozone
>          Issue Type: Bug
>          Components: Snapshot
>            Reporter: Hemant Kumar
>            Assignee: Aswin Shakil
>            Priority: Blocker
>              Labels: ozone-snapshot
>
> Seems like fix for deadlock (in HDDS-9871) caused cache inconsistency. Due to which *SnapDiff* and background services *KeyDeletingService* and *SstFilteringService* are failing.
> *KeyDeletingService* logs:
> {code:java}
> 2023-12-18 19:19:29,563 WARN [KeyDeletingService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed
> java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
>         at org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
>         at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getLatestActiveSnapshot(OmMetadataManagerImpl.java:1691)
>         at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getPendingDeletionKeys(OmMetadataManagerImpl.java:1544)
>         at org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:607)
>         at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:183)
>         at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834) {code}
> {code:java}
> 2023-12-18 19:20:28,768 WARN [KeyDeletingService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed
> java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@7656056
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
>         at org.apache.hadoop.ozone.om.OmSnapshotManager.checkForSnapshot(OmSnapshotManager.java:625)
>         at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getLatestActiveSnapshot(OmMetadataManagerImpl.java:1691)
>         at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.getPendingDeletionKeys(OmMetadataManagerImpl.java:1544)
>         at org.apache.hadoop.ozone.om.KeyManagerImpl.getPendingDeletionKeys(KeyManagerImpl.java:607)
>         at org.apache.hadoop.ozone.om.service.KeyDeletingService$KeyDeletingTask.call(KeyDeletingService.java:183)
>         at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> 2023-12-18 19:20:28,768 WARN [SstFilteringService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed
> java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
>         at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185)
>         at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
> *SSTFilteringService* logs:
> {code:java}
> 2023-12-18 19:21:28,742 ERROR [SstFilteringService#0]-org.apache.hadoop.ozone.om.OmSnapshotManager: Failed to retrieve snapshot: /vol-t2gj8/buck-07uux/snap-5griw
> java.io.IOException: Failed init RocksDB, db path : /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1702927228 acquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
>         at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:180)
>         at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:220)
>         at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:598)
>         at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
>         at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
>         at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:171)
>         at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:167)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
>         at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185)
>         at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.io.IOException: class org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b; status : IOError; message : lock hold by current process, acquire time 1702927228 a
> cquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
>         at org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:667)
>         at org.apache.hadoop.hdds.utils.db.RocksDatabase.toIOException(RocksDatabase.java:91)
>         at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:168)
>         at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:113)
>         ... 18 more
> Caused by: org.rocksdb.RocksDBException: lock hold by current process, acquire time 1702927228 acquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
>         at org.rocksdb.RocksDB.open(Native Method)
>         at org.rocksdb.RocksDB.open(RocksDB.java:307)
>         at org.apache.hadoop.hdds.utils.db.managed.ManagedRocksDB.open(ManagedRocksDB.java:75)
>         at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:154)
>         ... 19 more
> 2023-12-18 19:21:28,743 WARN [SstFilteringService#0]-org.apache.hadoop.hdds.utils.BackgroundService: Background task execution failed
> java.lang.IllegalStateException: java.io.IOException: Failed init RocksDB, db path : /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1702927228 acqu
> iring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:179)
>         at java.base/java.util.concurrent.ConcurrentHashMap.compute(ConcurrentHashMap.java:1908)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:167)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
>         at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:185)
>         at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> Caused by: java.io.IOException: Failed init RocksDB, db path : /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b, exception :org.rocksdb.RocksDBException lock hold by current process, acquire time 1702927228 acquiring thread 139777345
> 017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
>         at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:180)
>         at org.apache.hadoop.hdds.utils.db.DBStoreBuilder.build(DBStoreBuilder.java:220)
>         at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.loadDB(OmMetadataManagerImpl.java:598)
>         at org.apache.hadoop.ozone.om.OmMetadataManagerImpl.<init>(OmMetadataManagerImpl.java:406)
>         at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:357)
>         at org.apache.hadoop.ozone.om.OmSnapshotManager$1.load(OmSnapshotManager.java:1)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.lambda$0(SnapshotCache.java:171)
>         ... 12 more
> Caused by: java.io.IOException: class org.apache.hadoop.hdds.utils.db.RocksDatabase: Failed to open /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b; status : IOError; message : lock hold by current process, acquire time 1702927228 a
> cquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
>         at org.apache.hadoop.hdds.utils.HddsServerUtil.toIOException(HddsServerUtil.java:667)
>         at org.apache.hadoop.hdds.utils.db.RocksDatabase.toIOException(RocksDatabase.java:91)
>         at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:168)
>         at org.apache.hadoop.hdds.utils.db.RDBStore.<init>(RDBStore.java:113)
>         ... 18 more
> Caused by: org.rocksdb.RocksDBException: lock hold by current process, acquire time 1702927228 acquiring thread 139777345017600: /var/lib/hadoop-ozone/om/data913140/db.snapshots/checkpointState/om.db-4e72e3fd-58e4-4814-b8fa-869fb3e8741b/LOCK: No locks available
>         at org.rocksdb.RocksDB.open(Native Method)
>         at org.rocksdb.RocksDB.open(RocksDB.java:307)
>         at org.apache.hadoop.hdds.utils.db.managed.ManagedRocksDB.open(ManagedRocksDB.java:75)
>         at org.apache.hadoop.hdds.utils.db.RocksDatabase.open(RocksDatabase.java:154)
>         ... 19 more{code}
> {code:java}
> 2023-12-18 19:19:28,741 ERROR [SstFilteringService#0]-org.apache.hadoop.ozone.om.SstFilteringService: Exception encountered while filtering a snapshot
> java.io.IOException: Rocks Database is closed
>         at org.apache.hadoop.hdds.utils.db.RocksDatabase.assertClose(RocksDatabase.java:445)
>         at org.apache.hadoop.hdds.utils.db.RocksDatabase.deleteFilesNotMatchingPrefix(RocksDatabase.java:958)
>         at org.apache.hadoop.ozone.om.SstFilteringService$SstFilteringTask.call(SstFilteringService.java:193)
>         at org.apache.hadoop.hdds.utils.BackgroundService$PeriodicalTask.lambda$run$0(BackgroundService.java:121)
>         at java.base/java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1736)
>         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
>         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
>         at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
> *SnapDiff* logs:
> {code:java}
> 2023-12-18 19:33:12,222 ERROR [snapshot-diff-job-thread-id-8]-org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager: Caught unchecked exception during diff report generation for volume: voljngnm bucket: bucketjngnm, fromSnapshot: snap-t1gfn and toSnapshot: snap-h1xmg
> java.lang.IllegalStateException: Cache map entry removal failure. The cache is in an inconsistent state. Expected OmSnapshot instance: org.apache.hadoop.ozone.om.snapshot.ReferenceCounted@4f63f85e, actual: null
>         at com.google.common.base.Preconditions.checkState(Preconditions.java:512)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanupInternal(SnapshotCache.java:313)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.cleanup(SnapshotCache.java:285)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:220)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotCache.get(SnapshotCache.java:153)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.generateSnapshotDiffReport(SnapshotDiffManager.java:841)
>         at org.apache.hadoop.ozone.om.snapshot.SnapshotDiffManager.lambda$2(SnapshotDiffManager.java:722)
>         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at java.base/java.lang.Thread.run(Thread.java:834)
> {code}
> I suspect it might be a race condition between *cleanupInternal()* and *invalidate()*. It wasn't a problem earlier because they were synchronize by ConcurrentHashMap's lock. But it could be something else.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@ozone.apache.org
For additional commands, e-mail: issues-help@ozone.apache.org