Posted to issues@ignite.apache.org by "Mirza Aliev (Jira)" <ji...@apache.org> on 2021/01/28 14:12:00 UTC
[jira] [Created] (IGNITE-14093) ttl-cleanup-worker fails with AssertionError and leads to CorruptedTreeException
Mirza Aliev created IGNITE-14093:
------------------------------------
Summary: ttl-cleanup-worker fails with AssertionError and leads to CorruptedTreeException
Key: IGNITE-14093
URL: https://issues.apache.org/jira/browse/IGNITE-14093
Project: Ignite
Issue Type: Bug
Affects Versions: 2.9.1
Reporter: Mirza Aliev
Assignee: Mirza Aliev
Attachments: IgnitePdsWithTtlDeferredDeleteOnRestartTest (1).java
This issue is very rare. It is quite hard to reproduce on macOS, although some Windows users have reproduced it somewhat more often.
Scenario:
# 2 baseline nodes, cache with expiry policy = 60 sec.
# Put some entries in the cache, stop one node immediately.
# Remove node from baseline.
# Wait until expiration.
# Start the stopped node: an NPE is thrown on node start.
{code:java}
[2020-05-08 16:07:17,925][ERROR][ttl-cleanup-worker-#43][root] Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=java.lang.NullPointerException]]
java.lang.NullPointerException
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2765)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2696)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1073)
at org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:242)
at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:178)
at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:177)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
at java.lang.Thread.run(Thread.java:748)
{code}
In some cases, the following stack trace appears instead:
{code:java}
[2020-05-25 10:49:29,677][ERROR][ttl-cleanup-worker-#242%db.IgnitePdsWithTtlDeferredDeleteOnRestartTest2%][IgniteTestResources] Critical system error detected. Will be handled accordingly to configured handler [hnd=NoOpFailureHandler [super=AbstractFailureHandler [ignoredFailureTypes=UnmodifiableSet [SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=CRITICAL_ERROR, err=class o.a.i.i.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-1237460590, val2=0]], groupName=group1, msg=Runtime failure on bounds: [lower=PendingRow [], upper=PendingRow []]]]]
class org.apache.ignite.internal.processors.cache.persistence.tree.CorruptedTreeException: B+Tree is corrupted [pages(groupId, pageId)=[IgniteBiTuple [val1=-1237460590, val2=0]], groupName=group1, msg=Runtime failure on bounds: [lower=PendingRow [], upper=PendingRow []]]
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.corruptedTreeException(BPlusTree.java:6110)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1119)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1083)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1078)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpiredInternal(GridCacheOffheapManager.java:2742)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager$GridCacheDataStore.purgeExpired(GridCacheOffheapManager.java:2696)
at org.apache.ignite.internal.processors.cache.persistence.GridCacheOffheapManager.expire(GridCacheOffheapManager.java:1073)
at org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:242)
at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.lambda$body$0(GridCacheSharedTtlCleanupManager.java:178)
at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1769)
at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:177)
at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:119)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.AssertionError: FullPageId [pageId=0001000100000007, effectivePageId=0000000100000007, grpId=-1237460590]
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:822)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:696)
at org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl.acquirePage(PageMemoryImpl.java:685)
at org.apache.ignite.internal.processors.cache.persistence.DataStructure.acquirePage(DataStructure.java:156)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.acquirePage(BPlusTree.java:6041)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.findDown(BPlusTree.java:1420)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.doFind(BPlusTree.java:1397)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.access$8200(BPlusTree.java:98)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree$AbstractForwardCursor.find(BPlusTree.java:5563)
at org.apache.ignite.internal.processors.cache.persistence.tree.BPlusTree.find(BPlusTree.java:1103)
... 11 more
{code}
To increase the chances of reproducing the issue, it may help to add the following delay
{code:java}
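else if (relPtr == OUTDATED_REL_PTR) {
    // Debug-only delay: widens the race window before the assert below.
    try {
        Thread.sleep(1000);
    }
    catch (InterruptedException e) {
        e.printStackTrace();
    }

    assert PageIdUtils.pageIndex(pageId) == 0 : fullId;
    // ... the rest of the existing branch is unchanged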
{code}
in {{org.apache.ignite.internal.processors.cache.persistence.pagemem.PageMemoryImpl}}.
Root cause: for a node that was removed from the baseline, there is a gap between the restart and the moment when the partition exchange future calls initCachesOnLocalJoin and stops the caches on that node. The TTL cleanup worker runs during that gap and keeps running even after the caches are stopped, because the TTL manager (GridCacheSharedTtlCleanupManager) caches a mapping between caches and managers. The solution is to unregister the managers for all caches before onBaselineChange in initCachesOnLocalJoin.
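The stale-mapping problem described above can be illustrated with a minimal, self-contained sketch. It is not Ignite code: SketchTtlManager, SketchTtlRegistry and TtlRaceSketch are hypothetical stand-ins, and the IllegalStateException only mimics the NPE/AssertionError from the stack traces. The only detail taken from the report is that the worker visits managers through ConcurrentHashMap.computeIfPresent.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical stand-in for a per-cache TTL manager; not Ignite's real class. */
class SketchTtlManager {
    private volatile boolean stopped;

    void stop() {
        stopped = true;
    }

    /** Touching the store of a stopped cache fails (assumption: mimics the reported errors). */
    void expire() {
        if (stopped)
            throw new IllegalStateException("cache store is already stopped");
    }
}

/** Hypothetical stand-in for the shared cleanup manager that caches cacheId -> manager. */
class SketchTtlRegistry {
    private final Map<Integer, SketchTtlManager> mgrs = new ConcurrentHashMap<>();

    void register(int cacheId, SketchTtlManager mgr) {
        mgrs.put(cacheId, mgr);
    }

    void unregister(int cacheId) {
        mgrs.remove(cacheId);
    }

    /** One pass of the cleanup worker: calls expire() on every manager still registered. */
    void cleanupPass() {
        for (Integer id : mgrs.keySet())
            mgrs.computeIfPresent(id, (k, m) -> {
                m.expire();
                return m;
            });
    }
}

public class TtlRaceSketch {
    /** Stops the cache, then runs one worker pass; returns true if the pass failed. */
    static boolean passFailsAfterStop(boolean unregisterFirst) {
        SketchTtlRegistry reg = new SketchTtlRegistry();
        SketchTtlManager mgr = new SketchTtlManager();
        reg.register(1, mgr);

        if (unregisterFirst)
            reg.unregister(1); // the proposed fix: drop the mapping before the cache stops

        mgr.stop();

        try {
            reg.cleanupPass();
            return false;
        }
        catch (IllegalStateException e) {
            return true; // the worker reached a manager whose cache is already stopped
        }
    }

    public static void main(String[] args) {
        System.out.println("stale mapping fails: " + passFailsAfterStop(false));
        System.out.println("unregistered first fails: " + passFailsAfterStop(true));
    }
}
```

In this sketch the pass fails only when the mapping outlives the stopped cache. The real fix additionally has to order the unregistration before onBaselineChange, which the sketch does not model.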