Posted to user@ignite.apache.org by "Abhishek Gupta (BLOOMBERG/ 919 3RD A)" <ag...@bloomberg.net> on 2020/02/03 16:13:32 UTC

Re: "Adding entry to partition that is concurrently evicted" error

Thanks, Andrei. Looking at my exception (see below), it seems to be related to https://issues.apache.org/jira/browse/IGNITE-11620, in that it occurred while expiration was going on.

1. As a workaround, would it be valid to increase my TTL to reduce the possibility of this occurring?
2. My worry about using "NoOpFailureHandler" is that the error would still occur and might leave the node in a bad state, which could be just as bad as, or worse than, killing the node.

If you can confirm that 1. is a valid line of defense (albeit not air-tight), that would be great - a sketch of what I mean is below.
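For reference, a minimal sketch of what raising the TTL could look like, assuming the expiry policy is set programmatically in Java; the 24-hour duration and the class name LongerTtlConfig are illustrative only, not taken from the actual setup:

import java.util.concurrent.TimeUnit;

import javax.cache.expiry.CreatedExpiryPolicy;
import javax.cache.expiry.Duration;

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

public class LongerTtlConfig {
    public static void main(String[] args) {
        CacheConfiguration<String, String> cacheCfg = new CacheConfiguration<>("mainCache");

        // Raise the TTL (here: 24 hours) so that fewer entries are due for
        // expiration at any given moment. This only reduces how often the
        // ttl-cleanup-worker races with a concurrent partition eviction;
        // it does not fix the underlying issue.
        cacheCfg.setExpiryPolicyFactory(
            CreatedExpiryPolicy.factoryOf(new Duration(TimeUnit.HOURS, 24)));

        IgniteConfiguration cfg = new IgniteConfiguration().setCacheConfiguration(cacheCfg);

        Ignition.start(cfg);
    }
}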

Thanks,
Abhishek

P.S. My exception is below. Note that it occurs in 'expire()' - a stack trace similar to the one in IGNITE-11620.


[ERROR] ttl-cleanup-worker-#159 - Critical system error detected. Will be handled accordingly to configured handler [hnd=StopNodeOrHaltFailureHandler [tryStop=false, timeout=0, super=AbstractFailureHandler [ignoredFailureTypes=[SYSTEM_WORKER_BLOCKED, SYSTEM_CRITICAL_OPERATION_TIMEOUT]]], failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.i.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException [part=1013, msg=Adding entry to partition that is concurrently evicted [grp=mainCache, part=1013, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]]]]]
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException: Adding entry to partition that is concurrently evicted [grp=mainCache, part=1013, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]]
    at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.localPartition0(GridDhtPartitionTopologyImpl.java:950) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtPartitionTopologyImpl.localPartition(GridDhtPartitionTopologyImpl.java:825) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridCachePartitionedConcurrentMap.localPartition(GridCachePartitionedConcurrentMap.java:70) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridCachePartitionedConcurrentMap.putEntryIfObsoleteOrAbsent(GridCachePartitionedConcurrentMap.java:89) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.entryEx(GridCacheAdapter.java:1008) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtCacheAdapter.entryEx(GridDhtCacheAdapter.java:544) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheAdapter.entryEx(GridCacheAdapter.java:999) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.expireInternal(IgniteCacheOffheapManagerImpl.java:1403) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.IgniteCacheOffheapManagerImpl.expire(IgniteCacheOffheapManagerImpl.java:1347) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheTtlManager.expire(GridCacheTtlManager.java:207) ~[ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.processors.cache.GridCacheSharedTtlCleanupManager$CleanupWorker.body(GridCacheSharedTtlCleanupManager.java:139) [ignite-core-2.7.5-0-2.jar:2.7.5]
    at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120) [ignite-core-2.7.5-0-2.jar:2.7.5]
    at java.lang.Thread.run(Thread.java:748) [?:1.8.0_222]

From: user@ignite.apache.org At: 01/31/20 05:11:57 To: user@ignite.apache.org
Subject: Re: "Adding entry to partition that is concurrently evicted" error

                  
Hi,
      
This problem should be fixed in ignite-2.8. I am not sure why the fix isn't part of ignite-2.7.6.
      
https://issues.apache.org/jira/browse/IGNITE-11127
      
Your cluster was stopped because the configured failure handler was triggered.
      
https://apacheignite.readme.io/docs/critical-failures-handling#section-failure-handling
      
I am not sure about possible workarounds here (you could probably set the NoOpFailureHandler; a configuration sketch is shown after the link below). You can also try starting a thread on the developer list:
      
http://apache-ignite-developers.2346864.n4.nabble.com/Apache-Ignite-2-7-release-td34076i40.html
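A minimal sketch of that NoOpFailureHandler suggestion, assuming programmatic (Java) configuration rather than Spring XML; the class name NoOpFailureHandlerConfig is illustrative only, and the underlying GridDhtInvalidPartitionException will still occur - the handler merely prevents the node from being halted:

import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;
import org.apache.ignite.failure.NoOpFailureHandler;

public class NoOpFailureHandlerConfig {
    public static void main(String[] args) {
        // Replace the default StopNodeOrHaltFailureHandler so that a
        // SYSTEM_WORKER_TERMINATION failure is logged but does not halt the JVM.
        // The node may be left in a degraded state, so treat this as a stop-gap.
        IgniteConfiguration cfg = new IgniteConfiguration()
            .setFailureHandler(new NoOpFailureHandler());

        Ignition.start(cfg);
    }
}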
      
BR,
Andrei
On 1/29/2020 1:58 AM, Abhishek Gupta (BLOOMBERG/ 919 3RD A) wrote:
         
                          
Hello!

I've got a 6-node Ignite 2.7.5 grid. I had this strange issue where multiple nodes hit the following exception -

[ERROR] [sys-stripe-53-#54] GridCacheIoManager - Failed to process message [senderId=f4a736b6-cfff-4548-a8b4-358d54d19ac6, messageType=class o.a.i.i.processors.cache.distributed.near.GridNearGetRequest]
org.apache.ignite.internal.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException: Adding entry to partition that is concurrently evicted [grp=mainCache, part=733, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]]

and then died after

2020-01-27 13:30:19.849 [ERROR] [ttl-cleanup-worker-#159] - JVM will be halted immediately due to the failure: [failureCtx=FailureContext [type=SYSTEM_WORKER_TERMINATION, err=class o.a.i.i.processors.cache.distributed.dht.topology.GridDhtInvalidPartitionException [part=1013, msg=Adding entry to partition that is concurrently evicted [grp=mainCache, part=1013, shouldBeMoving=, belongs=false, topVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1], curTopVer=AffinityTopologyVersion [topVer=1978, minorTopVer=1]]]]]

The sequence of events was simply the following -

One of the nodes (let's call it node 1) was down for 2.5 hours and was then restarted. After a configured delay of 20 mins, it started to rebalance from the other 5 nodes. No other nodes joined or left in this period. 40 minutes into the rebalance, the above errors started showing on the other nodes and they just bounced, and as a result there was data loss.

I found a few links related to this, but nothing that explained the root cause or what my workaround could be -

* http://apache-ignite-users.70518.x6.nabble.com/Adding-entry-to-partition-that-is-concurrently-evicted-td24782.html#a24786
* https://issues.apache.org/jira/browse/IGNITE-9803
* https://issues.apache.org/jira/browse/IGNITE-11620
Thanks,
Abhishek