Posted to issues@ignite.apache.org by "Henrique (Jira)" <ji...@apache.org> on 2020/09/02 03:32:00 UTC

[jira] [Commented] (IGNITE-11974) infinite loop and 100% cpu in GridDhtPartitionsEvictor: Eviction in progress ...

    [ https://issues.apache.org/jira/browse/IGNITE-11974?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17188952#comment-17188952 ] 

Henrique commented on IGNITE-11974:
-----------------------------------

Hi [~v.pyatkov],

I've set *RebalanceMode.SYNC* for all my distributed caches and it seems to have "solved" it. At least it mitigated the issue to the point where I can no longer reproduce it.
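
For reference, a minimal sketch of that configuration change, assuming the Java-based cache configuration API (the cache name and the surrounding node setup are placeholders, not from the report):
{noformat}
import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.cache.CacheRebalanceMode;
import org.apache.ignite.configuration.CacheConfiguration;
import org.apache.ignite.configuration.IgniteConfiguration;

// Switch a cache to synchronous rebalancing; repeat for every distributed cache.
public class SyncRebalanceConfig {
    public static void main(String[] args) {
        CacheConfiguration<Integer, String> cacheCfg = new CacheConfiguration<>("myDistributedCache");
        // SYNC: operations on this cache wait until the initial rebalancing completes.
        cacheCfg.setRebalanceMode(CacheRebalanceMode.SYNC);

        IgniteConfiguration cfg = new IgniteConfiguration().setCacheConfiguration(cacheCfg);
        try (Ignite ignite = Ignition.start(cfg)) {
            // Node is running with the adjusted cache configuration.
        }
    }
}
{noformat}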

> infinite loop and 100% cpu in GridDhtPartitionsEvictor: Eviction in progress ...
> --------------------------------------------------------------------------------
>
>                 Key: IGNITE-11974
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11974
>             Project: Ignite
>          Issue Type: Bug
>          Components: cache
>    Affects Versions: 2.5
>            Reporter: Igor Kamyshnikov
>            Priority: Blocker
>         Attachments: eviction-in-progress-dumps.zip, image-2019-07-10-16-07-37-185.png, server-node-restarts-1.png
>
>
> Note: RCA (root cause analysis) has not been done.
> Sometimes Ignite server nodes fall into an infinite loop and consume 100% CPU:
> {noformat}
> "sys-#260008" #260285 prio=5 os_prio=0 tid=0x00007fabb020a800 nid=0x1e850 runnable [0x00007fab26fef000]
>    java.lang.Thread.State: RUNNABLE
> 	at java.util.concurrent.ConcurrentHashMap$Traverser.advance(ConcurrentHashMap.java:3339)
> 	at java.util.concurrent.ConcurrentHashMap$ValueIterator.next(ConcurrentHashMap.java:3439)
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor$1.call(GridDhtPartitionsEvictor.java:84)
> 	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor$1.call(GridDhtPartitionsEvictor.java:73)
> 	at org.apache.ignite.internal.util.IgniteUtils.wrapThreadLoader(IgniteUtils.java:6695)
> 	at org.apache.ignite.internal.processors.closure.GridClosureProcessor$2.body(GridClosureProcessor.java:967)
> 	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:110)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> 	at java.lang.Thread.run(Thread.java:748)
>    Locked ownable synchronizers:
> 	- <0x0000000649b9cba0> (a java.util.concurrent.ThreadPoolExecutor$Worker)
> {noformat}
> The following appears in the logs every 2 minutes:
> {noformat}
> #### INFO  2019-07-08 12:21:45.081 (1562581305081) [sys-#98168] [GridDhtPartitionsEvictor] > Eviction in progress [grp=CUSTPRODINVOICEDISCUSAGE, remainingCnt=102]
> {noformat}
> remainingCnt stays at 102 once it reaches that value (the very first such line in the logs had a value of 101).
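> For reference, a small self-contained sketch that scans an Ignite log for these "Eviction in progress" messages and reports whether remainingCnt is still shrinking. The log path, the regex and the "stuck" threshold are assumptions for illustration, and it does not split the counts per cache group.
> {noformat}
> import java.io.IOException;
> import java.nio.file.Files;
> import java.nio.file.Paths;
> import java.util.regex.Matcher;
> import java.util.regex.Pattern;
>
> // Flags the symptom described above: remainingCnt stops decreasing.
> public class EvictionProgressCheck {
>     private static final Pattern CNT = Pattern.compile("Eviction in progress .*remainingCnt=(\\d+)");
>
>     public static void main(String[] args) throws IOException {
>         String path = args.length > 0 ? args[0] : "ignite.log"; // hypothetical log location
>         Integer prev = null;
>         int unchanged = 0;
>         for (String line : Files.readAllLines(Paths.get(path))) {
>             Matcher m = CNT.matcher(line);
>             if (!m.find())
>                 continue;
>             int cnt = Integer.parseInt(m.group(1));
>             unchanged = (prev != null && cnt == prev) ? unchanged + 1 : 0;
>             prev = cnt;
>         }
>         if (unchanged >= 10) // ~20 minutes of identical messages at one per 2 minutes
>             System.out.println("remainingCnt stuck at " + prev + " for " + unchanged + " consecutive messages");
>         else
>             System.out.println("Eviction appears to progress (last remainingCnt=" + prev + ")");
>     }
> }
> {noformat}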
> Some other facts:
> We have a heap dump taken at *topVer = 900*. The problem appeared after *topVer = 790*, but it looks like it had been silently pending since *topVer = 641* (about 24 hours earlier).
> There were 259 topology changes between topVer 641 and 900.
> All 102 GridDhtLocalPartition instances can be found in the heap dump:
> {noformat}
> select * from "org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition" t where delayedRenting = true
> {noformat}
> They all have status = 65537, which means (according to org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtLocalPartition#state):
> reservations(65537) = 1
> getPartState(65537) = OWNING
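> For illustration, a rough decoder consistent with the two values above, assuming the partition state ordinal is packed into the upper 16 bits and the reservation count into the lower 16 bits (65537 = 0x10001); this is a sketch, not the actual GridDhtLocalPartition source, and the real field layout may differ between versions:
> {noformat}
> // Hypothetical decoder matching reservations(65537) = 1 and getPartState(65537) = OWNING.
> // Assumes the state ordinal sits in the upper 16 bits (MOVING = 0, OWNING = 1, ...) and
> // the reservation count in the lower 16 bits.
> public class PartitionStateDecoder {
>     public static void main(String[] args) {
>         long state = 65537L;                       // value seen in the heap dump
>         int reservations = (int) (state & 0xFFFF); // -> 1
>         int stateOrdinal = (int) (state >>> 16);   // -> 1, i.e. OWNING
>         System.out.println("reservations=" + reservations + ", stateOrdinal=" + stateOrdinal);
>     }
> }
> {noformat}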
> There are also 26968 instances of org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl$$Lambda$70, created by the org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionTopologyImpl#checkEvictions method.
> 26418 of the 26968 refer to an AtomicInteger instance with value = 102:
> 26418 / 102 = 259 = 900 - 641 (see the topology info above).
> The key thing seen from the heap dump is that topVer = 641 or topVer = 642 was the last topology in which these 102 partitions were assigned to the current Ignite server node.
> {noformat}
> select
>   t.this
>  ,t.this['clientEvtChange'] as clientEvtChange
>  ,t.this['topVer.topVer'] as topVer
>  ,t.this['assignment.elementData'][555]['elementData'][0]['hostNames.elementData'][0] as primary_part
>  ,t.this['assignment.elementData'][555]['elementData'][1]['hostNames.elementData'][0] as secondary_part
> from org.apache.ignite.internal.processors.affinity.HistoryAffinityAssignment t where length(t.this['assignment.elementData']) = 1024
> order by topVer
> {noformat}
>  !image-2019-07-10-16-07-37-185.png! 
> A client node connecting at topVer = 790 somehow triggered the GridDhtPartitionsEvictor loop to start executing.
> Summary:
> 1) 102 partitions each have one reservation and are in the OWNING state.
> 2) They were backup partitions.
> 3) For some reason their eviction has been silently delayed (because of the reservations), but each topology change seemed to trigger a new eviction attempt.
> 4) Something caused org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtPartitionsEvictor#evictPartitionAsync to run without ever exiting (see the toy sketch below).
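> A toy illustration (not the actual Ignite code) of the failure mode in points 3) and 4): a retry loop over a set of partitions that never terminates, because every remaining partition holds a reservation and therefore refuses eviction. All class and method names are made up for this sketch.
> {noformat}
> import java.util.Map;
> import java.util.concurrent.ConcurrentHashMap;
>
> // Toy model only: partitions with a non-zero reservation count can never be
> // evicted, so the loop below spins forever, mirroring the 100% CPU symptom.
> public class EvictionLoopSketch {
>     static final class Part {
>         final int id;
>         final int reservations;
>
>         Part(int id, int reservations) { this.id = id; this.reservations = reservations; }
>
>         boolean tryEvict() { return reservations == 0; } // reserved partitions refuse eviction
>     }
>
>     public static void main(String[] args) {
>         Map<Integer, Part> toEvict = new ConcurrentHashMap<>();
>         for (int i = 0; i < 102; i++)
>             toEvict.put(i, new Part(i, 1)); // 102 partitions, each with one reservation
>
>         while (!toEvict.isEmpty()) { // never exits: no partition is ever removed
>             for (Part p : toEvict.values()) {
>                 if (p.tryEvict())
>                     toEvict.remove(p.id);
>             }
>             System.out.println("Eviction in progress [remainingCnt=" + toEvict.size() + "]");
>         }
>     }
> }
> {noformat}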
> Additional info:
> topVer = 641 was part of a chain of server node restarts (not sure whether rebalancing actually succeeded):
>  !server-node-restarts-1.png! 


