Posted to issues@ignite.apache.org by "Sergey Uttsel (Jira)" <ji...@apache.org> on 2023/01/03 07:48:00 UTC

[jira] [Commented] (IGNITE-18448) Deadlock on node stop.

    [ https://issues.apache.org/jira/browse/IGNITE-18448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17653842#comment-17653842 ] 

Sergey Uttsel commented on IGNITE-18448:
----------------------------------------

I ran the tests from https://github.com/apache/ignite-3/pull/1465 on the main branch. I reproduced the stack trace from the description, but I see that it is not a deadlock: in 'DistributionZoneManager#initMetaStorageKeysOnStart', the synchronous invocation 'metaStorageManager.get(zonesLogicalTopologyVersionKey()).get()' fails with a TimeoutException, which unblocks DistributionZoneManager#busyLock, and so on.
Also, in https://github.com/apache/ignite-3/pull/1426 I removed the synchronous '.get()' call on the future from the metastorage, which was the root cause of the issue.
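
To illustrate the difference between a blocking '.get()' and a non-blocking continuation, here is a minimal sketch. It is not the code from either pull request: the class AsyncInitSketch, its methods, and the one-second timeout are hypothetical, and a plain CompletableFuture stands in for the future returned by the metastorage.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch; none of these names exist in Ignite.
final class AsyncInitSketch {
    // Blocking style: the calling thread is parked until the future completes
    // or the timeout expires. Any lock held by the caller stays held meanwhile.
    static long initBlocking(CompletableFuture<Long> versionFuture) throws Exception {
        return versionFuture.get(1, TimeUnit.SECONDS);
    }

    // Non-blocking style: the continuation runs when the future completes,
    // so the calling thread returns immediately and holds nothing while waiting.
    static CompletableFuture<String> initAsync(CompletableFuture<Long> versionFuture) {
        return versionFuture.handle((version, error) ->
                error == null ? "initialized at version " + version : "initialization skipped");
    }

    public static void main(String[] args) {
        CompletableFuture<Long> versionFuture = new CompletableFuture<>();
        CompletableFuture<String> result = initAsync(versionFuture); // returns at once
        versionFuture.complete(42L); // simulates the storage answering later
        System.out.println(result.join());
    }
}
```

With the non-blocking variant, a stop request does not have to wait for a timeout inside the start path before it can take the lock.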

> Deadlock on node stop.
> ----------------------
>
>                 Key: IGNITE-18448
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18448
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Andrey Mashenkov
>            Priority: Major
>              Labels: ignite-3
>
> Two threads fall into a deadlock when trying to remove nodes from the collection.
> See the stack traces below.
> 1. ConcurrentHashMap.compute methods must not perform blocking operations.
> 2. IgnitionImpl.doStart() adds a node to the _readyForInitNodes_ collection in a racy way. 
> The call _nodeToStart.start(cfgContent)_ returns a future that may fail before the _nodeToStart_ object is added to the collection.
> 3. If the future (_nodeToStart.start(cfgContent)_) fails, it is possible that some components have already started and hold resources that will never be released. It seems that, in case of failure, _nodeToStart.stop()_ has to be called.
> {noformat}
> "%node1%Raft-Group-Client-12@21385" prio=5 tid=0x127e nid=NA waiting for monitor entry
>   java.lang.Thread.State: BLOCKED
> 	 waiting for Test worker@1 to release lock on <0x59df> (a java.util.concurrent.ConcurrentHashMap$Node)
> 	  at java.util.concurrent.ConcurrentHashMap.replaceNode(ConcurrentHashMap.java:1122)
> 	  at java.util.concurrent.ConcurrentHashMap.remove(ConcurrentHashMap.java:1102)
> 	  at org.apache.ignite.internal.app.IgnitionImpl.handleStartException(IgnitionImpl.java:235)
> {noformat}
> {noformat}
> "Test worker@1" prio=5 tid=0x1 nid=NA sleeping
>   java.lang.Thread.State: TIMED_WAITING
> 	 blocks %node1%Raft-Group-Client-12@21385
> 	  at java.lang.Thread.sleep(Thread.java:-1)
> 	  at org.apache.ignite.internal.util.IgniteSpinReadWriteLock.writeLock(IgniteSpinReadWriteLock.java:255)
> 	  at org.apache.ignite.internal.util.IgniteSpinBusyLock.block(IgniteSpinBusyLock.java:68)
> 	  at org.apache.ignite.internal.distributionzones.DistributionZoneManager.stop(DistributionZoneManager.java:288)
> 	  at org.apache.ignite.internal.app.LifecycleManager.lambda$stopAllComponents$1(LifecycleManager.java:133)
> 	  at org.apache.ignite.internal.app.LifecycleManager$$Lambda$3032.1586776480.accept(Unknown Source:-1)
> 	  at java.util.Iterator.forEachRemaining(Iterator.java:133)
> 	  at org.apache.ignite.internal.app.LifecycleManager.stopAllComponents(LifecycleManager.java:131)
> 	  - locked <0x59de> (a org.apache.ignite.internal.app.LifecycleManager)
> 	  at org.apache.ignite.internal.app.LifecycleManager.stopNode(LifecycleManager.java:115)
> 	  at org.apache.ignite.internal.app.IgniteImpl.stop(IgniteImpl.java:642)
> 	  at org.apache.ignite.internal.app.IgnitionImpl.lambda$stop$0(IgnitionImpl.java:145)
> 	  at org.apache.ignite.internal.app.IgnitionImpl$$Lambda$3001.460355950.apply(Unknown Source:-1)
> 	  at java.util.concurrent.ConcurrentHashMap.computeIfPresent(ConcurrentHashMap.java:1822)
> 	  - locked <0x59df> (a java.util.concurrent.ConcurrentHashMap$Node)
> 	  at org.apache.ignite.internal.app.IgnitionImpl.stop(IgnitionImpl.java:144)
> 	  at org.apache.ignite.IgnitionManager.stop(IgnitionManager.java:116)
> {noformat}
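
Point 1 of the description (no blocking operations inside ConcurrentHashMap.compute methods) can be sketched as follows. This is only an illustration under assumptions, not the IgnitionImpl code: NodeRegistrySketch, its Node interface, and all method names are hypothetical.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names exist in Ignite.
final class NodeRegistrySketch {
    /** Stand-in for a node whose stop() may block for a long time. */
    interface Node {
        void stop();
    }

    private final Map<String, Node> nodes = new ConcurrentHashMap<>();

    void register(String name, Node node) {
        nodes.put(name, node);
    }

    boolean contains(String name) {
        return nodes.containsKey(name);
    }

    // Risky: stop() runs while the map holds the lock on this key's bin, so any
    // other thread that calls remove() for the same key blocks until stop() returns.
    void stopInsideCompute(String name) {
        nodes.computeIfPresent(name, (key, node) -> {
            node.stop();
            return null; // returning null removes the mapping
        });
    }

    // Safer: remove() is atomic, so exactly one caller obtains the node,
    // and the potentially blocking stop() runs without any map lock held.
    boolean stopOutsideCompute(String name) {
        Node node = nodes.remove(name);
        if (node == null) {
            return false; // already removed by another caller
        }
        node.stop();
        return true;
    }
}
```

One trade-off of the second variant: the name becomes free in the map before stop() has finished, so a caller that needs "stopped" and "removed" to be observed together would need additional coordination.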



--
This message was sent by Atlassian Jira
(v8.20.10#820010)