You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@ignite.apache.org by "Anton Kalashnikov (Jira)" <ji...@apache.org> on 2021/01/25 12:37:00 UTC

[jira] [Updated] (IGNITE-14055) Deadlock in timeoutObjectProcessor between 'send message' & 'handshake timeout'

     [ https://issues.apache.org/jira/browse/IGNITE-14055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Anton Kalashnikov updated IGNITE-14055:
---------------------------------------
    Summary: Deadlock in timeoutObjectProcessor between 'send message' & 'handshake timeout'  (was: Deadlock in timeoutObjectProcessor between 'send messag'e & 'handshake timeout')

> Deadlock in timeoutObjectProcessor between 'send message' & 'handshake timeout'
> -------------------------------------------------------------------------------
>
>                 Key: IGNITE-14055
>                 URL: https://issues.apache.org/jira/browse/IGNITE-14055
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Anton Kalashnikov
>            Assignee: Anton Kalashnikov
>            Priority: Major
>         Attachments: StartServerWithTxPuts (1).java, freeze (1).sh
>
>
> Cluster hangs after jvm pauses on one of server nodes.
>  Scenario:
>  1. Start three server nodes with put operations using StartServerWithTxPuts.
>  2. Emulate jvm freezes on one server node by running the attached script:
>  {{*sh freeze.sh <jvm process pid>*}}
>  3. Wait until the script has finished.
> Result:
>  The cluster hangs on tx put operations.
> The first server node continuously prints:
> {noformat}
> [2020-11-03 09:36:01,719][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57714][2020-11-03 09:36:01,720][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:01,922][INFO ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57716][2020-11-03 09:36:01,922][INFO ][grid-nio-worker-tcp-comm-0-#23%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:02,124][INFO ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57718][2020-11-03 09:36:02,125][INFO ][grid-nio-worker-tcp-comm-1-#24%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:02,326][INFO ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57720][2020-11-03 09:36:02,327][INFO ][grid-nio-worker-tcp-comm-2-#25%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3][2020-11-03 09:36:02,528][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Accepted incoming communication connection [locAddr=/127.0.0.1:47100, rmtAddr=/127.0.0.1:57722][2020-11-03 09:36:02,529][INFO ][grid-nio-worker-tcp-comm-3-#26%TcpCommunicationSpi%][TcpCommunicationSpi] Received incoming connection from remote node while connecting to this node, rejecting [locNode=5defd32f-5bdb-4b9e-8a6e-5ee268edac42, locNodeOrder=1, rmtNode=07583a9d-36c8-4100-a69c-8cbd26ca82c9, rmtNodeOrder=3]}}
> {noformat}
>  The second node prints long running transactions in prepared state ignoring the default tx timeout:
>  
> {noformat}
> [2020-11-03 09:36:46,199][WARN ][sys-#83%56b4f715-82d6-4d63-ba99-441ffcd673b4%][diagnostic] >>> Future [startTime=09:33:08.496, curTime=09:36:46.181, fut=GridNearTxFinishFuture [futId=425decc8571-4ce98554-8c56-4daf-a7a9-5b9bff52fa08, tx=GridNearTxLocal [mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=true, entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, flags=0]]], prepared=0, locked=false, nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, nodeOrder=2]]], explicitLock=false, queryUpdate=false, dhtVer=null, last=false, nearEntries=0, clientFirst=false, node=07583a9d-36c8-4100-a69c-8cbd26ca82c9]], nearLocallyMapped=false, colocatedLocallyMapped=false, needCheckBackup=null, hasRemoteLocks=false, trackTimeout=false, systemTime=39312675, systemStartTime=0, prepareStartTime=340253826207680, prepareTime=39265601, commitOrRollbackStartTime=340253865473498, commitOrRollbackTime=46297, lb=null, mvccOp=false, qryId=-1, crdVer=0, thread=pool-2-thread-2, mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=true, entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, flags=0]]], prepared=0, locked=false, nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, nodeOrder=2]]], explicitLock=false, queryUpdate=false, dhtVer=null, last=false, nearEntries=0, clientFirst=false, node=07583a9d-36c8-4100-a69c-8cbd26ca82c9]], super=GridDhtTxLocalAdapter [nearOnOriginatingNode=false, nearNodes=KeySetView [], dhtNodes=KeySetView [], explicitLock=false, super=IgniteTxLocalAdapter [completedBase=null, sndTransformedVals=false, depEnabled=false, txState=IgniteTxImplicitSingleStateImpl [init=true, recovery=false, useMvccCaching=false], super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, nodeOrder=2], writeVer=null, implicit=true, loc=true, threadId=123, startTime=1604385188496, nodeId=61733533-7c90-4383-ab98-93f41d7059f4, isolation=READ_COMMITTED, concurrency=OPTIMISTIC, timeout=10000, sysInvalidate=false, sys=false, plc=2, commitVer=GridCacheVersion [topVer=215865159, order=1604385188157, nodeOrder=2], finalizing=NONE, invalidParts=null, state=COMMITTED, timedOut=false, topVer=AffinityTopologyVersion [topVer=3, minorTopVer=1], mvccSnapshot=null, skipCompletedVers=false, parentTx=null, duration=217695ms, onePhaseCommit=false], size=1]]], commit=true, mappings=IgniteTxMappingsSingleImpl [mapping=GridDistributedTxMapping [entries=LinkedHashSet [IgniteTxEntry [txKey=IgniteTxKey [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], cacheId=-923393186], val=TxEntryValueHolder [val=CacheObjectByteArrayImpl [arrLen=1048576], op=CREATE], prevVal=TxEntryValueHolder [val=null, op=NOOP], oldVal=TxEntryValueHolder [val=null, op=NOOP], entryProcessorsCol=null, ttl=-1, conflictExpireTime=-1, conflictVer=null, explicitVer=null, dhtVer=null, filters=CacheEntryPredicate[] [], filtersPassed=false, filtersSet=true, entry=GridDhtDetachedCacheEntry [super=GridDistributedCacheEntry [super=GridCacheMapEntry [key=KeyCacheObjectImpl [part=833, val=833, hasValBytes=true], val=null, ver=GridCacheVersion [topVer=0, order=0, nodeOrder=0], hash=833, extras=null, flags=0]]], prepared=0, locked=false, nodeId=07583a9d-36c8-4100-a69c-8cbd26ca82c9, locMapped=false, expiryPlc=null, transferExpiryPlc=false, flags=0, partUpdateCntr=0, serReadVer=null, xidVer=GridCacheVersion [topVer=215865159, order=1604385188157, nodeOrder=2]]], explicitLock=false, queryUpdate=false, dhtVer=null, last=false, nearEntries=0, clientFirst=false, node=07583a9d-36c8-4100-a69c-8cbd26ca82c9]], trackable=true, finishOnePhaseCalled=false, innerFuts=TransformCollectionView [FinishFuture[node=07583a9d-36c8-4100-a69c-8cbd26ca82c9, loc=false, done=false]], super=GridCompoundIdentityFuture [super=GridCompoundFuture [rdc=AlwaysTrueReducer [], initFlag=1, lsnrCalls=0, done=false, cancelled=false, err=null, futs=TransformCollectionView [false]]]]]{noformat}
>  The third node indicated starvation in the striped pool:
> {noformat}
> [2020-11-03 09:39:25,852][WARN ][grid-timeout-worker-#22%6d42f793-e8d7-4fc1-812a-5336f850d9ff%][G] >>> Possible starvation in striped pool.Thread name: sys-stripe-4-#5%6d42f793-e8d7-4fc1-812a-5336f850d9ff%Queue: []Deadlock: falseCompleted: 359Thread [name="sys-stripe-4-#5%6d42f793-e8d7-4fc1-812a-5336f850d9ff%", id=16, state=TIMED_WAITING, blockCnt=33, waitCnt=1229]        at sun.misc.Unsafe.park(Native Method)        at java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:338)        at o.a.i.i.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:227)        at o.a.i.i.util.future.GridFutureAdapter.get(GridFutureAdapter.java:168)        at o.a.i.spi.communication.tcp.internal.ConnectionClientPool.reserveClient(ConnectionClientPool.java:290)        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage0(TcpCommunicationSpi.java:1146)        at o.a.i.spi.communication.tcp.TcpCommunicationSpi.sendMessage(TcpCommunicationSpi.java:1095)        at o.a.i.i.managers.communication.GridIoManager.send(GridIoManager.java:1812)        at o.a.i.i.managers.communication.GridIoManager.sendToGridTopic(GridIoManager.java:1927)        at o.a.i.i.processors.cache.GridCacheIoManager.send(GridCacheIoManager.java:1257)        at o.a.i.i.processors.cache.distributed.dht.GridDhtTxFinishFuture.finish(GridDhtTxFinishFuture.java:528)        at o.a.i.i.processors.cache.distributed.dht.GridDhtTxFinishFuture.finish(GridDhtTxFinishFuture.java:329)        at o.a.i.i.processors.cache.distributed.dht.GridDhtTxLocal.finishTx(GridDhtTxLocal.java:504)        at o.a.i.i.processors.cache.distributed.dht.GridDhtTxLocal.rollbackDhtLocalAsync(GridDhtTxLocal.java:586)        at o.a.i.i.processors.cache.transactions.IgniteTxHandler.finishDhtLocal(IgniteTxHandler.java:1102)        at o.a.i.i.processors.cache.transactions.IgniteTxHandler.finish(IgniteTxHandler.java:970)        at o.a.i.i.processors.cache.transactions.IgniteTxHandler.processNearTxFinishRequest(IgniteTxHandler.java:925)        at o.a.i.i.processors.cache.transactions.IgniteTxHandler.access$200(IgniteTxHandler.java:134)        at o.a.i.i.processors.cache.transactions.IgniteTxHandler$3.apply(IgniteTxHandler.java:231)        at o.a.i.i.processors.cache.transactions.IgniteTxHandler$3.apply(IgniteTxHandler.java:229)        at o.a.i.i.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1142)        at o.a.i.i.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)        at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)        at o.a.i.i.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)        at o.a.i.i.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)        at o.a.i.i.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308)        at o.a.i.i.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1721)        at o.a.i.i.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1328)        at o.a.i.i.managers.communication.GridIoManager.access$4600(GridIoManager.java:157)        at o.a.i.i.managers.communication.GridIoManager$8.execute(GridIoManager.java:1213)        at o.a.i.i.managers.communication.TraceRunnable.run(TraceRunnable.java:54)        at o.a.i.i.util.StripedExecutor$Stripe.body(StripedExecutor.java:559)        at o.a.i.i.util.worker.GridWorker.run(GridWorker.java:119)        at java.lang.Thread.run(Thread.java:748)
> {noformat}
>  {{The request is sent from the timeout object processor. During this request, the connection hangs. To resolve the hanging, the handshake timeout object should be executed but it is impossible due to it awaits the finishing of the initial request.}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)