Posted to issues@ignite.apache.org by "Pavel Kovalenko (JIRA)" <ji...@apache.org> on 2019/04/19 14:34:00 UTC

[jira] [Commented] (IGNITE-11762) Test testClientStartCloseServersRestart causes hang of the whole Cache 2 suite in master

    [ https://issues.apache.org/jira/browse/IGNITE-11762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16821959#comment-16821959 ] 

Pavel Kovalenko commented on IGNITE-11762:
------------------------------------------

Investigation showed that the failure is not related to IGNITE-10799.
Here is the stack trace that causes the transaction to hang:

{noformat}
java.lang.AssertionError:
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.calculatePartitionUpdateCounters(IgniteTxLocalAdapter.java:498)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxLocalAdapter.userPrepare(IgniteTxLocalAdapter.java:438)
	at org.apache.ignite.internal.processors.cache.distributed.dht.GridDhtTxLocal.prepareAsync(GridDhtTxLocal.java:403)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.prepareNearTx(IgniteTxHandler.java:570)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.prepareNearTx(IgniteTxHandler.java:367)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.processNearTxPrepareRequest0(IgniteTxHandler.java:178)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.processNearTxPrepareRequest(IgniteTxHandler.java:156)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler.access$000(IgniteTxHandler.java:118)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler$1.apply(IgniteTxHandler.java:198)
	at org.apache.ignite.internal.processors.cache.transactions.IgniteTxHandler$1.apply(IgniteTxHandler.java:196)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.processMessage(GridCacheIoManager.java:1141)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.onMessage0(GridCacheIoManager.java:591)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:392)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.handleMessage(GridCacheIoManager.java:318)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager.access$100(GridCacheIoManager.java:109)
	at org.apache.ignite.internal.processors.cache.GridCacheIoManager$1.onMessage(GridCacheIoManager.java:308)
	at org.apache.ignite.internal.managers.communication.GridIoManager.invokeListener(GridIoManager.java:1561)
	at org.apache.ignite.internal.managers.communication.GridIoManager.processRegularMessage0(GridIoManager.java:1189)
	at org.apache.ignite.internal.managers.communication.GridIoManager.access$4200(GridIoManager.java:127)
	at org.apache.ignite.internal.managers.communication.GridIoManager$8.run(GridIoManager.java:1086)
	at org.apache.ignite.internal.util.StripedExecutor$Stripe.body(StripedExecutor.java:550)
	at org.apache.ignite.internal.util.worker.GridWorker.run(GridWorker.java:120)
{noformat}

This means that the transaction was mapped to a primary node whose partition is in MOVING state, which should be impossible.
How can we end up in such a situation? Here is the scenario:
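The failed assertion enforces an invariant along the following lines. This is a simplified sketch under stated assumptions: the {{PartState}} enum and the {{checkPrimaryState}} helper are illustrative names, not the real Ignite API. The idea is that update counters may only be calculated for a partition that the primary node actually owns.

{code:java}
// Hypothetical sketch of the invariant behind the AssertionError above.
// PartState and checkPrimaryState are illustrative names, not Ignite API.
public class PartitionInvariant {
    public enum PartState { MOVING, OWNING, RENTING, LOST }

    // On the primary node, update counters may only be assigned to a
    // partition that the node actually owns.
    public static void checkPrimaryState(PartState state) {
        if (state != PartState.OWNING)
            throw new AssertionError(
                "Primary maps a transaction to a non-OWNING partition: " + state);
    }

    public static void main(String[] args) {
        checkPrimaryState(PartState.OWNING); // fine

        try {
            // The scenario described in this ticket: primary partition is MOVING.
            checkPrimaryState(PartState.MOVING);
        }
        catch (AssertionError e) {
            System.out.println("assertion fired: " + e.getMessage());
        }
    }
}
{code}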

Environment:
Cache with 1 backup, in-memory mode, partition loss policy is IGNORE.
Nodes: Crd, Node1, Node2.

1. The partition has the following state:
Node1 (OWNING) [Primary], Node2 (MOVING)
Node2 is currently rebalancing the partition's data from Node1.

2. Node1 leaves the topology, and the rebalancing is canceled.

3. Affinity assigns this partition to the Crd node, which creates it in MOVING state here:
Method:
{code:java}
org.apache.ignite.internal.processors.cache.distributed.dht.preloader.GridDhtPartitionsExchangeFuture#finishExchangeOnCoordinator
{code}
Section:
{code:java}
doInParallel(
    parallelismLvl,
    cctx.kernalContext().getSystemExecutorService(),
    cctx.affinity().cacheGroups().values(),
    desc -> {
        if (desc.config().getCacheMode() == CacheMode.LOCAL)
            return null;

        CacheGroupContext grp = cctx.cache().cacheGroup(desc.groupId());

        GridDhtPartitionTopology top = grp != null ? grp.topology() :
            cctx.exchange().clientTopology(desc.groupId(), events().discoveryCache());

        top.beforeExchange(this, true, true);

        return null;
    });
{code}

4. After that, the partition has the following state:
Node2 (MOVING) [Primary], Crd (MOVING)

5. This partition is immediately owned on the coordinator node due to the IGNORE partition loss policy, in the following code block:
{code:java}
if (exchCtx.events().hasServerLeft())
    detectLostPartitions(resTopVer);
{code}

6. A FullMap with the partition in OWNING state is sent to Node2.

7. Node2 can't mark this partition as LOST, because it sees one owner, which is the coordinator node.

8. The exchange finishes. Node2 starts to rebalance that partition from the Crd node.

9. A transaction is mapped to the MOVING partition on Node2 and fails with the assertion above.
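Step 7 can be modeled as follows. This is a simplified sketch, not the real Ignite loss-detection code; {{resolveState}} is a hypothetical helper. The assumption it encodes: a node may declare a partition LOST only when the FullMap shows no owner, so the coordinator's forced OWNING state (step 5) hides the loss from Node2.

{code:java}
import java.util.List;

// Simplified, hypothetical model of the loss-detection decision in step 7.
// resolveState is an illustrative helper, not the real Ignite logic.
public class LossDetection {
    public enum PartState { MOVING, OWNING, LOST }

    // A node can mark a partition LOST only if the FullMap shows no owner.
    public static PartState resolveState(List<PartState> fullMapStates) {
        boolean hasOwner = fullMapStates.contains(PartState.OWNING);
        return hasOwner ? PartState.MOVING : PartState.LOST;
    }

    public static void main(String[] args) {
        // Steps 6-7: Crd reports OWNING (forced by the IGNORE policy),
        // Node2 reports MOVING -> Node2 cannot declare the partition LOST.
        System.out.println(resolveState(List.of(PartState.OWNING, PartState.MOVING)));

        // Had Crd reported MOVING, Node2 could have detected the loss itself.
        System.out.println(resolveState(List.of(PartState.MOVING, PartState.MOVING)));
    }
}
{code}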

Correct behavior:
We should provide Node2 with correct information about the LOST partition that was immediately owned after step 5.
We should either send MOVING for that partition in the FullMap, so that Node2 can independently detect it as LOST,
or explicitly indicate in the FullMap that this partition is marked as lost, so that Node2 marks it as LOST without fake rebalancing.
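The two proposed fixes can be sketched from Node2's point of view. Again, this is a hedged sketch, not the real Ignite code: {{stateOnNode2}} and the explicit-lost flag are hypothetical names standing in for whatever form the FullMap change would actually take.

{code:java}
import java.util.List;

// Hypothetical sketch of the two proposed fixes, as seen from Node2.
// Names are illustrative, not the real Ignite API.
public class ProposedFix {
    public enum PartState { MOVING, OWNING, LOST }

    // Option 1: Crd keeps MOVING in the FullMap, so Node2's own loss
    // detection (no owner in the map) can fire.
    // Option 2: Crd sends an explicit "lost" marker alongside the FullMap.
    public static PartState stateOnNode2(List<PartState> fullMapStates,
                                         boolean explicitLostFlag) {
        if (explicitLostFlag)
            return PartState.LOST; // option 2: trust the coordinator's marker

        // option 1: no owner reported -> detect the loss independently
        return fullMapStates.contains(PartState.OWNING)
            ? PartState.MOVING : PartState.LOST;
    }

    public static void main(String[] args) {
        // Option 1: Crd reports MOVING instead of OWNING.
        System.out.println(stateOnNode2(List.of(PartState.MOVING, PartState.MOVING), false));

        // Option 2: explicit lost marker accompanies the FullMap.
        System.out.println(stateOnNode2(List.of(PartState.OWNING, PartState.MOVING), true));
    }
}
{code}

Either way, Node2 ends up with the partition in LOST state instead of rebalancing it from a node that never had the data.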


> Test testClientStartCloseServersRestart causes hang of the whole Cache 2 suite in master
> ----------------------------------------------------------------------------------------
>
>                 Key: IGNITE-11762
>                 URL: https://issues.apache.org/jira/browse/IGNITE-11762
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Ivan Rakov
>            Assignee: Pavel Kovalenko
>            Priority: Major
>             Fix For: 2.8
>
>
> Attempt to restart server node in test hangs:
> {code:java}
> [2019-04-16 19:56:45,049][WARN ][restart-1][GridCachePartitionExchangeManager] Failed to wait for initial partition map exchange. Possible reasons are:
> ^-- Transactions in deadlock.
> ^-- Long running transactions (ignore if this is the case).
> ^-- Unreleased explicit locks.
> {code}
> The reason is that previous PME (late affinity assignment) still hangs due to pending transaction:
> {code:java}
> [2019-04-16 19:56:23,717][WARN ][exchange-worker-#1039%cache.IgniteClientCacheStartFailoverTest3%][diagnostic] Pending transactions:
> [2019-04-16 19:56:23,718][WARN ][exchange-worker-#1039%cache.IgniteClientCacheStartFailoverTest3%][diagnostic] >>> [txVer=AffinityTopologyVersion [topVer=11, minorTopVer=0], exchWait=true, tx=GridDhtTxLocal [nearNodeId=8559bfe0-3d4a-4090-a457-6df0eba00005, nearFutId=1edc7172a61-941f9dde-2b60-4a1f-8213-7d23d738bf33, nearMiniId=1, nearFinFutId=null, nearFinMiniId=0, nearXidVer=GridCacheVersion [topVer=166913752, order=1555433759036, nodeOrder=6], lb=null, super=GridDhtTxLocalAdapter [nearOnOriginatingNode=false, nearNodes=KeySetView [], dhtNodes=KeySetView [9ef33532-0e4a-4561-b57e-042afe100000], explicitLock=false, super=IgniteTxLocalAdapter [completedBase=null, sndTransformedVals=false, depEnabled=false, txState=IgniteTxStateImpl [activeCacheIds=[-1062368467], recovery=false, mvccEnabled=true, mvccCachingCacheIds=[], txMap=HashSet []], super=IgniteTxAdapter [xidVer=GridCacheVersion [topVer=166913752, order=1555433759045, nodeOrder=10], writeVer=null, implicit=false, loc=true, threadId=1210, startTime=1555433762847, nodeId=0088e9b8-f859-4d14-8071-6388e4700003, startVer=GridCacheVersion [topVer=166913752, order=1555433759045, nodeOrder=10], endVer=null, isolation=REPEATABLE_READ, concurrency=PESSIMISTIC, timeout=0, sysInvalidate=false, sys=false, plc=2, commitVer=GridCacheVersion [topVer=166913752, order=1555433759045, nodeOrder=10], finalizing=NONE, invalidParts=null, state=MARKED_ROLLBACK, timedOut=false, topVer=AffinityTopologyVersion [topVer=11, minorTopVer=0], mvccSnapshot=MvccSnapshotResponse [futId=292, crdVer=1555433741506, cntr=395, opCntr=1, txs=[394], cleanupVer=390, tracking=0], skipCompletedVers=false, parentTx=null, duration=20866ms, onePhaseCommit=false], size=0]]]]
> {code}
> However, load threads don't start any explicit transactions: they either hang on put()/get() or on clientCache.close().
> Rolling back IGNITE-10799 resolves the issue (however, test remains flaky with ~10% fail rate due to unhandled TransactionSerializationException).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)