Posted to issues@geode.apache.org by "Bruce Schuchardt (JIRA)" <ji...@apache.org> on 2018/05/01 20:38:01 UTC
[jira] [Closed] (GEODE-5155) hang recovering transaction state for crashed server

     [ https://issues.apache.org/jira/browse/GEODE-5155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bruce Schuchardt closed GEODE-5155.
-----------------------------------

> hang recovering transaction state for crashed server
> ----------------------------------------------------
>
>                 Key: GEODE-5155
>                 URL: https://issues.apache.org/jira/browse/GEODE-5155
>             Project: Geode
>          Issue Type: Bug
>          Components: distributed lock service, transactions
>    Affects Versions: 1.7.0
>            Reporter: Bruce Schuchardt
>            Assignee: Bruce Schuchardt
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> A concourse job failed in DlockAndTxlockRegressionTest.testDLockProtectsAgainstTransactionConflict with two threads stuck in this state:
> {noformat}[vm2] "Pooled Waiting Message Processor 2" tid=0x71
> [vm2]     java.lang.Thread.State: WAITING
> [vm2] 	at java.lang.Object.wait(Native Method)
> [vm2] 	-  waiting on org.apache.geode.internal.cache.TXCommitMessage@2105ce6
> [vm2] 	at java.lang.Object.wait(Object.java:502)
> [vm2] 	at org.apache.geode.internal.cache.TXFarSideCMTracker.waitToProcess(TXFarSideCMTracker.java:176)
> [vm2] 	at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage.processTXOriginatorRecoveryMessage(TXOriginatorRecoveryProcessor.java:160)
> [vm2] 	at org.apache.geode.internal.cache.locks.TXOriginatorRecoveryProcessor$TXOriginatorRecoveryMessage$1.run(TXOriginatorRecoveryProcessor.java:144)
> [vm2] 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [vm2] 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [vm2] 	at org.apache.geode.distributed.internal.ClusterDistributionManager.runUntilShutdown(ClusterDistributionManager.java:1121)
> [vm2] 	at org.apache.geode.distributed.internal.ClusterDistributionManager.access$000(ClusterDistributionManager.java:109)
> [vm2] 	at org.apache.geode.distributed.internal.ClusterDistributionManager$6$1.run(ClusterDistributionManager.java:865)
> [vm2] 	at java.lang.Thread.run(Thread.java:748)
> {noformat}
> I modified the test to tighten up its forceDisconnect and performOps methods so that transaction recovery happens more reliably.
> {code}
>   public void forceDisconnect() throws Exception {
>     Cache existingCache = basicGetCache();
>     synchronized(commitLock) {
>       committing = false;
>       while (!committing) {
>         commitLock.wait();
>       }
>     }
>     if (existingCache != null && !existingCache.isClosed()) {
>       DistributedTestUtils.crashDistributedSystem(getCache().getDistributedSystem());
>     }
>   }
>   public void performOps() {
>     Cache cache = getCache();
>     Region region = cache.getRegion("TestRegion");
>     DistributedLockService dlockService = DistributedLockService.getServiceNamed("Bulldog");
>     Random random = new Random();
>     while (!cache.isClosed()) {
>       boolean locked = false;
>       try {
>         locked = dlockService.lock("testDLock", 500, 60_000);
>         if (!locked) {
>           // this could happen if we're starved out for 30sec by other VMs
>           continue;
>         }
>         cache.getCacheTransactionManager().begin();
>         region.put("TestKey", "TestValue" + random.nextInt(100000));
>         TXManagerImpl mgr = (TXManagerImpl) getCache().getCacheTransactionManager();
>         TXStateProxyImpl txProxy = (TXStateProxyImpl) mgr.getTXState();
>         TXState txState = (TXState) txProxy.getRealDeal(null, null);
>         txState.setBeforeSend(() -> {
>           synchronized(commitLock) {
>             committing = true;
>             commitLock.notifyAll();
>           }});
>         try {
>           cache.getCacheTransactionManager().commit();
>         } catch (CommitConflictException e) {
>           throw new RuntimeException("dlock failed to prevent a transaction conflict", e);
>         }
>         int txCount = getBlackboard().getMailbox(TRANSACTION_COUNT);
>         getBlackboard().setMailbox(TRANSACTION_COUNT, txCount + 1);
>       } catch (CancelException | IllegalStateException e) {
>         // okay to ignore
>       } finally {
>         if (locked) {
>           try {
>             dlockService.unlock("testDLock");
>           } catch (CancelException | IllegalStateException e) {
>             // shutting down
>           }
>         }
>       }
>     }
>   }
> {code}
> The problem is that the membership listener in TXCommitMessage removes itself from the transaction map in TXFarSideCMTracker without setting any state that the recovery message can check. The recovery method then keeps waiting for wasProcessed() to become true, like this:
> {code}
>     synchronized (this.txInProgress) {
>       mess = (TXCommitMessage) this.txInProgress.get(lk);
>     }
>     if (mess != null) {
>       synchronized (mess) {
>         // tx in progress, we must wait until it's done
>         while (!mess.wasProcessed()) {
>           try {
>             mess.wait();
>           } catch (InterruptedException ie) {
>             Thread.currentThread().interrupt();
>             logger.error(LocalizedMessage.create(
>                 LocalizedStrings.TxFarSideTracker_WAITING_TO_COMPLETE_ON_MESSAGE_0_CAUGHT_AN_INTERRUPTED_EXCEPTION,
>                 mess), ie);
>             break;
>           }
>         }
>       }
> {code}
> We could probably change this method so that it also checks that the message is still in the map, instead of only checking wasProcessed().
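
The suggested change can be sketched as below. This is only an illustration of the idea, not the fix that was committed: Message and Tracker are simplified, hypothetical stand-ins for TXCommitMessage and TXFarSideCMTracker, and removeOnDeparture, isTracked and the 100 ms bounded wait are assumptions made for the example. The wait loop exits when the message has been processed or when it is no longer in the map, and the removal path notifies any waiter.

```java
import java.util.HashMap;
import java.util.Map;

/** Illustrative sketch only; these are not the Geode classes. */
class RecoveryWaitSketch {

  /** Hypothetical stand-in for a commit message that may never be processed. */
  static final class Message {
    private boolean processed;

    synchronized boolean wasProcessed() {
      return processed;
    }

    synchronized void markProcessed() {
      processed = true;
      notifyAll();
    }
  }

  /** Hypothetical stand-in for the far-side tracker and its in-progress map. */
  static final class Tracker {
    private final Map<Object, Message> txInProgress = new HashMap<>();

    void add(Object lockId, Message mess) {
      synchronized (txInProgress) {
        txInProgress.put(lockId, mess);
      }
    }

    /** Assumed removal path (originator departed): drop the entry, then wake waiters. */
    void removeOnDeparture(Object lockId) {
      Message mess;
      synchronized (txInProgress) {
        mess = txInProgress.remove(lockId);
      }
      if (mess != null) {
        synchronized (mess) {
          mess.notifyAll();
        }
      }
    }

    private boolean isTracked(Object lockId, Message mess) {
      synchronized (txInProgress) {
        return txInProgress.get(lockId) == mess;
      }
    }

    /** Returns true if the message was processed, false if it was absent or removed first. */
    boolean waitToProcess(Object lockId) {
      Message mess;
      synchronized (txInProgress) {
        mess = txInProgress.get(lockId);
      }
      if (mess == null) {
        return false; // nothing in progress for this lock id
      }
      synchronized (mess) {
        // Wait only while the tx is unprocessed AND still present in the map.
        while (!mess.wasProcessed() && isTracked(lockId, mess)) {
          try {
            mess.wait(100); // bounded wait as an extra safeguard
          } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
            break;
          }
        }
        return mess.wasProcessed();
      }
    }
  }

  public static void main(String[] args) {
    Tracker tracker = new Tracker();
    tracker.add("lock-1", new Message());
    // Simulate the membership listener removing the entry without processing it.
    new Thread(() -> tracker.removeOnDeparture("lock-1")).start();
    System.out.println("processed = " + tracker.waitToProcess("lock-1"));
  }
}
```

With the original loop condition (only wasProcessed()), the call in main would never return once the entry had been removed unprocessed. In the sketch the map monitor is never held while acquiring the message monitor on the removal path, which avoids a lock-ordering problem with the waiter.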
