You are viewing a plain text version of this content. The canonical link for it is here.

Posted to dev@jackrabbit.apache.org by Ro...@sungard.com on 2012/08/27 12:29:23 UTC

Deadlock with a twist

Dear jackrabit-DEVs,

first of all, my apologies for this rather long post. But while investigating a deadlock scenario at a customer site we observed program state that should - according to the source code - not be (obviously) possible. We have an idea how this happened and I would kindly like to ask for either confirmation or correction.

The deadlock itself matches JCR-2820, which we already backported to the 1.6.4 codebase that's being used in our scenario (I know it's old, but it serves its purpose, generally).

But while trying to recreate the error condition to actually verify the fix, we discovered in the HEAP dump taken from the deadlocked system the following:
- the JR TX is about to be committed, already in the commit phase
- the actual thread performing the commit is taken from the app server's work manager thread pool and executing asynchronously (this is why JCR-2820 hit)
- while trying to send out post-commit notifications the thread gets blocked within FineGrainedISMLocking#acquireReadLock

The twist is: going back the invocation stack, event notifications get triggered from line 780 in SharedItemStateManager (see below). Being in this line, the write lock held during commit should already be downgraded (see lines 764 and 765).
However, a heap dump we got from the blocked application shows that the writeLock member of SharedItemStateManager still refers to a lock, the very one that prevents acquireReadLock from proceeding.
The write lock being held is the correct one, namely the one obtained during the previous prepare of the same TX.

All this happens on an IBM JVM on AIX. It happens very sporadically, between occurrences can be several days, but only minutes as well.

The only explanation I can think of so far is that some broad out-of-order execution might be happening (due to the JVMs JIT applying optimizations). Is there anything known? Or any obvious other explanation I'm missing?
I'm mostly concerned that the currently available fix might not cover such scenario and some further synchronization might be required.

Many thanks in advance and kind regards
Robert

[snip class="SharedItemStateManager"]
            ISMLocking.ReadLock readLock = null;
            try {
                // downgrade to read lock
764             readLock = writeLock.downgrade();
765             writeLock = null;

                // Let the shared item listeners know about the change
                // JCR-2171: This must happen after downgrading the lock!
                shared.persisted();

                /* notify virtual providers about node references */
                for (int i = 0; i < virtualNodeReferences.length; i++) {
                    ChangeLog virtualRefs = virtualNodeReferences[i];
                    if (virtualRefs != null) {
                        virtualProviders[i].setNodeReferences(virtualRefs);
                    }
                }

                /* dispatch the events */
780             events.dispatch();

                /* let listener know about finished operation */
                if (eventChannel != null) {
                    String path = events.getSession().getUserID() + "@" + events.getCommonPath();
                    eventChannel.updateCommitted(this, path);
                }
[/snip]

[snip]
3XMTHREADINFO      "[STUCK] ExecuteThread: '11' for queue: 'weblogic.kernel.Default (self-tuning)'" J9VMThread:0x0000010033796D00, j9thread_t:0x000001003630E380, java/lang/Thread:0x070000003A97B798, state:CW, prio=1
3XMTHREADINFO1            (native thread ID:0x69F00F1, native priority:0x1, native policy:UNKNOWN)
3XMTHREADINFO3           Java callstack:
4XESTACKTRACE                at java/lang/Object.wait(Native Method)
4XESTACKTRACE                at java/lang/Object.wait(Object.java:167(Compiled Code))
4XESTACKTRACE                at EDU/oswego/cs/dl/util/concurrent/Latch.acquire(Bytecode PC:22)
4XESTACKTRACE                at org/apache/jackrabbit/core/state/FineGrainedISMLocking.acquireReadLock(FineGrainedISMLocking.java:112(Compiled Code))
4XESTACKTRACE                at org/apache/jackrabbit/core/state/SharedItemStateManager.acquireReadLock(SharedItemStateManager.java:1822(Compiled Code))
4XESTACKTRACE                at org/apache/jackrabbit/core/state/SharedItemStateManager.getItemState(SharedItemStateManager.java:253(Compiled Code))
4XESTACKTRACE                at org/apache/jackrabbit/core/SearchManager$1.nextNodeState(SearchManager.java:450)
4XESTACKTRACE                at org/apache/jackrabbit/core/SearchManager$1.next(SearchManager.java:443)
4XESTACKTRACE                at org/apache/commons/collections/iterators/TransformIterator.next(TransformIterator.java:87(Compiled Code))
4XESTACKTRACE                at org/apache/commons/collections/IteratorUtils.toList(IteratorUtils.java:848(Compiled Code))
4XESTACKTRACE                at org/apache/commons/collections/IteratorUtils.toList(IteratorUtils.java:824)
4XESTACKTRACE                at org/apache/jackrabbit/core/query/lucene/SearchIndex.updateNodes(SearchIndex.java:588)
4XESTACKTRACE                at org/apache/jackrabbit/core/SearchManager.onEvent(SearchManager.java:486(Compiled Code))
4XESTACKTRACE                at org/apache/jackrabbit/core/observation/EventConsumer.consumeEvents(EventConsumer.java:244(Compiled Code))
4XESTACKTRACE                at org/apache/jackrabbit/core/observation/ObservationDispatcher.dispatchEvents(ObservationDispatcher.java:201(Compiled Code))
4XESTACKTRACE                at org/apache/jackrabbit/core/observation/EventStateCollection.dispatch(EventStateCollection.java:474)
4XESTACKTRACE                at org/apache/jackrabbit/core/observation/DelegatingObservationDispatcher.dispatch(DelegatingObservationDispatcher.java:127)
4XESTACKTRACE                at org/apache/jackrabbit/core/observation/DelegatingObservationDispatcher.dispatchEvents(DelegatingObservationDispatcher.java:99)
4XESTACKTRACE                at org/apache/jackrabbit/core/observation/EventStateCollection.dispatch(EventStateCollection.java:474(Compiled Code))
4XESTACKTRACE                at org/apache/jackrabbit/core/state/SharedItemStateManager$Update.end(SharedItemStateManager.java:780(Compiled Code))
4XESTACKTRACE                at org/apache/jackrabbit/core/state/XAItemStateManager.commit(XAItemStateManager.java:181(Compiled Code))
4XESTACKTRACE                at org/apache/jackrabbit/core/version/XAVersionManager.commit(XAVersionManager.java:521)
4XESTACKTRACE                at org/apache/jackrabbit/core/TransactionContext.commit(TransactionContext.java:205(Compiled Code))
4XESTACKTRACE                at org/apache/jackrabbit/core/XASessionImpl.commit(XASessionImpl.java:346)
4XESTACKTRACE                at org/apache/jackrabbit/jca/TransactionBoundXAResource.commit(TransactionBoundXAResource.java:39)
[/snip]


Robert Sauer . Chief Architect - Infinity . SunGard . Global Business Services and Technology . Solmsstraße 18, 60486 Frankfurt am Main, Germany
Tel +49 69 70768 572 . Mobile +49 176 1111 0181 . robert.sauer@sungard.com . www.sungard.com/infinity  

Join the online conversation with SunGard's customers, partners and industry experts and find an event near you at: www.sungard.com/ten. 
CONFIDENTIALITY: This e-mail (including any attachments) may contain confidential, proprietary and privileged information, and unauthorized disclosure or use is prohibited. If you receive this e-mail in error, please notify the sender and delete this e-mail from your system.

SunGard Systeme GmbH . Registered: Frankfurt am Main, Handelsregister Frankfurt HRB 56839 . Geschäftsführer: Harold Finders, Christian U. Haas, Henry Morton Miller Jr., Victoria E. Silbey

RE: Deadlock with a twist

Posted by Ro...@sungard.com.

Thanks, Claus, for responding.

Unfortunately, upgrading to 2.4.x is not an immediate option. But as I have backported several fixes to the 1.6.x baseline a solution seems to be in place already.

I'm mostly interested in fully understanding how these deadlock came into existence in the first place. Looking at the code, there is a fixed order of events, namely

1. Acquire write lock on Shared ISM during prepare.
2. Downgrading the write lock to a read lock during commit,
3. Broadcasting "update ended" events, e.g. to update the Lucene indexes.
3.1. Lucene acquires a read lock on Shared ISM.

Specifically, the order of steps 2. and 3. is fixed within the SISM.Update#end() method
However, from my analysis it seems sometimes (and this is really rare) step 3. is being executed while 2. is not yet effective. And this bends my mind.
Current revisions of SISM have lines reordered a little, but to me there is no clear indication that a similar situation will not occur with trunk as well.

I guess it all boils down to the question: does all this go back to a bug within the JVM or is there still some "happens before" indicator missing from the code.

During merge verification we have manually switched 2. and 3. and could verify that the fixes (specifically JCR-2820) are effective.

Thanks and kind regards
Robert

AW: Deadlock with a twist

Posted by KÖLL Claus <C....@TIROL.GV.AT>.

Hi Robert,

It's hard to help here because Jackrabbit 1.6.x a very old version as you said.
I have investigated a lot of time to fix deadlock bugs in xa environment. It would be great if you can upgrade to the coming soon 2.4.3 version
where a lot of bugs are fixed specialy with deadlocks in xa.

greets
claus