You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@activemq.apache.org by "Gary Tully (JIRA)" <ji...@apache.org> on 2016/08/15 10:42:20 UTC

[jira] [Resolved] (AMQ-5815) Frequency IOException cast dead lock while using kahadb in high concurrency environment

     [ https://issues.apache.org/jira/browse/AMQ-5815?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Gary Tully resolved AMQ-5815.
-----------------------------
       Resolution: Fixed
         Assignee: Gary Tully
    Fix Version/s: 5.15.0

two different issues here with similar IO failure scenarios.
note the default behaviour is to stop the broker in such an event via the IOExceptionHandler.
however it should be ok to proceed without deadlock.

It would be great to find the original jstack, because the restarted checkpoint thread is most probably attempting to get the write lock which is the trigger. In any event, having a global transaction store lock is a remnant of an old fix and is broken.

> Frequency IOException cast dead lock while using kahadb in high concurrency environment
> ---------------------------------------------------------------------------------------
>
>                 Key: AMQ-5815
>                 URL: https://issues.apache.org/jira/browse/AMQ-5815
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 5.10.0
>         Environment: Master/Slave AMQ + kahadb(NAS using NFS)
>            Reporter: Andrew
>            Assignee: Gary Tully
>             Fix For: 5.15.0
>
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> When the concurrency is in high level, iops more than 3000/s, we find that most of the amq connections are blocked.We can't either get or push any data through them unless restart.This issue appears *frequency*.
> There is many IOExceptions in the amq logs.
> {code}
> 2015-05-28 00:37:08,544 | ERROR | Checkpoint failed | org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Journal Checkpoint Worker
> java.io.IOException: Input/output error
>         at java.io.RandomAccessFile.write(Native Method)[:1.6.0_30]
>         at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)[:1.6.0_30]
>         at org.apache.activemq.util.RecoverableRandomAccessFile.writeLong(RecoverableRandomAccessFile.java:305)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.page.PageFile.writeBatch(PageFile.java:1062)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.page.PageFile.flush(PageFile.java:516)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1512)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.MessageDatabase$17.execute(MessageDatabase.java:1484)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.page.Transaction.execute(Transaction.java:779)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1481)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:929)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:357)[activemq-kahadb-store-5.10.0.jar:5.10.0]
> 2015-05-28 00:37:08,547 | INFO  | Ignoring IO exception, java.io.IOException: Input/output error | org.apache.activemq.util.DefaultIOExceptionHandler | ActiveMQ Journal Checkpoint Worker
> java.io.IOException: Input/output error
>         at java.io.RandomAccessFile.write(Native Method)[:1.6.0_30]
>         at java.io.RandomAccessFile.writeLong(RandomAccessFile.java:1001)[:1.6.0_30]
>         at org.apache.activemq.util.RecoverableRandomAccessFile.writeLong(RecoverableRandomAccessFile.java:305)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.page.PageFile.writeBatch(PageFile.java:1062)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.page.PageFile.flush(PageFile.java:516)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1512)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.MessageDatabase$17.execute(MessageDatabase.java:1484)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.page.Transaction.execute(Transaction.java:779)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.MessageDatabase.checkpointUpdate(MessageDatabase.java:1481)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.MessageDatabase.checkpointCleanup(MessageDatabase.java:929)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.MessageDatabase$3.run(MessageDatabase.java:357)[activemq-kahadb-store-5.10.0.jar:5.10.0]
> 2015-05-28 00:37:08,580 | INFO  | KahaDB: Recovering checkpoint thread after death | org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Transport: tcp:///10.40.50.31:58171@61616
> 2015-05-28 00:37:17,285 | INFO  | Journal failed while writing at: 2359276 | org.apache.activemq.store.kahadb.disk.journal.DataFileAppender | ActiveMQ Data File Writer
> 2015-05-28 00:37:17,287 | ERROR | KahaDB failed to store to Journal | org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Transport: tcp:///10.40.50.17:4487@61616
> java.io.IOException: Input/output error
>         at sun.nio.ch.FileChannelImpl.force0(Native Method)[:1.6.0_30]
>         at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:348)[:1.6.0_30]
>         at org.apache.activemq.util.RecoverableRandomAccessFile.sync(RecoverableRandomAccessFile.java:402)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.journal.DataFileAppender.processQueue(DataFileAppender.java:373)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.journal.DataFileAppender$1.run(DataFileAppender.java:193)[activemq-kahadb-store-5.10.0.jar:5.10.0]
> 2015-05-28 00:37:17,291 | ERROR | KahaDB failed to store to Journal | org.apache.activemq.store.kahadb.MessageDatabase | ActiveMQ Transport: tcp:///10.40.50.17:4442@61616
> java.io.IOException: Input/output error
>         at sun.nio.ch.FileChannelImpl.force0(Native Method)[:1.6.0_30]
>         at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:348)[:1.6.0_30]
>         at org.apache.activemq.util.RecoverableRandomAccessFile.sync(RecoverableRandomAccessFile.java:402)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.journal.DataFileAppender.processQueue(DataFileAppender.java:373)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.journal.DataFileAppender$1.run(DataFileAppender.java:193)[activemq-kahadb-store-5.10.0.jar:5.10.0]
> 2015-05-28 00:37:17,293 | INFO  | Ignoring IO exception, java.io.IOException: Input/output error | org.apache.activemq.util.DefaultIOExceptionHandler | ActiveMQ Transport: tcp:///10.40.50.17:4442@61616
> java.io.IOException: Input/output error
>         at sun.nio.ch.FileChannelImpl.force0(Native Method)[:1.6.0_30]
>         at sun.nio.ch.FileChannelImpl.force(FileChannelImpl.java:348)[:1.6.0_30]
>         at org.apache.activemq.util.RecoverableRandomAccessFile.sync(RecoverableRandomAccessFile.java:402)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.journal.DataFileAppender.processQueue(DataFileAppender.java:373)[activemq-kahadb-store-5.10.0.jar:5.10.0]
>         at org.apache.activemq.store.kahadb.disk.journal.DataFileAppender$1.run(DataFileAppender.java:193)[activemq-kahadb-store-5.10.0.jar:5.10.0]
> {code}
> With the aid of jstack ,we find dead lock(How can I upload the whole jtack log?)
> The blocked connections are all waiting for 0x00007f90500034e0
> {code}
> [01:08:01] "ActiveMQ Transport: tcp:///10.40.50.17:4479@61616" daemon prio=10 tid=0x00007f8f8c015000 nid=0x2ece waiting for monitor entry [0x00007f8f1f6e4000]
> [01:08:01]    java.lang.Thread.State: BLOCKED (on object monitor)
> [01:08:01]         at org.apache.activemq.transaction.LocalTransaction.rollback(LocalTransaction.java:94)
> [01:08:01]         - waiting to lock <0x00007f90500034e0> (a org.apache.activemq.store.kahadb.KahaDBTransactionStore)
> [01:08:01]         at org.apache.activemq.transaction.LocalTransaction.commit(LocalTransaction.java:74)
> [01:08:01]         at org.apache.activemq.broker.TransactionBroker.commitTransaction(TransactionBroker.java:253)
> [01:08:01]         at org.apache.activemq.broker.MutableBrokerFilter.commitTransaction(MutableBrokerFilter.java:112)
> [01:08:01]         at org.apache.activemq.broker.TransportConnection.processCommitTransactionOnePhase(TransportConnection.java:453)
> [01:08:01]         at org.apache.activemq.command.TransactionInfo.visit(TransactionInfo.java:100)
> [01:08:01]         at org.apache.activemq.broker.TransportConnection.service(TransportConnection.java:294)
> [01:08:01]         at org.apache.activemq.broker.TransportConnection$1.onCommand(TransportConnection.java:148)
> [01:08:01]         at org.apache.activemq.transport.MutexTransport.onCommand(MutexTransport.java:50)
> [01:08:01]         at org.apache.activemq.transport.WireFormatNegotiator.onCommand(WireFormatNegotiator.java:113)
> [01:08:01]         at org.apache.activemq.transport.AbstractInactivityMonitor.onCommand(AbstractInactivityMonitor.java:270)
> [01:08:01]         at org.apache.activemq.transport.TransportSupport.doConsume(TransportSupport.java:83)
> [01:08:01]         at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:214)
> [01:08:01]         at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:196)
> [01:08:01]         at java.lang.Thread.run(Thread.java:662)
> {code}
> Only one connection(Thread) get this lock.But it is waiting for another Read lock 0x00007f90530ed8d8
> {code}
> [01:08:01] "ActiveMQ Transport: tcp:///10.40.50.17:4484@61616" daemon prio=10 tid=0x00007f8fa8039800 nid=0x2ed2 waiting on condition [0x00007f8f1f2e0000]
> [01:08:01]    java.lang.Thread.State: WAITING (parking)
> [01:08:01]         at sun.misc.Unsafe.park(Native Method)
> [01:08:01]         - parking to wait for  <0x00007f90530ed8d8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> [01:08:01]         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> [01:08:01]         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
> [01:08:01]         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:941)
> [01:08:01]         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1261)
> [01:08:01]         at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594)
> [01:08:01]         at org.apache.activemq.store.kahadb.MessageDatabase.store(MessageDatabase.java:972)
> [01:08:01]         at org.apache.activemq.store.kahadb.MessageDatabase.store(MessageDatabase.java:958)
> [01:08:01]         at org.apache.activemq.store.kahadb.KahaDBTransactionStore.rollback(KahaDBTransactionStore.java:312)
> [01:08:01]         at org.apache.activemq.transaction.LocalTransaction.rollback(LocalTransaction.java:94)
> [01:08:01]         - locked <0x00007f90500034e0> (a org.apache.activemq.store.kahadb.KahaDBTransactionStore)
> [01:08:01]         at org.apache.activemq.transaction.LocalTransaction.commit(LocalTransaction.java:74)
> [01:08:01]         at org.apache.activemq.broker.TransactionBroker.commitTransaction(TransactionBroker.java:253)
> [01:08:01]         at org.apache.activemq.broker.MutableBrokerFilter.commitTransaction(MutableBrokerFilter.java:112)
> [01:08:01]         at org.apache.activemq.broker.TransportConnection.processCommitTransactionOnePhase(TransportConnection.java:453)
> [01:08:01]         at org.apache.activemq.command.TransactionInfo.visit(TransactionInfo.java:100)
> [01:08:01]         at org.apache.activemq.broker.TransportConnection.service(TransportConnection.java:294)
> [01:08:01]         at org.apache.activemq.broker.TransportConnection$1.onCommand(TransportConnection.java:148)
> [01:08:01]         at org.apache.activemq.transport.MutexTransport.onCommand(MutexTransport.java:50)
> [01:08:01]         at org.apache.activemq.transport.WireFormatNegotiator.onCommand(WireFormatNegotiator.java:113)
> [01:08:01]         at org.apache.activemq.transport.AbstractInactivityMonitor.onCommand(AbstractInactivityMonitor.java:270)
> [01:08:01]         at org.apache.activemq.transport.TransportSupport.doConsume(TransportSupport.java:83)
> [01:08:01]         at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:214)
> [01:08:01]         at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:196)
> [01:08:01]         at java.lang.Thread.run(Thread.java:662)
> {code}
> At last we find 0x00007f90530ed8d8 is in many write lock waiting.But none write lock hodes it. The dead lock occurs.
> {code}
> [01:07:56] "ActiveMQ Transport: tcp:///10.40.50.47:60701@61616" daemon prio=10 tid=0x00007f8f7cbc5000 nid=0x3ec8 waiting on condition [0x00007f900d864000]
> [01:07:56]    java.lang.Thread.State: WAITING (parking)
> [01:07:56]         at sun.misc.Unsafe.park(Native Method)
> [01:07:56]         - parking to wait for  <0x00007f90530ed8d8> (a java.util.concurrent.locks.ReentrantReadWriteLock$NonfairSync)
> [01:07:56]         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:156)
> [01:07:56]         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:811)
> [01:07:56]         at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireShared(AbstractQueuedSynchronizer.java:941)
> [01:07:56]         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireShared(AbstractQueuedSynchronizer.java:1261)
> [01:07:56]         at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.lock(ReentrantReadWriteLock.java:594)
> [01:07:56]         at org.apache.activemq.store.kahadb.MessageDatabase.store(MessageDatabase.java:972)
> [01:07:56]         at org.apache.activemq.store.kahadb.MessageDatabase.store(MessageDatabase.java:958)
> [01:07:56]         at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.removeMessage(KahaDBStore.java:453)
> [01:07:56]         at org.apache.activemq.store.kahadb.KahaDBStore$KahaDBMessageStore.removeAsyncMessage(KahaDBStore.java:407)
> [01:07:56]         at org.apache.activemq.store.kahadb.KahaDBTransactionStore.removeAsyncMessage(KahaDBTransactionStore.java:498)
> [01:07:56]         at org.apache.activemq.store.kahadb.KahaDBTransactionStore$1.removeAsyncMessage(KahaDBTransactionStore.java:185)
> [01:07:56]         at org.apache.activemq.broker.region.Queue.acknowledge(Queue.java:998)
> [01:07:56]         at org.apache.activemq.broker.region.Queue.removeMessage(Queue.java:1781)
> [01:07:56]         at org.apache.activemq.broker.region.QueueSubscription.acknowledge(QueueSubscription.java:63)
> [01:07:56]         at org.apache.activemq.broker.region.PrefetchSubscription.acknowledge(PrefetchSubscription.java:235)
> [01:07:56]         - locked <0x00007f905d10b1e8> (a java.lang.Object)
> [01:07:56]         at org.apache.activemq.broker.region.AbstractRegion.acknowledge(AbstractRegion.java:446)
> [01:07:56]         at org.apache.activemq.broker.region.RegionBroker.acknowledge(RegionBroker.java:461)
> [01:07:56]         at org.apache.activemq.broker.BrokerFilter.acknowledge(BrokerFilter.java:82)
> [01:07:56]         at org.apache.activemq.broker.TransactionBroker.acknowledge(TransactionBroker.java:277)
> [01:07:56]         at org.apache.activemq.broker.MutableBrokerFilter.acknowledge(MutableBrokerFilter.java:92)
> [01:07:56]         at org.apache.activemq.broker.TransportConnection.processMessageAck(TransportConnection.java:505)
> [01:07:56]         at org.apache.activemq.command.MessageAck.visit(MessageAck.java:245)
> [01:07:56]         at org.apache.activemq.broker.TransportConnection.service(TransportConnection.java:294)
> [01:07:56]         at org.apache.activemq.broker.TransportConnection$1.onCommand(TransportConnection.java:148)
> [01:07:56]         at org.apache.activemq.transport.MutexTransport.onCommand(MutexTransport.java:50)
> [01:07:56]         at org.apache.activemq.transport.WireFormatNegotiator.onCommand(WireFormatNegotiator.java:113)
> [01:07:56]         at org.apache.activemq.transport.AbstractInactivityMonitor.onCommand(AbstractInactivityMonitor.java:270)
> [01:07:56]         at org.apache.activemq.transport.TransportSupport.doConsume(TransportSupport.java:83)
> [01:07:56]         at org.apache.activemq.transport.tcp.TcpTransport.doRun(TcpTransport.java:214)
> [01:07:56]         at org.apache.activemq.transport.tcp.TcpTransport.run(TcpTransport.java:196)
> [01:07:56]         at java.lang.Thread.run(Thread.java:662)
> {code}
> In my opinion, IOExceptions may lead to losing persisten messages and so on ,but the broker shouldn't stop service.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)