You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@qpid.apache.org by "Lorenz Quack (JIRA)" <ji...@apache.org> on 2017/02/28 15:45:45 UTC

[jira] [Comment Edited] (QPID-7685) [Java Broker, BDB] AsyncRecovery and Queue#enqueue can contend for a BDB Lock potentially bringing down the broker

    [ https://issues.apache.org/jira/browse/QPID-7685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888198#comment-15888198 ] 

Lorenz Quack edited comment on QPID-7685 at 2/28/17 3:44 PM:
-------------------------------------------------------------

A workaround is increasing the {{je.lock.timeout}}.
However, one could argue that the root cause is that the Queues are sharing the DB.
Another potential solution could be to use {{READ_UNCOMMITTED}} in the Recoverer. We would have to check that there are no unintended consequences from such a change.


was (Author: lorenz.quack):
A workaround is increasing the {{je.lock.timeout}}.
However, one could argue that the root cause is that the Queues are sharing the DB.

> [Java Broker, BDB] AsyncRecovery and Queue#enqueue can contend for a BDB Lock potentially bringing down the broker
> ------------------------------------------------------------------------------------------------------------------
>
>                 Key: QPID-7685
>                 URL: https://issues.apache.org/jira/browse/QPID-7685
>             Project: Qpid
>          Issue Type: Bug
>    Affects Versions: 0.32, qpid-java-6.0.6, qpid-java-6.1.1, qpid-java-7.0
>            Reporter: Lorenz Quack
>            Priority: Critical
>
> This is the scenario:
> BDB message store.
> AsyncRecovery completes recovery of queue Q1. Queue Q2 is still being recovered.
> An transacted application sends a message to Q1 but does not commit the transaction. This ends up creating an entry in the {{DeliveryDb}} through the following call chain:
> > AMQChannel#deliverCurrentMessageIfComplete
> --> AbstractExchange#send
> ----> LocalTransaction#enqueue(List<? extends BaseQueue>, EnqueueableMessage, ServerTransaction.Action)
> ------> AbstractBDBMessageStore.BDBTransaction#enqueueMessage
> --------> AbstractBDBMessageStore#enqueueMessage
> ----------> getDeliveryDb().put
> The transaction will hold a write lock on this new DB entry until the transaction is committed.
> If it now happens that this entry appears immediately after the last entry of Q2 in the database the recovery code will try to read that entry in {{AbstractBDBMessageStore#visitMessageInstances(TransactionLogResource, MessageInstanceHandler)}}:
> {{cursor.getNext(key, value, LockMode.DEFAULT)}}
> Acquisition of the read lock will timeout after {{je.lock.timeout}} (default: 500ms) and after exhaustion of the retries (hard coded to 5) a {{StoreException}} will be thrown bringing down the broker with the following stack trace:
> {noformat}
> ########################################################################
> #
> # Unhandled Exception com.google.common.util.concurrent.UncheckedExecutionException: org.apache.qpid.server.store.StoreException: Cannot visit messages in Thread pool-2-thread-1
> #
> # Exiting
> #
> ########################################################################
> com.google.common.util.concurrent.UncheckedExecutionException: org.apache.qpid.server.store.StoreException: Cannot visit messages
> 	at com.google.common.util.concurrent.Futures.wrapAndThrowUnchecked(Futures.java:1358)
> 	at com.google.common.util.concurrent.Futures.getUnchecked(Futures.java:1344)
> 	at org.apache.qpid.server.virtualhost.AbstractVirtualHost$4.run(AbstractVirtualHost.java:1619)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:178)
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:292)
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 	at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.qpid.server.store.StoreException: Cannot visit messages
> 	at org.apache.qpid.server.store.berkeleydb.StandardEnvironmentFacade.handleDatabaseException(StandardEnvironmentFacade.java:369)
> 	at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.sleepOrThrowOnLockConflict(AbstractBDBMessageStore.java:665)
> 	at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:233)
> 	at org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.recoverQueue(AsynchronousMessageStoreRecoverer.java:163)
> 	at org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer.access$900(AsynchronousMessageStoreRecoverer.java:86)
> 	at org.apache.qpid.server.virtualhost.AsynchronousMessageStoreRecoverer$AsynchronousRecoverer$QueueRecoveringTask.run(AsynchronousMessageStoreRecoverer.java:436)
> 	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
> 	... 3 more
> Caused by: com.sleepycat.je.LockTimeoutException: (JE 5.0.104) Lock expired. Locker 2109822146 -1_Queue Recoverer : Q2 (vh: default)_ThreadLocker: waited for lock on database=QUEUE_ENTRIES LockAddr:530828558 LSN=0x0/0xa45c type=READ grant=WAIT_NEW timeoutMillis=500 startTime=1488294679798 endTime=1488294680298
> Owners: [<LockInfo locker="879351915 243_IoRcvr-/127.0.0.1:5672-/127.0.0.1:51619_Txn" type="WRITE"/>]
> Waiters: []
> 	at com.sleepycat.je.txn.LockManager.newLockTimeoutException(LockManager.java:665)
> 	at com.sleepycat.je.txn.LockManager.makeTimeoutMsgInternal(LockManager.java:623)
> 	at com.sleepycat.je.txn.SyncedLockManager.makeTimeoutMsg(SyncedLockManager.java:97)
> 	at com.sleepycat.je.txn.LockManager.lockInternal(LockManager.java:390)
> 	at com.sleepycat.je.txn.LockManager.lock(LockManager.java:276)
> 	at com.sleepycat.je.txn.BasicLocker.lockInternal(BasicLocker.java:118)
> 	at com.sleepycat.je.txn.Locker.lock(Locker.java:443)
> 	at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2642)
> 	at com.sleepycat.je.dbi.CursorImpl.lockLN(CursorImpl.java:2443)
> 	at com.sleepycat.je.dbi.CursorImpl.fetchCurrent(CursorImpl.java:2272)
> 	at com.sleepycat.je.dbi.CursorImpl.getCurrentAlreadyLatched(CursorImpl.java:1480)
> 	at com.sleepycat.je.dbi.CursorImpl.getNext(CursorImpl.java:1607)
> 	at com.sleepycat.je.Cursor.searchInternal(Cursor.java:2898)
> 	at com.sleepycat.je.Cursor.searchAllowPhantoms(Cursor.java:2732)
> 	at com.sleepycat.je.Cursor.searchNoDups(Cursor.java:2586)
> 	at com.sleepycat.je.Cursor.search(Cursor.java:2553)
> 	at com.sleepycat.je.Cursor.getSearchKeyRange(Cursor.java:1757)
> 	at org.apache.qpid.server.store.berkeleydb.AbstractBDBMessageStore.visitMessageInstances(AbstractBDBMessageStore.java:203)
> 	... 8 more
> {noformat}
> I believe the call to {{cursor.getSearchKeyRange(key,value, LockMode.DEFAULT)}} has the same problem just under different circumstances.
> I have not checked whether other similar defects exist.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@qpid.apache.org
For additional commands, e-mail: dev-help@qpid.apache.org