Posted to issues@activemq.apache.org by "Volker Kleinschmidt (JIRA)" <ji...@apache.org> on 2015/10/13 06:16:05 UTC
[jira] [Comment Edited] (AMQ-6005) Slave broker startup corrupts shared PList storage

    [ https://issues.apache.org/jira/browse/AMQ-6005?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14954334#comment-14954334 ] 

Volker Kleinschmidt edited comment on AMQ-6005 at 10/13/15 4:15 AM:
--------------------------------------------------------------------

Not yet. We have an AMQ 5.12 upgrade in the pipeline, but it is not in production use yet, and this problem only reproduces in production: you need to generate quite a few asynchronous topic messages before tmp_storage gets used at all. While there are only a few messages, they all stay in memory and the issue doesn't arise. You also need a multi-server environment, so this isn't easily reproduced in the lab.

However, the relevant code has not changed in 5.12; I've verified that. It's all in one class, and it should be easy enough to follow using the outline in the ticket description. I've added some additional detail there to make it easier to follow.



> Slave broker startup corrupts shared PList storage
> --------------------------------------------------
>
>                 Key: AMQ-6005
>                 URL: https://issues.apache.org/jira/browse/AMQ-6005
>             Project: ActiveMQ
>          Issue Type: Bug
>          Components: KahaDB
>    Affects Versions: 5.7.0, 5.10.0
>         Environment: RHLinux6
>            Reporter: Volker Kleinschmidt
>
> h4. Background
> When multiple JVMs run AMQ in a master/slave configuration with the broker directory in a shared filesystem location (as is required e.g. for kahaPersistence), and when due to high message volume or slow producers the broker's memory needs exceed the configured memory usage limit, AMQ will overflow asynchronous messages to a PList store inside the "tmp_storage" subdirectory of said shared broker directory.
> h4. Issue
> We frequently observed this tmpDB store getting corrupted with "stale NFS filehandle" errors for tmpDB.data, tmpDB.redo, and some journal files, all of which suddenly went missing from the tmp_storage folder. This puts the entire broker into a bad state from which it cannot recover. Only restarting the service (which causes a broker slave to take over and loses the yet-undelivered messages) gets a working state back.
> h4. Symptoms
> Stack trace:
> {noformat}
> ...
> Caused by: java.io.IOException: Stale file handle
> 	at java.io.RandomAccessFile.readBytes0(Native Method)
> 	at java.io.RandomAccessFile.readBytes(RandomAccessFile.java:350)
> 	at java.io.RandomAccessFile.read(RandomAccessFile.java:385)
> 	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:444)
> 	at java.io.RandomAccessFile.readFully(RandomAccessFile.java:424)
> 	at org.apache.kahadb.page.PageFile.readPage(PageFile.java:876)
> 	at org.apache.kahadb.page.Transaction$2.readPage(Transaction.java:446)
> 	at org.apache.kahadb.page.Transaction$2.<init>(Transaction.java:437)
> 	at org.apache.kahadb.page.Transaction.openInputStream(Transaction.java:434)
> 	at org.apache.kahadb.page.Transaction.load(Transaction.java:410)
> 	at org.apache.kahadb.page.Transaction.load(Transaction.java:367)
> 	at org.apache.kahadb.index.ListIndex.loadNode(ListIndex.java:306)
> 	at org.apache.kahadb.index.ListIndex.getHead(ListIndex.java:99)
> 	at org.apache.kahadb.index.ListIndex.iterator(ListIndex.java:284)
> 	at org.apache.activemq.store.kahadb.plist.PList$PListIterator.<init>(PList.java:199)
> 	at org.apache.activemq.store.kahadb.plist.PList.iterator(PList.java:189)
> 	at org.apache.activemq.broker.region.cursors.FilePendingMessageCursor$DiskIterator.<init>(FilePendingMessageCursor.java:496)
> {noformat}
> h4. Cause
> During BrokerThread startup, the BrokerService.startPersistenceAdapter() method is called. Via doStartPersistenceAdapter() and getProducerSystemUsage() it invokes getSystemUsage(), which calls getTempDataStore(), and that method summarily cleans out the existing contents of the tmp_storage directory.
> All of this happens *before* the broker lock is obtained in the PersistenceAdapter.start() method at the end of doStartPersistenceAdapter().
> So a JVM that doesn't get to be the broker (because there already is one) and runs in slave mode (waiting to obtain the broker lock) interferes with and corrupts the running broker's tmp_storage, and thus breaks the broker. That's a critical bug. The slave has no business starting up the persistence adapter and cleaning out data: it hasn't obtained the lock yet, so it isn't allowed to do any work, period.
> h4. Workaround
> As a workaround, an unshared local directory needs to be specified as tempDirectory for the broker, even if the main broker directory is shared. Also, since broker startup will clear out tmp_storage anyway, there really is no advantage to having it in a shared location: the next broker that starts up after a broker failure will never re-use that data anyway.
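
The ordering problem described under "Cause" can be illustrated with a small standalone sketch. This is not ActiveMQ code: the class StartupOrderSketch and its methods are hypothetical names, an AtomicBoolean stands in for the broker lock on the shared filesystem, and a temporary directory stands in for tmp_storage. It only shows why cleaning before taking the lock lets a would-be slave delete the master's files, while taking the lock first does not.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical illustration only; none of these names exist in ActiveMQ.
class StartupOrderSketch {

    // Deletes every entry directly inside tmpStorage (the sketch assumes plain files only).
    static void cleanTmpStorage(Path tmpStorage) throws IOException {
        try (DirectoryStream<Path> entries = Files.newDirectoryStream(tmpStorage)) {
            for (Path entry : entries) {
                Files.delete(entry);
            }
        }
    }

    // Stand-in for the broker lock: succeeds only if nobody holds it yet.
    static boolean tryBrokerLock(AtomicBoolean lockHeld) {
        return lockHeld.compareAndSet(false, true);
    }

    // Order described in the ticket: the temp store is cleaned BEFORE the lock attempt.
    static boolean startCleanFirst(Path tmpStorage, AtomicBoolean lockHeld) throws IOException {
        cleanTmpStorage(tmpStorage);
        return tryBrokerLock(lockHeld);
    }

    // Safer order: nothing is touched unless the lock was obtained.
    static boolean startLockFirst(Path tmpStorage, AtomicBoolean lockHeld) throws IOException {
        if (!tryBrokerLock(lockHeld)) {
            return false; // stay in slave mode, leave the master's files alone
        }
        cleanTmpStorage(tmpStorage);
        return true;
    }

    public static void main(String[] args) throws IOException {
        Path tmpStorage = Files.createTempDirectory("tmp_storage");
        Path masterFile = Files.createFile(tmpStorage.resolve("tmpDB.data"));
        AtomicBoolean lockHeld = new AtomicBoolean(true); // a master is already running

        boolean gotLock = startLockFirst(tmpStorage, lockHeld);
        System.out.println("lock-first:  got lock=" + gotLock
                + ", master file present=" + Files.exists(masterFile));

        gotLock = startCleanFirst(tmpStorage, lockHeld);
        System.out.println("clean-first: got lock=" + gotLock
                + ", master file present=" + Files.exists(masterFile));
    }
}
```

In both cases the second JVM fails to get the lock, but only the clean-first order removes the file the running master still depends on.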


