Posted to issues@activemq.apache.org by "Justin Bertram (Jira)" <ji...@apache.org> on 2020/08/10 20:02:00 UTC
[jira] [Commented] (ARTEMIS-2808) Artemis HA with shared storage strategy does not reconnect with shared storage if reconnection happens at shared storage
[ https://issues.apache.org/jira/browse/ARTEMIS-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175032#comment-17175032 ]
Justin Bertram commented on ARTEMIS-2808:
-----------------------------------------
Thanks for the logs and thread dumps. From what I can tell, the broker is behaving reasonably, although the outcome clearly isn't what's expected or desired.
In scenario #1 the backup broker is happily waiting to get the lock on the journal. This thread is in the thread dumps before and after NFS goes down:
{noformat}
"AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=5451042e-b0c6-11ea-80b3-005056979868" #32 prio=5 os_prio=0 tid=0x000000001a439800 nid=0x9ac waiting on condition [0x000000001ca1f000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
at java.lang.Thread.sleep(Native Method)
at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:403)
at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:188)
at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77)
at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3907)
{noformat}
And the log contains entries like this once NFS goes down:
{noformat}
2020-06-23 08:29:02,362 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: An unexpected network error occurred
at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_251]
at sun.nio.ch.FileDispatcherImpl.lock(Unknown Source) [rt.jar:1.8.0_251]
at sun.nio.ch.FileChannelImpl.tryLock(Unknown Source) [rt.jar:1.8.0_251]
at java.nio.channels.FileChannel.tryLock(Unknown Source) [rt.jar:1.8.0_251]
at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:351) [artemis-server-2.13.0.jar:2.13.0]
at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:370) [artemis-server-2.13.0.jar:2.13.0]
at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:188) [artemis-server-2.13.0.jar:2.13.0]
at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77) [artemis-server-2.13.0.jar:2.13.0]
at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3907) [artemis-server-2.13.0.jar:2.13.0]
{noformat}
Unfortunately, it appears that even after NFS is restored the broker still encounters those exceptions, which indicates that something is going wrong at the JVM/OS level. It looks like the file handle for the lock is stale once NFS is restored. It may be necessary for the broker to re-create the file handle from scratch.
That same basic thing appears to be happening in scenario #2 as well.
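To make the idea of re-creating the file handle concrete, here is a minimal sketch using only standard java.nio calls. It is not the code in FileLockNodeManager and not a proposed patch: the class name ReopeningLock, the method acquire, and its parameters are hypothetical. It also assumes that closing and re-opening the FileChannel is enough to get rid of a stale NFS handle, which has not been verified against this environment.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper for illustration only; not part of Artemis. */
final class ReopeningLock {

    /**
     * Polls for an exclusive lock on lockFile. If the attempt fails with an
     * IOException, the channel is closed and discarded so that the next
     * attempt opens a fresh file handle instead of reusing the old one.
     */
    static FileLock acquire(Path lockFile, int maxAttempts, long pauseMillis)
            throws IOException, InterruptedException {
        FileChannel channel = null;
        IOException lastFailure = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                if (channel == null) {
                    // (Re)open the handle from scratch.
                    channel = FileChannel.open(lockFile,
                            StandardOpenOption.CREATE,
                            StandardOpenOption.READ,
                            StandardOpenOption.WRITE);
                }
                FileLock lock = channel.tryLock();
                if (lock != null) {
                    // Caller releases with lock.release() and lock.channel().close().
                    return lock;
                }
                // null means another process holds the lock; the handle is
                // still usable, so keep it and poll again.
            } catch (IOException e) {
                lastFailure = e;
                if (channel != null) {
                    try {
                        channel.close();
                    } catch (IOException ignored) {
                        // The handle is being thrown away regardless.
                    }
                    channel = null; // force a re-open on the next attempt
                }
            }
            Thread.sleep(pauseMillis);
        }
        if (channel != null) {
            channel.close();
        }
        throw new IOException("Could not lock " + lockFile + " after "
                + maxAttempts + " attempts", lastFailure);
    }
}
```

The distinction that matters here is between tryLock() returning null (another broker holds the lock, so keep waiting on the same handle) and tryLock() throwing an IOException (the handle itself may be bad, so discard it before retrying).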
> Artemis HA with shared storage strategy does not reconnect with shared storage if reconnection happens at shared storage
> ------------------------------------------------------------------------------------------------------------------------
>
> Key: ARTEMIS-2808
> URL: https://issues.apache.org/jira/browse/ARTEMIS-2808
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 2.11.0
> Environment: Windows 10
> Reporter: Karan Aggarwal
> Priority: Blocker
> Attachments: Scenario_1.zip, Scenario_2.zip
>
>
> We verified the behavior of Artemis HA by bringing down the shared storage (VM) while run is in progress and here is the observation:
> *Scenario:*
> * When Artemis services are up and running and run is in progress we restarted the machine hosting the shared storage
> * Shared storage was back up in 5 mins
> * Both Artemis master and slave did not connect back to the shared storage
> * We tried stopping the Artemis brokers. The slave stopped, but the master did not stop. We had to kill the process.
> * We tried to start the Artemis brokers. The master did not start up at all. The slave started successfully.
> * We restarted the master Artemis server. Server started successfully and acquired back up.
> Shared Storage type: NFS
> Impact: The run is stopped and the Artemis servers need to be started again every time the shared storage connection goes down momentarily.