Posted to issues@activemq.apache.org by "Justin Bertram (Jira)" <ji...@apache.org> on 2020/08/10 20:02:00 UTC

[jira] [Commented] (ARTEMIS-2808) Artemis HA with shared storage strategy does not reconnect with shared storage if reconnection happens at shared storage

    [ https://issues.apache.org/jira/browse/ARTEMIS-2808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17175032#comment-17175032 ] 

Justin Bertram commented on ARTEMIS-2808:
-----------------------------------------

Thanks for the logs and thread dumps. From what I can tell, the broker is behaving reasonably, although the outcome clearly isn't what's expected or desired.

In scenario #1 the backup broker is happily waiting to get the lock on the journal. This thread is present in the thread dumps both before and after NFS goes down:
{noformat}
"AMQ229000: Activation for server ActiveMQServerImpl::serverUUID=5451042e-b0c6-11ea-80b3-005056979868" #32 prio=5 os_prio=0 tid=0x000000001a439800 nid=0x9ac waiting on condition [0x000000001ca1f000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
	at java.lang.Thread.sleep(Native Method)
	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:403)
	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:188)
	at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77)
	at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3907)
{noformat}
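For reference, those frames correspond to a polling loop: tryLock is attempted, failures are logged, and the thread sleeps between attempts (hence the TIMED_WAITING state). A minimal sketch of that kind of loop, inferred from the stack trace (the method signature and retry interval are illustrative, not the actual Artemis source):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Simplified sketch of the retry loop the stack trace suggests:
// tryLock() is polled, IOExceptions are logged and retried, and the
// thread sleeps between attempts (hence TIMED_WAITING in the dump).
public class LockRetrySketch {
    public static FileLock lock(FileChannel channel, long retryMillis)
            throws InterruptedException {
        while (true) {
            try {
                FileLock lock = channel.tryLock(); // non-blocking attempt
                if (lock != null) {
                    return lock;                   // acquired: activation proceeds
                }
            } catch (IOException e) {
                // With NFS down this throws e.g. "An unexpected network error
                // occurred"; the loop keeps retrying on the same channel.
                System.err.println("Failure when accessing a lock file: " + e);
            }
            Thread.sleep(retryMillis);
        }
    }
}
```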
And the log contains entries like this once NFS goes down:
{noformat}
2020-06-23 08:29:02,362 DEBUG [org.apache.activemq.artemis.core.server.impl.FileLockNodeManager] Failure when accessing a lock file: java.io.IOException: An unexpected network error occurred
	at sun.nio.ch.FileDispatcherImpl.lock0(Native Method) [rt.jar:1.8.0_251]
	at sun.nio.ch.FileDispatcherImpl.lock(Unknown Source) [rt.jar:1.8.0_251]
	at sun.nio.ch.FileChannelImpl.tryLock(Unknown Source) [rt.jar:1.8.0_251]
	at java.nio.channels.FileChannel.tryLock(Unknown Source) [rt.jar:1.8.0_251]
	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.tryLock(FileLockNodeManager.java:351) [artemis-server-2.13.0.jar:2.13.0]
	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.lock(FileLockNodeManager.java:370) [artemis-server-2.13.0.jar:2.13.0]
	at org.apache.activemq.artemis.core.server.impl.FileLockNodeManager.awaitLiveNode(FileLockNodeManager.java:188) [artemis-server-2.13.0.jar:2.13.0]
	at org.apache.activemq.artemis.core.server.impl.SharedStoreBackupActivation.run(SharedStoreBackupActivation.java:77) [artemis-server-2.13.0.jar:2.13.0]
	at org.apache.activemq.artemis.core.server.impl.ActiveMQServerImpl$ActivationThread.run(ActiveMQServerImpl.java:3907) [artemis-server-2.13.0.jar:2.13.0]
{noformat}
Unfortunately, it appears that even after NFS is restored the broker still encounters those exceptions, which indicates that something is going wrong at the JVM/OS level. It looks like the file handle for the lock becomes stale when NFS goes away and is never refreshed. It may be necessary for the broker to re-create the file handle from scratch.
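If that's the case, a fix would likely involve closing and reopening the lock file whenever tryLock fails with an IOException, so the next attempt uses a fresh handle. A hypothetical sketch of that recovery (class and method names are illustrative; this is not what the broker currently does):

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;

// Hypothetical recovery sketch: on IOException, discard the possibly stale
// channel and reopen the lock file so the next attempt uses a fresh handle.
public class ReopeningFileLock {
    private final File lockFile;
    private FileChannel channel;

    public ReopeningFileLock(File lockFile) throws IOException {
        this.lockFile = lockFile;
        this.channel = open();
    }

    private FileChannel open() throws IOException {
        return new RandomAccessFile(lockFile, "rw").getChannel();
    }

    // Returns the lock if acquired, or null if the caller should retry.
    public FileLock tryLock() throws IOException {
        try {
            return channel.tryLock();
        } catch (IOException e) {
            // Likely a stale NFS handle: re-create the handle from scratch
            // instead of retrying forever on the dead one.
            try { channel.close(); } catch (IOException ignore) { }
            channel = open();
            return null;
        }
    }
}
```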

That same basic thing appears to be happening in scenario #2 as well.


> Artemis HA with shared storage strategy does not reconnect with shared storage if reconnection happens at shared storage
> ------------------------------------------------------------------------------------------------------------------------
>
>                 Key: ARTEMIS-2808
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2808
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.11.0
>         Environment: Windows 10
>            Reporter: Karan Aggarwal
>            Priority: Blocker
>         Attachments: Scenario_1.zip, Scenario_2.zip
>
>
> We verified the behavior of Artemis HA by bringing down the shared storage (VM) while a run was in progress. Here is what we observed: 
> *Scenario:*
>  * While the Artemis services were up and running and a run was in progress, we restarted the machine hosting the shared storage
>  * The shared storage was back up within 5 minutes
>  * Neither the Artemis master nor the slave reconnected to the shared storage
>  * We tried stopping the Artemis brokers. The slave stopped, but the master did not stop. We had to kill the process.
>  * We tried to start the Artemis brokers. The master did not start up at all. The slave started successfully.
>  * We restarted the master Artemis server. The server started successfully and re-acquired its backup.
> Shared Storage type: NFS
> Impact: The run stops, and the Artemis servers need to be restarted every time the shared-storage connection goes down, even momentarily.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)