Posted to issues@activemq.apache.org by "Johannes F. Knauf (JIRA)" <ji...@apache.org> on 2017/01/16 14:03:26 UTC

[jira] [Created] (AMQ-6564) HA: Slow Failover with AMQ + mKahaDb in Master/Slave setup with shared filesystem

Johannes F. Knauf created AMQ-6564:
--------------------------------------

             Summary: HA: Slow Failover with AMQ + mKahaDb in Master/Slave setup with shared filesystem
                 Key: AMQ-6564
                 URL: https://issues.apache.org/jira/browse/AMQ-6564
             Project: ActiveMQ
          Issue Type: Bug
          Components: KahaDB
    Affects Versions: 5.14.3
            Reporter: Johannes F. Knauf


Consider the following scenario:
* AMQ Host A and Host B are configured exactly the same
* Host A and Host B share a common filesystem storage for their (m)kahadb in order to create HA as described in http://activemq.apache.org/shared-file-system-master-slave.html 
* The brokers run under high traffic, so at any point in time a substantial number of messages is still pending in each queue

Expected:
Given Host A is the current master and Host B is polling for the lock every 10 seconds (default),
when Host A goes down,
then Host B should be able to serve producer enqueue requests after at most 10 seconds plus a few microseconds.
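
To make the expected timing concrete, here is a minimal sketch of the polling behaviour, assuming the slave becomes master by repeatedly trying to take an exclusive lock on a file in the shared directory. LockPoller and acquire are hypothetical names for illustration only; this is not ActiveMQ's actual locker code.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.channels.OverlappingFileLockException;

// Hypothetical helper, not part of ActiveMQ.
final class LockPoller {

    /**
     * Tries to take an exclusive lock on the channel's file, sleeping
     * pollMillis between attempts. Returns the number of failed attempts.
     * The lock stays held until the channel is closed.
     */
    static int acquire(FileChannel channel, long pollMillis)
            throws IOException, InterruptedException {
        int failedAttempts = 0;
        while (true) {
            FileLock lock;
            try {
                lock = channel.tryLock(); // null if another process holds the lock
            } catch (OverlappingFileLockException e) {
                lock = null;              // lock is held elsewhere in this JVM
            }
            if (lock != null) {
                return failedAttempts;
            }
            failedAttempts++;
            Thread.sleep(pollMillis);
        }
    }
}
```

With pollMillis set to 10000, the worst-case delay between the master releasing the lock and the slave obtaining it is one polling interval. Under the assumptions of this sketch, any unavailability beyond that comes from the work done after the lock is obtained.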

Reality:
Host B needs to replay all journals before it can accept new messages again. This can take a long time, especially if consistency checks are required. This means Master/Slave with a shared FS does not really provide HA.

It is perfectly understandable that failover takes that long for consumers. They can only resume receiving messages once all journals have been read; otherwise the order of messages would be broken.

For producers this is not the case, as AMQ could simply create a fresh journal file and start appending to it immediately. Am I wrong?

It also seems that the individual KahaDB instances in an mKahaDB are checked in sequence, so that in the worst case even queues holding few messages are unavailable until everything has been checked completely.

Long unavailability for producers is unacceptable in most scenarios. It means that all producing clients have to put serious effort into protecting against these scenarios in order not to lose messages (buffering, etc.). Or is there a best-practice workaround?
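
To illustrate the kind of client-side buffering meant here, the following is a minimal sketch, assuming the producer keeps unsent messages in a bounded in-memory queue and retries them in order once the broker accepts sends again. Sender and BufferingProducer are hypothetical names; in a real client the Sender would wrap a JMS MessageProducer.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical send hook, standing in for a real JMS producer.
interface Sender {
    void send(String message) throws Exception;
}

// Hypothetical buffer that holds messages while the broker is unavailable.
final class BufferingProducer {
    private final Deque<String> pending = new ArrayDeque<>();
    private final Sender sender;
    private final int capacity;

    BufferingProducer(Sender sender, int capacity) {
        this.sender = sender;
        this.capacity = capacity;
    }

    /** Buffers the message and attempts delivery; returns false if the buffer is full. */
    synchronized boolean offer(String message) {
        if (pending.size() >= capacity) {
            return false; // caller must decide: block, drop, or fail
        }
        pending.addLast(message);
        flush();
        return true;
    }

    /** Sends buffered messages in order, stopping at the first failure. Returns the number still pending. */
    synchronized int flush() {
        while (!pending.isEmpty()) {
            try {
                sender.send(pending.peekFirst());
            } catch (Exception e) {
                break; // broker still unavailable; keep the message for a later retry
            }
            pending.removeFirst();
        }
        return pending.size();
    }
}
```

A buffer like this only survives as long as the client process does, and a send that fails after the broker has already stored the message can lead to a duplicate on retry, which is part of why pushing this burden onto every producer is unattractive.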


