Posted to users@activemq.apache.org by JasonHs <hs...@gmail.com> on 2016/11/24 04:45:57 UTC

Slow failover from primary to backup server

Hi all,

I'm running a 2-node Artemis cluster in replication mode. Everything is
running as it should, but we noticed a variance in the time it takes for
the 'backup' server to become 'live' when the primary server goes down.
Failover can take as little as 10~20 seconds or up to a few minutes.

I tried changing quite a few cluster-connection settings according to the
documentation, but had no luck so far.

Also, is it recommended to run a 3-node cluster to avoid a split-brain
scenario, where both nodes in a 2-node cluster think they should be 'live'
during a temporary network outage?

Thanks in advance,
Jason



--
View this message in context: http://activemq.2283324.n4.nabble.com/Slow-failover-from-primary-to-backup-server-tp4719474.html
Sent from the ActiveMQ - User mailing list archive at Nabble.com.

Re: Slow failover from primary to backup server

Posted by boris_snp <bo...@spglobal.com>.

I just posted a scenario to the group; I think it is related to 2 brokers
being "live":
http://activemq.2283324.n4.nabble.com/2-broker-clusetr-both-brokers-are-live-td4730975.html;cid=1506091656365-514



--
Sent from: http://activemq.2283324.n4.nabble.com/ActiveMQ-User-f2341805.html

Re: Slow failover from primary to backup server

Posted by andytaylor <an...@gmail.com>.

I've raised a PR with some fixes; see
https://github.com/apache/activemq-artemis/pull/901. This should make
failover quicker in some scenarios. You can also make some configuration
changes like so:

Configure the cluster connection to time out quickly:

            <check-period>2500</check-period>
            <connection-ttl>5000</connection-ttl>

And also configure the connectors used in the cluster to have a quicker
connect timeout:

<connector name="artemis">tcp://backup1:61616?connect-timeout-millis=2000</connector>
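
For context, here is a rough sketch of where these settings might sit in broker.xml. Only check-period, connection-ttl and the "artemis" connector URL come from the snippets above; the cluster-connection name "my-cluster", the "self" connector, the host "live1" and the use of static-connectors are illustrative assumptions and will differ per deployment.

```xml
<core xmlns="urn:activemq:core">
   <connectors>
      <!-- Hypothetical connector this broker advertises to the cluster -->
      <connector name="self">tcp://live1:61616</connector>
      <!-- Connector from the message above, with the shorter connect timeout -->
      <connector name="artemis">tcp://backup1:61616?connect-timeout-millis=2000</connector>
   </connectors>

   <cluster-connections>
      <!-- "my-cluster" is a placeholder name -->
      <cluster-connection name="my-cluster">
         <connector-ref>self</connector-ref>
         <!-- How often (ms) the cluster connection checks that data has arrived -->
         <check-period>2500</check-period>
         <!-- How long (ms) a cluster connection stays alive without receiving data -->
         <connection-ttl>5000</connection-ttl>
         <!-- Assumes a statically configured cluster rather than discovery -->
         <static-connectors>
            <connector-ref>artemis</connector-ref>
         </static-connectors>
      </cluster-connection>
   </cluster-connections>
</core>
```

With these values a dead peer should be noticed within a few seconds rather than after a much longer default TTL, at the cost of being less tolerant of brief network stalls.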

I'm still investigating more improvements in this area, including:

1. Allowing the quorum size to be determined in different ways: statically
configured, the max cluster size, or the current cluster size.
2. Allowing the master to vote for a quorum if replication fails, and shut
down if it can't get a majority (configurable).
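
For readers on later releases: the Artemis 2.x high-availability documentation describes replication settings that resemble the two items above. The fragment below is a sketch based on that documentation, not on anything stated in this thread. Treat the element names as assumptions to verify against the docs for the version in use; the quorum size of 2 is purely illustrative.

```xml
<ha-policy>
   <replication>
      <master>
         <!-- Assumed from later 2.x docs: the live broker votes when replication is lost -->
         <vote-on-replication-failure>true</vote-on-replication-failure>
         <!-- Assumed from later 2.x docs: statically configured quorum size (illustrative value) -->
         <quorum-size>2</quorum-size>
      </master>
   </replication>
</ha-policy>
```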

If anyone else has any scenarios I would be happy to try and include them
as part of this work.

Andy

On 25 November 2016 at 21:59, JasonHs [via ActiveMQ] <
ml-node+s2283324n4719501h11@n4.nabble.com> wrote:

> Thanks for confirming this Andy, look forward to the new release/update.
>
> Jason

Re: Slow failover from primary to backup server

Posted by JasonHs <hs...@gmail.com>.

Thanks for confirming this Andy, look forward to the new release/update.

Jason




Re: Slow failover from primary to backup server

Posted by andytaylor <an...@gmail.com>.

Coincidentally, I have been looking at replication and noticed some issues
like the ones you mentioned. I've raised a Jira and will be looking to
improve this over the next couple of weeks:
https://issues.apache.org/jira/browse/ARTEMIS-866. Feel free to add any
scenarios that you are having difficulties with and I will make sure they
are addressed.

Andy

On 24 November 2016 at 23:00, JasonHs [via ActiveMQ] <
ml-node+s2283324n4719485h15@n4.nabble.com> wrote:

> Thanks Tim.
>
> I have since adjusted the connection TTL cluster setting, and the
> failover seems to be faster now.
>
> However, as part of the failover testing, I'm still seeing a split-brain
> scenario: I manually block port 61616 on the master server, wait until
> the backup becomes 'live', and then unblock the port on the master
> server. At that point I have 2 'live' servers.
>
> I have tested this with 3 servers, and still the same result.
>
> Cheers
> Jason

Re: Slow failover from primary to backup server

Posted by JasonHs <hs...@gmail.com>.

Thanks Tim. 

I have since adjusted the connection TTL cluster setting, and the
failover seems to be faster now.

However, as part of the failover testing, I'm still seeing a split-brain
scenario: I manually block port 61616 on the master server, wait until
the backup becomes 'live', and then unblock the port on the master
server. At that point I have 2 'live' servers.

I have tested this with 3 servers, and still the same result.

Cheers
Jason




Re: Slow failover from primary to backup server

Posted by Tim Bain <tb...@alumni.duke.edu>.

Is the amount of time different when the server goes down due to a graceful
shutdown vs. a hard kill (kill -9 or equivalent)?  Could what you're seeing
be related to the amount of time it takes to detect that a TCP connection
has been severed without a clean shutdown?

Tim

On Nov 23, 2016 9:58 PM, "JasonHs" <hs...@gmail.com> wrote:

> Hi all,
>
> I'm running a 2-node Artemis cluster in replication mode. Everything is
> running as it should, but we noticed a variance in the time it takes for
> the 'backup' server to become 'live' when the primary server goes down.
> Failover can take as little as 10~20 seconds or up to a few minutes.
>
> I tried changing quite a few cluster-connection settings according to the
> documentation, but had no luck so far.
>
> Also, is it recommended to run a 3-node cluster to avoid a split-brain
> scenario, where both nodes in a 2-node cluster think they should be 'live'
> during a temporary network outage?
>
> Thanks in advance,
> Jason