Posted to issues@activemq.apache.org by "Sebastian Lövdahl (Jira)" <ji...@apache.org> on 2020/08/28 06:54:00 UTC

[jira] [Commented] (ARTEMIS-2690) Intermittent network failure caused live and replica to both be live

    [ https://issues.apache.org/jira/browse/ARTEMIS-2690?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17186313#comment-17186313 ] 

Sebastian Lövdahl commented on ARTEMIS-2690:
--------------------------------------------

Unfortunately, it looks like explicitly setting {{quorum-size}} didn't help; we have still seen the same behaviour twice, where both the live and the replica ended up being live at the same time. One interesting thing, though, is that stopping and starting the node that is supposed to be the replica (the one that erroneously became live) does NOT solve the problem. It starts in live mode again, so it seems that it somehow doesn't notice that the actual live node is running. Does anyone have any ideas? I'm starting to feel kind of lost here.
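For reference, here is a minimal sketch of the change as applied to live1's policy (placing {{quorum-size}} next to the other replication elements follows the HA documentation; the value 2 is purely illustrative):

{code:xml}
<!-- Sketch only: quorum-size placed under the replication master policy
     as described in the Artemis HA documentation; the value 2 is
     illustrative, not necessarily the value used in our cluster. -->
<ha-policy>
  <replication>
    <master>
      <cluster-name>my-cluster</cluster-name>
      <group-name>group1</group-name>
      <check-for-live-server>true</check-for-live-server>
      <vote-on-replication-failure>true</vote-on-replication-failure>
      <quorum-size>2</quorum-size>
    </master>
  </replication>
</ha-policy>
{code}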

> Intermittent network failure caused live and replica to both be live
> --------------------------------------------------------------------
>
>                 Key: ARTEMIS-2690
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-2690
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 2.11.0
>         Environment: Artemis 2.11.0, Ubuntu 18.04
>            Reporter: Sebastian Lövdahl
>            Priority: Major
>         Attachments: live1-artemis.log, live1-broker.xml, live2-artemis.log, live2-broker.xml, live3-artemis.log, live3-broker.xml, replica1-artemis.log, replica1-broker.xml
>
>
> An intermittent network failure caused both the live and replica to be live. Both happily accepted incoming connections until the node that was supposed to be the replica was manually shut down. Log files from all 4 nodes are attached. The {{replica1}} node happened to have some TRACE logging enabled as well.
>  
> As far as I have understood the documentation, the setup should be safe from a split-brain point of view. The live2 and live3 nodes intentionally don't have any replicas at the moment. Complete {{broker.xml}} files are attached, but for reference, this is the {{ha-policy}}:
> live1:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>       <cluster-name>my-cluster</cluster-name>
>       <group-name>group1</group-name>
>       <check-for-live-server>true</check-for-live-server>
>       <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}
> replica1:
> {code:xml}
> <ha-policy>
>   <replication>
>     <slave>
>        <cluster-name>my-cluster</cluster-name>
>        <group-name>group1</group-name>
>        <allow-failback>true</allow-failback>
>        <vote-on-replication-failure>true</vote-on-replication-failure>
>     </slave>
>   </replication>
> </ha-policy>
> {code}
> live2:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>        <cluster-name>my-cluster</cluster-name>
>        <group-name>group2</group-name>
>        <check-for-live-server>true</check-for-live-server>
>        <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}
> live3:
> {code:xml}
> <ha-policy>
>   <replication>
>     <master>
>        <cluster-name>my-cluster</cluster-name>
>        <group-name>group2</group-name>
>        <check-for-live-server>true</check-for-live-server>
>        <vote-on-replication-failure>true</vote-on-replication-failure>
>     </master>
>   </replication>
> </ha-policy>
> {code}


