Posted to users@activemq.apache.org by Anindya Haldar <an...@oracle.com> on 2018/06/14 00:52:17 UTC

Questions on HA cluster and split brain

I have some questions related to the HA cluster, failover and split-brain cases.

Suppose I have set up a 3 node cluster with:

A = master
B = slave 1
C = slave 2

Also suppose they are all part of the same group, and are set up to offer replication-based HA.
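
For concreteness, a replication group like this is typically declared per broker in broker.xml. This is a hedged sketch based on the Artemis HA documentation, with an illustrative group name:

```xml
<!-- On A (the master) -->
<ha-policy>
  <replication>
    <master>
      <group-name>my-group</group-name>
    </master>
  </replication>
</ha-policy>

<!-- On B and C (the slaves) -->
<ha-policy>
  <replication>
    <slave>
      <group-name>my-group</group-name>
    </slave>
  </replication>
</ha-policy>
```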

Scenario 1
========
Say,

B starts up and finds A
B becomes the designated backup for A
C starts up, and tries to find a live server in this group
C figures that A already has a designated backup, which is B
C keeps waiting until the network topology is changed


Q1: At this point, will the transaction logs replicate from A to C?

Now let’s say

Node A (the current master) fails
B becomes the new master

Q2: At this point, will C become the new backup for B, assuming A remains in a failed state?

Q3: If the answer to Q2 is yes, B will start replicating its journals to C; is that correct?


Scenario 2 (split brain detection case)
=============================
Say,

B detects a transient network failure with A
B wants to figure out if it needs to take over and be the new master
B starts a quorum voting process

The manual says this in the ‘High Availability and Failover’ section: 

"Specifically, the backup will become active when it loses connection to its live server. This can be problematic because this can also happen because of a temporary network problem. In order to address this issue, the backup will try to determine whether it still can connect to the other servers in the cluster. If it can connect to more than half the servers, it will become active, if more than half the servers also disappeared with the live, the backup will wait and try reconnecting with the live. This avoids a split brain situation."

Q4: At this point, which nodes are expected to participate in quorum voting? All of A, B and C? Or only A and C (B excludes itself from the set)? When it says "half the servers", I read it as B including itself in the quorum vote. Is that the case?

Whereas in the ‘Avoiding Network Isolation’ section, the manual says this:

“Quorum voting is used by both the live and the backup to decide what to do if a replication connection is disconnected. Basically the server will request each live server in the cluster to vote as to whether it thinks the server it is replicating to or from is still alive. This being the case the minimum number of live/backup pairs needed is 3."
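
The minimum-of-3 requirement in that quote follows from strict-majority arithmetic among the live voters. A hedged sketch (again a toy model, not Artemis code):

```python
def vote_passes(yes_votes: int, live_voters: int) -> bool:
    """Toy model: a quorum vote carries only with a strict majority
    of the live servers polled."""
    return yes_votes > live_voters / 2

# With a single live server there is no one meaningful to poll, and with
# two, losing one voter makes a real majority indistinguishable from a
# partition; three live servers is the smallest set where losing one
# still leaves a decisive majority (2 of 3).
assert vote_passes(2, 3) is True
assert vote_passes(1, 3) is False
```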

Q5: This implies only the live servers participate in quorum voting. Is that correct?

Q6: If the answer to Q5 is yes, then how does the split brain detection (as described in the quoted text right before Q4) work?

Q7: The text implies that in order to avoid split brain, a cluster needs at least 3 live/backup PAIRS. To me that implies at least 6 broker instances are needed in such a cluster; but that is kind of hard to believe, and I feel (I may be wrong) it actually means 3 broker instances, assuming scenarios 1 and 2 as described earlier are valid ones. Can you please clarify?

I would appreciate it if someone could offer clarity on these questions.

Thanks,
Anindya Haldar
Oracle Marketing Cloud


Re: Questions on HA cluster and split brain

Posted by Clebert Suconic <cl...@gmail.com>.
I used a pet project I created to generate a commit report for
Artemis, comparing 2.4.0 and 2.6.1, to give you an idea of what we did
between those releases:
https://clebertsuconic.github.io/report-2.4.0-til-2.6.1.html


I had gone through that list before and I didn't see any disruptive
feature. Quite the contrary: we added a lot more compatibility tests and
the testsuite is cleaner... everything is good to go in 2.6.1.


We should soon propose 2.6.2, which will include a few extra fixes from
the 2.6.x branch:
https://clebertsuconic.github.io/report-2.6.1-til-2.6.x.html



On Thu, Jun 14, 2018 at 8:11 PM, Anindya Haldar
<an...@oracle.com> wrote:
> We started with 2.4.0, and are so far trying to understand how we can use it effectively for our purpose. That said, the later versions of Artemis are definitely on our radar.
>
> Thanks,
> Anindya Haldar
> Oracle Marketing Cloud
>
>
>
>> On Jun 14, 2018, at 4:48 PM, Clebert Suconic <cl...@gmail.com> wrote:
>>
>> I think you should use 2.6.1.  There is nothing in it that is not
>> equivalent, and it is where we are actively fixing issues now.
>>
>> On Thu, Jun 14, 2018 at 6:40 PM Anindya Haldar <anindya.haldar@oracle.com <ma...@oracle.com>>
>> wrote:
>>
>>> Thanks, again, for your quick response.
>>>
>>> Anindya Haldar
>>> Oracle Marketing Cloud
>>>
>>>
>>>> On Jun 14, 2018, at 3:34 PM, Justin Bertram <jbertram@apache.org <ma...@apache.org>> wrote:
>>>>
>>>>> 1) It is possible to define multiple groups within a cluster, and a
>>>> subset of the brokers in the cluster can be members of a specific group.
>>> Is
>>>> that correct?
>>>> Yes.
>>>>
>>>>> 2) The live-backup relationship is guided by group membership, when
>>> there
>>>> is explicit group membership defined. Is that correct?
>>>>
>>>> Yes.
>>>>
>>>>> 3) When a backup or a live server in a group starts the quorum voting
>>>> process, other live servers in the cluster, even though they may not
>>> be
>>>> part of the same group, can participate in the quorum. Meaning the
>>> ability
>>>> to participate in quorum voting is defined by cluster membership, and not
>>>> by group membership within the cluster. Is that understanding correct?
>>>>
>>>> Yes.
>>>>
>>>>
>>>> In short, a "group" allows the pairing of specific live and backup
>>> brokers
>>>> together in the replicated HA use-case.
>>>>
>>>>
>>>> Justin
>>>>
>>>>
>>>> On Thu, Jun 14, 2018 at 5:19 PM, Anindya Haldar <
>>> anindya.haldar@oracle.com>
>>>> wrote:
>>>>
>>>>> I have a few quick follow up questions. From the discussion here, and
>>> from
>>>>> what I understand reading the Artemis manual, here is my understanding
>>>>> about the idea of a cluster vs. the idea of a group within a cluster:
>>>>>
>>>>> 1) It is possible to define multiple groups within a cluster, and a
>>> subset
>>>>> of the brokers in the cluster can be members of a specific group. Is
>>> that
>>>>> correct?
>>>>>
>>>>> 2) The live-backup relationship is guided by group membership, when
>>> there
>>>>> is explicit group membership defined. Is that correct?
>>>>>
>>>>> 3) When a backup or a live server in a group starts the quorum voting
>>>>> process, other live servers in the cluster, even though they may not
>>> be
>>>>> part of the same group, can participate in the quorum. Meaning the
>>> ability
>>>>> to participate in quorum voting is defined by cluster membership, and
>>> not
>>>>> by group membership within the cluster. Is that understanding correct?
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Anindya Haldar
>>>>> Oracle Marketing Cloud
>>>>>
>>>>>
>>>>>> On Jun 14, 2018, at 9:57 AM, Anindya Haldar <anindya.haldar@oracle.com
>>>>
>>>>> wrote:
>>>>>>
>>>>>> Many thanks, Justin. This makes things much clearer for us when it
>>> comes
>>>>> to designing the HA cluster.
>>>>>>
>>>>>> As for the Artemis evaluation scope, we want to use it as one of the
>>>>> supported messaging backbones in our application suite. The application
>>>>> suite requires strong transactional guarantees, high availability, and
>>> high
>>>>> performance and scale, amongst other things. We are looking towards a
>>> full
>>>>> blown technology evaluation with those needs in mind.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Anindya Haldar
>>>>>> Oracle Marketing Cloud
>>>>>>
>>>>>>
>>>>>>> On Jun 13, 2018, at 7:23 PM, Justin Bertram <jb...@apache.org>
>>>>> wrote:
>>>>>>>
>>>>>>>> Q1: At this point, will the transaction logs replicate from A to C?
>>>>>>>
>>>>>>> No.  A will be replicating to B since B is the designated backup.
>>>>> Also, by
>>>>>>> "transaction logs" I assume you mean what the Artemis documentation
>>>>> refers
>>>>>>> to as the journal (i.e. all persistent message data).
>>>>>>>
>>>>>>>> Q2: At this point, will C become the new backup for B, assuming A
>>>>>>>> remains in a failed state?
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>> Q3: If the answer to Q2 is yes, B will start replicating its journals
>>>>> to
>>>>>>> C; is that correct?
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>> Q4: At this point, which nodes are expected to participate in quorum
>>>>>>> voting? All of A, B and C? Or A and C only (B excludes itself from the
>>>>>>> set)? When it says "half the servers”, I read it in a way that B
>>>>> includes
>>>>>>> itself in the quorum voting. Is that the case?
>>>>>>>
>>>>>>> A would be the only server available to participate in the quorum
>>> voting
>>>>>>> since it is the only live server.  However, since B can't reach A,
>>>>>>> B would not receive any quorum vote responses.  B doesn't vote; it
>>>>>>> simply asks for a vote.
>>>>>>>
>>>>>>>> Q5: This implies only the live servers participate in quorum voting.
>>> Is
>>>>>>> that correct?
>>>>>>>
>>>>>>> Yes.
>>>>>>>
>>>>>>>> Q6: If the answer to Q5 is yes, then how does the split brain
>>> detection
>>>>>>> (as described in the quoted text right before Q4) work?
>>>>>>>
>>>>>>> It works by having multiple voting members (i.e. live servers) in the
>>>>>>> cluster.  The topology you've described with a single live and 2
>>>>> backups is
>>>>>>> not sufficient to mitigate against split brain.
>>>>>>>
>>>>>>>> Q7: The text implies that in order to avoid split brain, a cluster
>>>>> needs
>>>>>>> at least 3 live/backup PAIRS.
>>>>>>>
>>>>>>> That is correct - 3 live/backup pairs.
>>>>>>>
>>>>>>>> To me that implies at least 6 broker instances are needed in such a
>>>>>>> cluster; but that is kind of hard to believe, and I feel (I may be
>>>>> wrong)
>>>>>>> it actually means 3 broker instances, assuming scenarios 1 and 2 as
>>>>>>> described earlier are valid ones. Can you please clarify?
>>>>>>>
>>>>>>> What you feel is incorrect.  That said, the live & backup instances
>>> can
>>>>> be
>>>>>>> colocated which means although there are 6 total broker instances
>>> only 3
>>>>>>> machines are required.
>>>>>>>
>>>>>>> I think implementing a feature whereby backups can participate in the
>>>>>>> quorum vote would be a great addition to the broker.  Unfortunately I
>>>>>>> haven't had time to contribute such a feature.
>>>>>>>
>>>>>>>
>>>>>>> If I may ask a question of my own...Your emails to this list have
>>>>> piqued my
>>>>>>> interest and I'm curious to know to what end you are evaluating
>>> Artemis
>>>>>>> since you apparently work for Oracle on a cloud related team and
>>> Oracle
>>>>>>> already has a cloud messaging solution.  Can you elaborate at all?
>>>>>>>
>>>>>>>
>>>>>>> Justin
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Jun 13, 2018 at 7:56 PM, Anindya Haldar <
>>>>> anindya.haldar@oracle.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> BTW, these are questions related to Artemis 2.4.0, which is what we
>>> are
>>>>>>>> evaluating right now for our solution.
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Jun 13, 2018, at 5:52 PM, Anindya Haldar <
>>>>> anindya.haldar@oracle.com>
>>>>>>>> wrote:



-- 
Clebert Suconic


Re: Questions on HA cluster and split brain

Posted by Clebert Suconic <cl...@gmail.com>.
I think you should use 2.6.1.  There is nothing that it is not equivalent.
And where we are actively fixing issues now.

On Thu, Jun 14, 2018 at 6:40 PM Anindya Haldar <an...@oracle.com>
wrote:

> Thanks, again, for your quick response.
>
> Anindya Haldar
> Oracle Marketing Cloud
>
>
> > On Jun 14, 2018, at 3:34 PM, Justin Bertram <jb...@apache.org> wrote:
> >
> >> 1) It is possible to define multiple groups within a cluster, and a
> > subset of the brokers in the cluster can be members of a specific group.
> Is
> > that correct?
> > her
> > Yes.
> >
> >> 2) The live-backup relationship is guided by group membership, when
> there
> > is explicit group membership defined. Is that correct?
> >
> > Yes.
> >
> >> 3) When a backup or a live server in a group starts the quorum voting
> > process, other live servers in the cluster, even if though they may not
> be
> > part of the same group, can participate in the quorum. Meaning the
> ability
> > to participate in quorum voting is defined by cluster membership, and not
> > by group membership within the cluster. Is that understanding correct?
> >
> > Yes.
> >
> >
> > In short, a "group" allows the pairing of specific live and backup
> brokers
> > together in the replicated HA use-case.
> >
> >
> > Justin
> >
> >
> > On Thu, Jun 14, 2018 at 5:19 PM, Anindya Haldar <
> anindya.haldar@oracle.com>
> > wrote:
> >
> >> I have a few quick follow up questions. From the discussion here, and
> from
> >> what I understand reading the Artemis manual, here is my understanding
> >> about the idea of a cluster vs. the idea of a group within a cluster:
> >>
> >> 1) It is possible to define multiple groups within a cluster, and a
> subset
> >> of the brokers in the cluster can be members of a specific group. Is
> that
> >> correct?
> >>
> >> 2) The live-backup relationship is guided by group membership, when
> there
> >> is explicit group membership defined. Is that correct?
> >>
> >> 3) When a backup or a live server in a group starts the quorum voting
> >> process, other live servers in the cluster, even if though they may not
> be
> >> part of the same group, can participate in the quorum. Meaning the
> ability
> >> to participate in quorum voting is defined by cluster membership, and
> not
> >> by group membership within the cluster. Is that understanding correct?
> >>
> >> Thanks,
> >>
> >> Anindya Haldar
> >> Oracle Marketing Cloud
> >>
> >>
> >>> On Jun 14, 2018, at 9:57 AM, Anindya Haldar <anindya.haldar@oracle.com
> >
> >> wrote:
> >>>
> >>> Many thanks, Justin. This makes things much clearer for us when it
> comes
> >> to designing the HA cluster.
> >>>
> >>> As for the Artemis evaluation scope, we want to use it as one of the
> >> supported messaging backbones in our application suite. The application
> >> suite requires strong transactional guarantees, high availability, and
> high
> >> performance and scale, amongst other things. We are looking towards a
> full
> >> blown technology evaluation with those needs in mind.
> >>>
> >>> Thanks,
> >>>
> >>> Anindya Haldar
> >>> Oracle Marketing Cloud
> >>>
> >>>
> >>>> On Jun 13, 2018, at 7:23 PM, Justin Bertram <jb...@apache.org>
> >> wrote:
> >>>>
> >>>>> Q1: At this point, will the transaction logs replicate from A to C?
> >>>>
> >>>> No.  A will be replicating to B since B is the designated backup.
> >> Also, by
> >>>> "transaction logs" I assume you mean what the Artemis documentation
> >> refers
> >>>> to as the journal (i.e. all persistent message data).
> >>>>
> >>>>> Q2: At this point will C become to new new back up for B, assuming A
> >>>> remains in failed state?
> >>>>
> >>>> Yes.
> >>>>
> >>>>> Q3: If the answer to Q2 is yes, B will start replicating its journals
> >> to
> >>>> C; is that correct?
> >>>>
> >>>> Yes.
> >>>>
> >>>>> Q4: At this point, which nodes are expected to participate in quorum
> >>>> voting? All of A, B and C? Or A and C only (B excludes itself from the
> >>>> set)? When it says "half the servers”, I read it in a way that B
> >> includes
> >>>> itself in the quorum voting. Is that the case?
> >>>>
> >>>> A would be the only server available to participate in the quorum
> voting
> >>>> since it is the only live server.  However, since B can't reach A
> then B
> >>>> would not receive any quorum vote responses.  B doesn't vote; it
> simply
> >>>> asks for a vote.
> >>>>
> >>>>> Q5: This implies only the live servers participate in quorum voting.
> Is
> >>>> that correct?
> >>>>
> >>>> Yes.
> >>>>
> >>>>> Q6: If the answer to Q5 is yes, then how does the split brain
> detection
> >>>> (as described in the quoted text right before Q4) work?
> >>>>
> >>>> It works by having multiple voting members (i.e. live servers) in the
> >>>> cluster.  The topology you've described with a single live and 2
> >> backups is
> >>>> not sufficient to mitigate against split brain.
> >>>>
> >>>>> Q7: The text implies that in order to avoid split brain, a cluster
> >> needs
> >>>> at least 3 live/backup PAIRS.
> >>>>
> >>>> That is correct - 3 live/backup pairs.
> >>>>
> >>>>> To me that implies at least 6 broker instances are needed in such a
> >>>> cluster; but that is kind of hard to believe, and I feel (I may be
> >> wrong)
> >>>> it actually means 3 broker instances, assuming scenarios 1 and 2 as
> >>>> described earlier are valid ones. Can you please clarify?
> >>>>
> >>>> What you feel is incorrect.  That said, the live & backup instances
> can
> >> be
> >>>> colocated which means although there are 6 total broker instances
> only 3
> >>>> machines are required.
> >>>>
> >>>> I think implementing a feature whereby backups can participate in the
> >>>> quorum vote would be a great addition to the broker.  Unfortunately I
> >>>> haven't had time to contribute such a feature.
> >>>>
> >>>>
> >>>> If I may ask a question of my own...Your emails to this list have
> >> piqued my
> >>>> interest and I'm curious to know to what end you are evaluating
> Artemis
> >>>> since you apparently work for Oracle on a cloud related team and
> Oracle
> >>>> already has a cloud messaging solution.  Can you elaborate at all?
> >>>>
> >>>>
> >>>> Justin
> >>>>
> >>>>
> >>>> On Wed, Jun 13, 2018 at 7:56 PM, Anindya Haldar <
> >> anindya.haldar@oracle.com>
> >>>> wrote:
> >>>>
> >>>>> BTW, these are questions related to Artemis 2.4.0, which is what we
> are
> >>>>> evaluating right now for our solution.
> >>>>>
> >>>>>
> >>>>>> On Jun 13, 2018, at 5:52 PM, Anindya Haldar <
> >> anindya.haldar@oracle.com>
> >>>>> wrote:
> >>>>>>
> >>>>>> I have some questions related to the HA cluster, failover and
> >>>>> split-brain cases.
> >>>>>>
> >>>>>> Suppose I have set up a 3 node cluster with:
> >>>>>>
> >>>>>> A = master
> >>>>>> B = slave 1
> >>>>>> C = slave 2
> >>>>>>
> >>>>>> Also suppose they are all part of the same group, and are set up to
> offer
> >>>>> replication based HA.
> >>>>>>
> >>>>>> Scenario 1
> >>>>>> ========
> >>>>>> Say,
> >>>>>>
> >>>>>> B starts up and finds A
> >>>>>> B becomes the designated backup for A
> >>>>>> C starts up, and tries to find a live server in this group
> >>>>>> C figures that A already has a designated backup, which is B
> >>>>>> C keeps waiting until the network topology is changed
> >>>>>>
> >>>>>>
> >>>>>> Q1: At this point, will the transaction logs replicate from A to C?
> >>>>>>
> >>>>>> Now let’s say
> >>>>>>
> >>>>>> Node A (the current master) fails
> >>>>>> B becomes the new master
> >>>>>>
> >>>>>> Q2: At this point, will C become the new backup for B, assuming A
> >>>>> remains in failed state?
> >>>>>>
> >>>>>> Q3: If the answer to Q2 is yes, B will start replicating its
> journals
> >> to
> >>>>> C; is that correct?
> >>>>>>
> >>>>>>
> >>>>>> Scenario 2 (split brain detection case)
> >>>>>> =============================
> >>>>>> Say,
> >>>>>>
> >>>>>> B detects a transient network failure with A
> >>>>>> B wants to figure out if it needs to take over and be the new master
> >>>>>> B starts a quorum voting process
> >>>>>>
> >>>>>> The manual says this in the ‘High Availability and Failover’
> section:
> >>>>>>
> >>>>>> "Specifically, the backup will become active when it loses
> connection
> >> to
> >>>>> its live server. This can be problematic because this can also happen
> >>>>> because of a temporary network problem. In order to address this
> >> issue, the
> >>>>> backup will try to determine whether it still can connect to the
> other
> >>>>> servers in the cluster. If it can connect to more than half the
> >> servers, it
> >>>>> will become active, if more than half the servers also disappeared
> >> with the
> >>>>> live, the backup will wait and try reconnecting with the live. This
> >> avoids
> >>>>> a split brain situation."
> >>>>>>
> >>>>>> Q4: At this point, which nodes are expected to participate in quorum
> >>>>> voting? All of A, B and C? Or A and C only (B excludes itself from
> the
> >>>>> set)? When it says "half the servers”, I read it in a way that B
> >> includes
> >>>>> itself in the quorum voting. Is that the case?
> >>>>>>
> >>>>>> Whereas in the ‘Avoiding Network Isolation’ section, the manual says
> >>>>> this:
> >>>>>>
> >>>>>> “Quorum voting is used by both the live and the backup to decide
> what
> >> to
> >>>>> do if a replication connection is disconnected. Basically the server
> >> will
> >>>>> request each live server in the cluster to vote as to whether it
> >> thinks the
> >>>>> server it is replicating to or from is still alive. This being the
> >> case the
> >>>>> minimum number of live/backup pairs needed is 3."
> >>>>>>
> >>>>>> Q5: This implies only the live servers participate in quorum voting.
> >> Is
> >>>>> that correct?
> >>>>>>
> >>>>>> Q6: If the answer to Q5 is yes, then how does the split brain
> >> detection
> >>>>> (as described in the quoted text right before Q4) work?
> >>>>>>
> >>>>>> Q7: The text implies that in order to avoid split brain, a cluster
> >> needs
> >>>>> at least 3 live/backup PAIRS. To me that implies at least 6 broker
> >>>>> instances are needed in such a cluster; but that is kind of hard to
> >>>>> believe, and I feel (I may be wrong) it actually means 3 broker
> >> instances,
> >>>>> assuming scenarios 1 and 2 as described earlier are valid ones. Can
> you
> >>>>> please clarify?
> >>>>>>
> >>>>>> I would appreciate it if someone could offer clarity on these questions.
> >>>>>>
> >>>>>> Thanks,
> >>>>>> Anindya Haldar
> >>>>>> Oracle Marketing Cloud
> >>>>>>
> >>>>>
> >>>>>
> >>>
> >>
> >>
>
> --
Clebert Suconic

Re: Questions on HA cluster and split brain

Posted by Anindya Haldar <an...@oracle.com>.
Thanks, again, for your quick response.

Anindya Haldar
Oracle Marketing Cloud


> On Jun 14, 2018, at 3:34 PM, Justin Bertram <jb...@apache.org> wrote:
> 
>> 1) It is possible to define multiple groups within a cluster, and a
> subset of the brokers in the cluster can be members of a specific group. Is
> that correct?
> 
> Yes.
> 
>> 2) The live-backup relationship is guided by group membership, when there
> is explicit group membership defined. Is that correct?
> 
> Yes.
> 
>> 3) When a backup or a live server in a group starts the quorum voting
> process, other live servers in the cluster, even though they may not be
> part of the same group, can participate in the quorum. Meaning the ability
> to participate in quorum voting is defined by cluster membership, and not
> by group membership within the cluster. Is that understanding correct?
> 
> Yes.
> 
> 
> In short, a "group" allows the pairing of specific live and backup brokers
> together in the replicated HA use-case.
> 
> 
> Justin
> 
> 
> On Thu, Jun 14, 2018 at 5:19 PM, Anindya Haldar <an...@oracle.com>
> wrote:
> 
>> I have a few quick follow up questions. From the discussion here, and from
>> what I understand reading the Artemis manual, here is my understanding
>> about the idea of a cluster vs. the idea of a group within a cluster:
>> 
>> 1) It is possible to define multiple groups within a cluster, and a subset
>> of the brokers in the cluster can be members of a specific group. Is that
>> correct?
>> 
>> 2) The live-backup relationship is guided by group membership, when there
>> is explicit group membership defined. Is that correct?
>> 
>> 3) When a backup or a live server in a group starts the quorum voting
>> process, other live servers in the cluster, even though they may not be
>> part of the same group, can participate in the quorum. Meaning the ability
>> to participate in quorum voting is defined by cluster membership, and not
>> by group membership within the cluster. Is that understanding correct?
>> 
>> Thanks,
>> 
>> Anindya Haldar
>> Oracle Marketing Cloud
>> 
>> 
>>> On Jun 14, 2018, at 9:57 AM, Anindya Haldar <an...@oracle.com>
>> wrote:
>>> 
>>> Many thanks, Justin. This makes things much clearer for us when it comes
>> to designing the HA cluster.
>>> 
>>> As for the Artemis evaluation scope, we want to use it as one of the
>> supported messaging backbones in our application suite. The application
>> suite requires strong transactional guarantees, high availability, and high
>> performance and scale, amongst other things. We are looking towards a full
>> blown technology evaluation with those needs in mind.
>>> 
>>> Thanks,
>>> 
>>> Anindya Haldar
>>> Oracle Marketing Cloud
>>> 
>>> 
>>>> On Jun 13, 2018, at 7:23 PM, Justin Bertram <jb...@apache.org>
>> wrote:
>>>> 
>>>>> Q1: At this point, will the transaction logs replicate from A to C?
>>>> 
>>>> No.  A will be replicating to B since B is the designated backup.
>> Also, by
>>>> "transaction logs" I assume you mean what the Artemis documentation
>> refers
>>>> to as the journal (i.e. all persistent message data).
>>>> 
>>>>> Q2: At this point, will C become the new backup for B, assuming A
>>>> remains in failed state?
>>>> 
>>>> Yes.
>>>> 
>>>>> Q3: If the answer to Q2 is yes, B will start replicating its journals
>> to
>>>> C; is that correct?
>>>> 
>>>> Yes.
>>>> 
>>>>> Q4: At this point, which nodes are expected to participate in quorum
>>>> voting? All of A, B and C? Or A and C only (B excludes itself from the
>>>> set)? When it says "half the servers”, I read it in a way that B
>> includes
>>>> itself in the quorum voting. Is that the case?
>>>> 
>>>> A would be the only server available to participate in the quorum voting
>>>> since it is the only live server.  However, since B can't reach A then B
>>>> would not receive any quorum vote responses.  B doesn't vote; it simply
>>>> asks for a vote.
>>>> 
>>>>> Q5: This implies only the live servers participate in quorum voting. Is
>>>> that correct?
>>>> 
>>>> Yes.
>>>> 
>>>>> Q6: If the answer to Q5 is yes, then how does the split brain detection
>>>> (as described in the quoted text right before Q4) work?
>>>> 
>>>> It works by having multiple voting members (i.e. live servers) in the
>>>> cluster.  The topology you've described with a single live and 2
>> backups is
>>>> not sufficient to mitigate against split brain.
>>>> 
>>>>> Q7: The text implies that in order to avoid split brain, a cluster
>> needs
>>>> at least 3 live/backup PAIRS.
>>>> 
>>>> That is correct - 3 live/backup pairs.
>>>> 
>>>>> To me that implies at least 6 broker instances are needed in such a
>>>> cluster; but that is kind of hard to believe, and I feel (I may be
>> wrong)
>>>> it actually means 3 broker instances, assuming scenarios 1 and 2 as
>>>> described earlier are valid ones. Can you please clarify?
>>>> 
>>>> What you feel is incorrect.  That said, the live & backup instances can
>> be
>>>> colocated which means although there are 6 total broker instances only 3
>>>> machines are required.
>>>> 
>>>> I think implementing a feature whereby backups can participate in the
>>>> quorum vote would be a great addition to the broker.  Unfortunately I
>>>> haven't had time to contribute such a feature.
>>>> 
>>>> 
>>>> If I may ask a question of my own...Your emails to this list have
>> piqued my
>>>> interest and I'm curious to know to what end you are evaluating Artemis
>>>> since you apparently work for Oracle on a cloud related team and Oracle
>>>> already has a cloud messaging solution.  Can you elaborate at all?
>>>> 
>>>> 
>>>> Justin
>>>> 
>>>> 
>>>> On Wed, Jun 13, 2018 at 7:56 PM, Anindya Haldar <
>> anindya.haldar@oracle.com>
>>>> wrote:
>>>> 
>>>>> BTW, these are questions related to Artemis 2.4.0, which is what we are
>>>>> evaluating right now for our solution.
>>>>> 
>>>>> 
>>>>>> On Jun 13, 2018, at 5:52 PM, Anindya Haldar <
>> anindya.haldar@oracle.com>
>>>>> wrote:
>>>>>> 
>>>>>> I have some questions related to the HA cluster, failover and
>>>>> split-brain cases.
>>>>>> 
>>>>>> Suppose I have set up a 3 node cluster with:
>>>>>> 
>>>>>> A = master
>>>>>> B = slave 1
>>>>>> C = slave 2
>>>>>> 
>>>>>> Also suppose they are all part of the same group, and are set up to offer
>>>>> replication based HA.
>>>>>> 
>>>>>> Scenario 1
>>>>>> ========
>>>>>> Say,
>>>>>> 
>>>>>> B starts up and finds A
>>>>>> B becomes the designated backup for A
>>>>>> C starts up, and tries to find a live server in this group
>>>>>> C figures that A already has a designated backup, which is B
>>>>>> C keeps waiting until the network topology is changed
>>>>>> 
>>>>>> 
>>>>>> Q1: At this point, will the transaction logs replicate from A to C?
>>>>>> 
>>>>>> Now let’s say
>>>>>> 
>>>>>> Node A (the current master) fails
>>>>>> B becomes the new master
>>>>>> 
>>>>>> Q2: At this point, will C become the new backup for B, assuming A
>>>>> remains in failed state?
>>>>>> 
>>>>>> Q3: If the answer to Q2 is yes, B will start replicating its journals
>> to
>>>>> C; is that correct?
>>>>>> 
>>>>>> 
>>>>>> Scenario 2 (split brain detection case)
>>>>>> =============================
>>>>>> Say,
>>>>>> 
>>>>>> B detects a transient network failure with A
>>>>>> B wants to figure out if it needs to take over and be the new master
>>>>>> B starts a quorum voting process
>>>>>> 
>>>>>> The manual says this in the ‘High Availability and Failover’ section:
>>>>>> 
>>>>>> "Specifically, the backup will become active when it loses connection
>> to
>>>>> its live server. This can be problematic because this can also happen
>>>>> because of a temporary network problem. In order to address this
>> issue, the
>>>>> backup will try to determine whether it still can connect to the other
>>>>> servers in the cluster. If it can connect to more than half the
>> servers, it
>>>>> will become active, if more than half the servers also disappeared
>> with the
>>>>> live, the backup will wait and try reconnecting with the live. This
>> avoids
>>>>> a split brain situation."
>>>>>> 
>>>>>> Q4: At this point, which nodes are expected to participate in quorum
>>>>> voting? All of A, B and C? Or A and C only (B excludes itself from the
>>>>> set)? When it says "half the servers”, I read it in a way that B
>> includes
>>>>> itself in the quorum voting. Is that the case?
>>>>>> 
>>>>>> Whereas in the ‘Avoiding Network Isolation’ section, the manual says
>>>>> this:
>>>>>> 
>>>>>> “Quorum voting is used by both the live and the backup to decide what
>> to
>>>>> do if a replication connection is disconnected. Basically the server
>> will
>>>>> request each live server in the cluster to vote as to whether it
>> thinks the
>>>>> server it is replicating to or from is still alive. This being the
>> case the
>>>>> minimum number of live/backup pairs needed is 3."
>>>>>> 
>>>>>> Q5: This implies only the live servers participate in quorum voting.
>> Is
>>>>> that correct?
>>>>>> 
>>>>>> Q6: If the answer to Q5 is yes, then how does the split brain
>> detection
>>>>> (as described in the quoted text right before Q4) work?
>>>>>> 
>>>>>> Q7: The text implies that in order to avoid split brain, a cluster
>> needs
>>>>> at least 3 live/backup PAIRS. To me that implies at least 6 broker
>>>>> instances are needed in such a cluster; but that is kind of hard to
>>>>> believe, and I feel (I may be wrong) it actually means 3 broker
>> instances,
>>>>> assuming scenarios 1 and 2 as described earlier are valid ones. Can you
>>>>> please clarify?
>>>>>> 
>>>>>> I would appreciate it if someone could offer clarity on these questions.
>>>>>> 
>>>>>> Thanks,
>>>>>> Anindya Haldar
>>>>>> Oracle Marketing Cloud
>>>>>> 
>>>>> 
>>>>> 
>>> 
>> 
>> 


Re: Questions on HA cluster and split brain

Posted by Justin Bertram <jb...@apache.org>.
> 1) It is possible to define multiple groups within a cluster, and a
subset of the brokers in the cluster can be members of a specific group. Is
that correct?

Yes.

> 2) The live-backup relationship is guided by group membership, when there
is explicit group membership defined. Is that correct?

Yes.

> 3) When a backup or a live server in a group starts the quorum voting
process, other live servers in the cluster, even though they may not be
part of the same group, can participate in the quorum. Meaning the ability
to participate in quorum voting is defined by cluster membership, and not
by group membership within the cluster. Is that understanding correct?

Yes.
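
To illustrate the rule the manual describes, where the backup activates only if it can still reach "more than half" of the cluster's live servers, here is a small sketch of the majority check. This is an illustration of the rule only, not Artemis's actual quorum implementation:

```python
def backup_should_activate(reachable_lives: int, total_lives: int) -> bool:
    """Majority rule: the backup takes over only if it can still reach
    more than half of the cluster's live servers, so the live it lost
    is presumed dead rather than partitioned away from the backup.
    Illustrative sketch only, not the Artemis implementation."""
    return reachable_lives > total_lives / 2

# Three live/backup pairs: the backup loses its live but still
# reaches the other 2 of 3 lives, so it has a majority and activates.
print(backup_should_activate(2, 3))  # True

# The single-live A/B/C topology discussed earlier: B loses A and can
# reach 0 of 1 lives, so there is no majority and B waits and retries.
print(backup_should_activate(0, 1))  # False
```

This also shows why a lone live/backup pair cannot distinguish a dead live from a network partition: with one live server, a disconnected backup can never reach a majority.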


In short, a "group" allows the pairing of specific live and backup brokers
together in the replicated HA use-case.
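
As a sketch of that pairing in configuration terms, matching group-name values in each broker's broker.xml tie a specific live and backup together. Element names follow the Artemis replication documentation; the group name and settings here are illustrative:

```
<!-- Live broker A: replicates its journal to whichever backup
     announces itself with the same group-name. -->
<ha-policy>
  <replication>
    <master>
      <group-name>pair-1</group-name>
      <!-- ask the cluster for a vote before restarting as live,
           to help avoid split brain on failback -->
      <check-for-live-server>true</check-for-live-server>
    </master>
  </replication>
</ha-policy>

<!-- Backup broker B: -->
<ha-policy>
  <replication>
    <slave>
      <group-name>pair-1</group-name>
      <allow-failback>true</allow-failback>
    </slave>
  </replication>
</ha-policy>
```

With C configured like B, it announces the same group name and waits until the backup slot for that group opens, matching Scenario 1 described earlier in the thread.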


Justin


On Thu, Jun 14, 2018 at 5:19 PM, Anindya Haldar <an...@oracle.com>
wrote:

> I have a few quick follow up questions. From the discussion here, and from
> what I understand reading the Artemis manual, here is my understanding
> about the idea of a cluster vs. the idea of a group within a cluster:
>
> 1) It is possible to define multiple groups within a cluster, and a subset
> of the brokers in the cluster can be members of a specific group. Is that
> correct?
>
> 2) The live-backup relationship is guided by group membership, when there
> is explicit group membership defined. Is that correct?
>
> 3) When a backup or a live server in a group starts the quorum voting
> process, other live servers in the cluster, even though they may not be
> part of the same group, can participate in the quorum. Meaning the ability
> to participate in quorum voting is defined by cluster membership, and not
> by group membership within the cluster. Is that understanding correct?
>
> Thanks,
>
> Anindya Haldar
> Oracle Marketing Cloud
>
>
> > On Jun 14, 2018, at 9:57 AM, Anindya Haldar <an...@oracle.com>
> wrote:
> >
> > Many thanks, Justin. This makes things much clearer for us when it comes
> to designing the HA cluster.
> >
> > As for the Artemis evaluation scope, we want to use it as one of the
> supported messaging backbones in our application suite. The application
> suite requires strong transactional guarantees, high availability, and high
> performance and scale, amongst other things. We are looking towards a full
> blown technology evaluation with those needs in mind.
> >
> > Thanks,
> >
> > Anindya Haldar
> > Oracle Marketing Cloud
> >
> >
> >> On Jun 13, 2018, at 7:23 PM, Justin Bertram <jb...@apache.org>
> wrote:
> >>
> >>> Q1: At this point, will the transaction logs replicate from A to C?
> >>
> >> No.  A will be replicating to B since B is the designated backup.
> Also, by
> >> "transaction logs" I assume you mean what the Artemis documentation
> refers
> >> to as the journal (i.e. all persistent message data).
> >>
> >>> Q2: At this point, will C become the new backup for B, assuming A
> >> remains in failed state?
> >>
> >> Yes.
> >>
> >>> Q3: If the answer to Q2 is yes, B will start replicating its journals
> to
> >> C; is that correct?
> >>
> >> Yes.
> >>
> >>> Q4: At this point, which nodes are expected to participate in quorum
> >> voting? All of A, B and C? Or A and C only (B excludes itself from the
> >> set)? When it says "half the servers”, I read it in a way that B
> includes
> >> itself in the quorum voting. Is that the case?
> >>
> >> A would be the only server available to participate in the quorum voting
> >> since it is the only live server.  However, since B can't reach A then B
> >> would not receive any quorum vote responses.  B doesn't vote; it simply
> >> asks for a vote.
> >>
> >>> Q5: This implies only the live servers participate in quorum voting. Is
> >> that correct?
> >>
> >> Yes.
> >>
> >>> Q6: If the answer to Q5 is yes, then how does the split brain detection
> >> (as described in the quoted text right before Q4) work?
> >>
> >> It works by having multiple voting members (i.e. live servers) in the
> >> cluster.  The topology you've described with a single live and 2
> backups is
> >> not sufficient to mitigate against split brain.
> >>
> >>> Q7: The text implies that in order to avoid split brain, a cluster
> needs
> >> at least 3 live/backup PAIRS.
> >>
> >> That is correct - 3 live/backup pairs.
> >>
> >>> To me that implies at least 6 broker instances are needed in such a
> >> cluster; but that is kind of hard to believe, and I feel (I may be
> wrong)
> >> it actually means 3 broker instances, assuming scenarios 1 and 2 as
> >> described earlier are valid ones. Can you please clarify?
> >>
> >> What you feel is incorrect.  That said, the live & backup instances can
> be
> >> colocated which means although there are 6 total broker instances only 3
> >> machines are required.
> >>
> >> I think implementing a feature whereby backups can participate in the
> >> quorum vote would be a great addition to the broker.  Unfortunately I
> >> haven't had time to contribute such a feature.
> >>
> >>
> >> If I may ask a question of my own...Your emails to this list have
> piqued my
> >> interest and I'm curious to know to what end you are evaluating Artemis
> >> since you apparently work for Oracle on a cloud related team and Oracle
> >> already has a cloud messaging solution.  Can you elaborate at all?
> >>
> >>
> >> Justin
> >>
> >>
> >> On Wed, Jun 13, 2018 at 7:56 PM, Anindya Haldar <
> anindya.haldar@oracle.com>
> >> wrote:
> >>
> >>> BTW, these are questions related to Artemis 2.4.0, which is what we are
> >>> evaluating right now for our solution.
> >>>
> >>>
> >>>> On Jun 13, 2018, at 5:52 PM, Anindya Haldar <
> anindya.haldar@oracle.com>
> >>> wrote:
> >>>>
> >>>> I have some questions related to the HA cluster, failover and
> >>> split-brain cases.
> >>>>
> >>>> Suppose I have set up a 3 node cluster with:
> >>>>
> >>>> A = master
> >>>> B = slave 1
> >>>> C = slave 2
> >>>>
> >>>> Also suppose they are all part of the same group, and are set up to offer
> >>> replication based HA.
> >>>>
> >>>> Scenario 1
> >>>> ========
> >>>> Say,
> >>>>
> >>>> B starts up and finds A
> >>>> B becomes the designated backup for A
> >>>> C starts up, and tries to find a live server in this group
> >>>> C figures that A already has a designated backup, which is B
> >>>> C keeps waiting until the network topology is changed
> >>>>
> >>>>
> >>>> Q1: At this point, will the transaction logs replicate from A to C?
> >>>>
> >>>> Now let’s say
> >>>>
> >>>> Node A (the current master) fails
> >>>> B becomes the new master
> >>>>
> >>>> Q2: At this point, will C become the new backup for B, assuming A
> >>> remains in failed state?
> >>>>
> >>>> Q3: If the answer to Q2 is yes, B will start replicating its journals
> to
> >>> C; is that correct?
> >>>>
> >>>>
> >>>> Scenario 2 (split brain detection case)
> >>>> =============================
> >>>> Say,
> >>>>
> >>>> B detects a transient network failure with A
> >>>> B wants to figure out if it needs to take over and be the new master
> >>>> B starts a quorum voting process
> >>>>
> >>>> The manual says this in the ‘High Availability and Failover’ section:
> >>>>
> >>>> "Specifically, the backup will become active when it loses connection
> to
> >>> its live server. This can be problematic because this can also happen
> >>> because of a temporary network problem. In order to address this
> issue, the
> >>> backup will try to determine whether it still can connect to the other
> >>> servers in the cluster. If it can connect to more than half the
> servers, it
> >>> will become active, if more than half the servers also disappeared
> with the
> >>> live, the backup will wait and try reconnecting with the live. This
> avoids
> >>> a split brain situation."
> >>>>
> >>>> Q4: At this point, which nodes are expected to participate in quorum
> >>> voting? All of A, B and C? Or A and C only (B excludes itself from the
> >>> set)? When it says "half the servers”, I read it in a way that B
> includes
> >>> itself in the quorum voting. Is that the case?
> >>>>
> >>>> Whereas in the ‘Avoiding Network Isolation’ section, the manual says
> >>> this:
> >>>>
> >>>> “Quorum voting is used by both the live and the backup to decide what
> to
> >>> do if a replication connection is disconnected. Basically the server
> will
> >>> request each live server in the cluster to vote as to whether it
> thinks the
> >>> server it is replicating to or from is still alive. This being the
> case the
> >>> minimum number of live/backup pairs needed is 3."
> >>>>
> >>>> Q5: This implies only the live servers participate in quorum voting.
> Is
> >>> that correct?
> >>>>
> >>>> Q6: If the answer to Q5 is yes, then how does the split brain
> detection
> >>> (as described in the quoted text right before Q4) work?
> >>>>
> >>>> Q7: The text implies that in order to avoid split brain, a cluster
> needs
> >>> at least 3 live/backup PAIRS. To me that implies at least 6 broker
> >>> instances are needed in such a cluster; but that is kind of hard to
> >>> believe, and I feel (I may be wrong) it actually means 3 broker
> instances,
> >>> assuming scenarios 1 and 2 as described earlier are valid ones. Can you
> >>> please clarify?
> >>>>
> >>>> I would appreciate it if someone could offer clarity on these questions.
> >>>>
> >>>> Thanks,
> >>>> Anindya Haldar
> >>>> Oracle Marketing Cloud
> >>>>
> >>>
> >>>
> >
>
>

Re: Questions on HA cluster and split brain

Posted by Anindya Haldar <an...@oracle.com>.
I have a few quick follow up questions. From the discussion here, and from what I understand reading the Artemis manual, here is my understanding about the idea of a cluster vs. the idea of a group within a cluster:

1) It is possible to define multiple groups within a cluster, and a subset of the brokers in the cluster can be members of a specific group. Is that correct?

2) The live-backup relationship is guided by group membership, when there is explicit group membership defined. Is that correct?

3) When a backup or a live server in a group starts the quorum voting process, other live servers in the cluster, even though they may not be part of the same group, can participate in the quorum. Meaning the ability to participate in quorum voting is defined by cluster membership, and not by group membership within the cluster. Is that understanding correct?

Thanks,

Anindya Haldar
Oracle Marketing Cloud


> On Jun 14, 2018, at 9:57 AM, Anindya Haldar <an...@oracle.com> wrote:
> 
> Many thanks, Justin. This makes things much clearer for us when it comes to designing the HA cluster.
> 
> As for the Artemis evaluation scope, we want to use it as one of the supported messaging backbones in our application suite. The application suite requires strong transactional guarantees, high availability, and high performance and scale, amongst other things. We are looking towards a full blown technology evaluation with those needs in mind.
> 
> Thanks,
> 
> Anindya Haldar
> Oracle Marketing Cloud
> 
> 
>> On Jun 13, 2018, at 7:23 PM, Justin Bertram <jb...@apache.org> wrote:
>> 
>>> Q1: At this point, will the transaction logs replicate from A to C?
>> 
>> No.  A will be replicating to B since B is the designated backup.  Also, by
>> "transaction logs" I assume you mean what the Artemis documentation refers
>> to as the journal (i.e. all persistent message data).
>> 
>>> Q2: At this point, will C become the new backup for B, assuming A
>> remains in failed state?
>> 
>> Yes.
>> 
>>> Q3: If the answer to Q2 is yes, B will start replicating its journals to
>> C; is that correct?
>> 
>> Yes.
>> 
>>> Q4: At this point, which nodes are expected to participate in quorum
>> voting? All of A, B and C? Or A and C only (B excludes itself from the
>> set)? When it says "half the servers”, I read it in a way that B includes
>> itself in the quorum voting. Is that the case?
>> 
>> A would be the only server available to participate in the quorum voting
>> since it is the only live server.  However, since B can't reach A then B
>> would not receive any quorum vote responses.  B doesn't vote; it simply
>> asks for a vote.
>> 
>>> Q5: This implies only the live servers participate in quorum voting. Is
>> that correct?
>> 
>> Yes.
>> 
>>> Q6: If the answer to Q5 is yes, then how does the split brain detection
>> (as described in the quoted text right before Q4) work?
>> 
>> It works by having multiple voting members (i.e. live servers) in the
>> cluster.  The topology you've described with a single live and 2 backups is
>> not sufficient to mitigate against split brain.
>> 
>>> Q7: The text implies that in order to avoid split brain, a cluster needs
>> at least 3 live/backup PAIRS.
>> 
>> That is correct - 3 live/backup pairs.
>> 
>>> To me that implies at least 6 broker instances are needed in such a
>> cluster; but that is kind of hard to believe, and I feel (I may be wrong)
>> it actually means 3 broker instances, assuming scenarios 1 and 2 as
>> described earlier are valid ones. Can you please clarify?
>> 
>> What you feel is incorrect.  That said, the live & backup instances can be
>> colocated which means although there are 6 total broker instances only 3
>> machines are required.
>> 
>> I think implementing a feature whereby backups can participate in the
>> quorum vote would be a great addition to the broker.  Unfortunately I
>> haven't had time to contribute such a feature.
>> 
>> 
>> If I may ask a question of my own...Your emails to this list have piqued my
>> interest and I'm curious to know to what end you are evaluating Artemis
>> since you apparently work for Oracle on a cloud related team and Oracle
>> already has a cloud messaging solution.  Can you elaborate at all?
>> 
>> 
>> Justin
>> 
>> 
>> On Wed, Jun 13, 2018 at 7:56 PM, Anindya Haldar <an...@oracle.com>
>> wrote:
>> 
>>> BTW, these are questions related to Artemis 2.4.0, which is what we are
>>> evaluating right now for our solution.
>>> 
>>> 
>>>> On Jun 13, 2018, at 5:52 PM, Anindya Haldar <an...@oracle.com>
>>> wrote:
>>>> 
>>>> I have some questions related to the HA cluster, failover and
>>> split-brain cases.
>>>> 
>>>> Suppose I have set up a 3 node cluster with:
>>>> 
>>>> A = master
>>>> B = slave 1
>>>> C = slave 2
>>>> 
>>>> Also suppose they are all part of the same group, and are set up to offer
>>> replication based HA.
>>>> 
>>>> Scenario 1
>>>> ========
>>>> Say,
>>>> 
>>>> B starts up and finds A
>>>> B becomes the designated backup for A
>>>> C starts up, and tries to find a live server in this group
>>>> C figures that A already has a designated backup, which is B
>>>> C keeps waiting until the network topology is changed
>>>> 
>>>> 
>>>> Q1: At this point, will the transaction logs replicate from A to C?
>>>> 
>>>> Now let’s say
>>>> 
>>>> Node A (the current master) fails
>>>> B becomes the new master
>>>> 
>>>> Q2: At this point, will C become the new backup for B, assuming A
>>> remains in failed state?
>>>> 
>>>> Q3: If the answer to Q2 is yes, B will start replicating its journals to
>>> C; is that correct?
>>>> 


Re: Questions on HA cluster and split brain

Posted by Anindya Haldar <an...@oracle.com>.
Many thanks, Justin. This makes things much clearer for us when it comes to designing the HA cluster.

As for the Artemis evaluation scope, we want to use it as one of the supported messaging backbones in our application suite. The application suite requires strong transactional guarantees, high availability, and high performance and scale, amongst other things. We are looking towards a full-blown technology evaluation with those needs in mind.

Thanks,

Anindya Haldar
Oracle Marketing Cloud




Re: Questions on HA cluster and split brain

Posted by Justin Bertram <jb...@apache.org>.
> Q1: At this point, will the transaction logs replicate from A to C?

No.  A will be replicating to B since B is the designated backup.  Also, by
"transaction logs" I assume you mean what the Artemis documentation refers
to as the journal (i.e. all persistent message data).

> Q2: At this point will C become the new backup for B, assuming A
remains in a failed state?

Yes.

> Q3: If the answer to Q2 is yes, B will start replicating its journals to
C; is that correct?

Yes.
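
For what it's worth, the hand-off confirmed in Q2/Q3 can be sketched as a toy state table (purely illustrative; the node names and role strings are invented for this sketch and are not Artemis configuration or API):

```python
def failover(topology: dict) -> dict:
    """Toy model of the Scenario 1 hand-off; not Artemis code."""
    new = dict(topology)
    if new.get("A") == "live":
        new["A"] = "failed"       # the live node A goes down
        new["B"] = "live"         # its designated backup B takes over (Q2)
        new["C"] = "backup-of-B"  # C attaches to B, which replicates its journal to C (Q3)
    return new

print(failover({"A": "live", "B": "backup-of-A", "C": "waiting"}))
# → {'A': 'failed', 'B': 'live', 'C': 'backup-of-B'}
```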

> Q4: At this point, which nodes are expected to participate in quorum
voting? All of A, B and C? Or A and C only (B excludes itself from the
set)? When it says "half the servers", I read it in a way that B includes
itself in the quorum voting. Is that the case?

A would be the only server available to participate in the quorum voting
since it is the only live server.  However, since B can't reach A then B
would not receive any quorum vote responses.  B doesn't vote; it simply
asks for a vote.

> Q5: This implies only the live servers participate in quorum voting. Is
that correct?

Yes.

> Q6: If the answer to Q5 is yes, then how does the split brain detection
(as described in the quoted text right before Q4) work?

It works by having multiple voting members (i.e. live servers) in the
cluster.  The topology you've described, with a single live and 2 backups,
is not sufficient to mitigate split brain.
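
The majority rule quoted from the manual can be sketched in a few lines (an illustrative model only, not Artemis code; the function name and parameters are invented here):

```python
def backup_should_activate(reachable_lives: int, total_lives: int) -> bool:
    """Per the quoted manual text: a backup that loses its live becomes
    active only if it can still reach more than half of the servers."""
    return reachable_lives > total_lives / 2

# Scenario 1 topology: one live (A) plus two backups (B, C).  When B
# loses A there are no other lives left to vote, so no majority:
print(backup_should_activate(reachable_lives=0, total_lives=1))  # False

# With 3 live/backup pairs, a backup that loses its own live can still
# reach the other 2 of 3 lives, which is a majority, so it activates:
print(backup_should_activate(reachable_lives=2, total_lives=3))  # True
```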

> Q7: The text implies that in order to avoid split brain, a cluster needs
at least 3 live/backup PAIRS.

That is correct - 3 live/backup pairs.

> To me that implies at least 6 broker instances are needed in such a
cluster; but that is kind of hard to believe, and I feel (I may be wrong)
it actually means 3 broker instances, assuming scenarios 1 and 2 as
described earlier are valid ones. Can you please clarify?

What you feel is incorrect.  That said, the live & backup instances can be
colocated, which means that although there are 6 total broker instances,
only 3 machines are required.

I think implementing a feature whereby backups can participate in the
quorum vote would be a great addition to the broker.  Unfortunately I
haven't had time to contribute such a feature.


If I may ask a question of my own...Your emails to this list have piqued my
interest and I'm curious to know to what end you are evaluating Artemis
since you apparently work for Oracle on a cloud related team and Oracle
already has a cloud messaging solution.  Can you elaborate at all?


Justin



Re: Questions on HA cluster and split brain

Posted by Anindya Haldar <an...@oracle.com>.
BTW, these are questions related to Artemis 2.4.0, which is what we are evaluating right now for our solution.


> On Jun 13, 2018, at 5:52 PM, Anindya Haldar <an...@oracle.com> wrote:
> 
> I have some questions related to the HA cluster, failover and split-brain cases.
> 
> Suppose I have set up a 3 node cluster with:
> 
> A = master
> B = slave 1
> C = slave 2
> 
> Also suppose they are all part of the same group, and are set up to offer replication-based HA.
> 
> Scenario 1
> ========
> Say,
> 
> B starts up and finds A
> B becomes the designated backup for A
> C starts up, and tries to find a live server in this group
> C figures that A already has a designated backup, which is B
> C keeps waiting until the network topology is changed
> 
> 
> Q1: At this point, will the transaction logs replicate from A to C?
> 
> Now let’s say
> 
> Node A (the current master) fails
> B becomes the new master
> 
> Q2: At this point will C become the new backup for B, assuming A remains in a failed state?
> 
> Q3: If the answer to Q2 is yes, B will start replicating its journals to C; is that correct?
> 
> 
> Scenario 2 (split brain detection case)
> =============================
> Say,
> 
> B detects a transient network failure with A
> B wants to figure out if it needs to take over and be the new master
> B starts a quorum voting process
> 
> The manual says this in the ‘High Availability and Failover’ section: 
> 
> "Specifically, the backup will become active when it loses connection to its live server. This can be problematic because this can also happen because of a temporary network problem. In order to address this issue, the backup will try to determine whether it still can connect to the other servers in the cluster. If it can connect to more than half the servers, it will become active, if more than half the servers also disappeared with the live, the backup will wait and try reconnecting with the live. This avoids a split brain situation."
> 
> Q4: At this point, which nodes are expected to participate in quorum voting? All of A, B and C? Or A and C only (B excludes itself from the set)? When it says "half the servers", I read it in a way that B includes itself in the quorum voting. Is that the case?
> 
> Whereas in the ‘Avoiding Network Isolation’ section, the manual says this:
> 
> “Quorum voting is used by both the live and the backup to decide what to do if a replication connection is disconnected. Basically the server will request each live server in the cluster to vote as to whether it thinks the server it is replicating to or from is still alive. This being the case the minimum number of live/backup pairs needed is 3."
> 
> Q5: This implies only the live servers participate in quorum voting. Is that correct?
> 
> Q6: If the answer to Q5 is yes, then how does the split brain detection (as described in the quoted text right before Q4) work?
> 
> Q7: The text implies that in order to avoid split brain, a cluster needs at least 3 live/backup PAIRS. To me that implies at least 6 broker instances are needed in such a cluster; but that is kind of hard to believe, and I feel (I may be wrong) it actually means 3 broker instances, assuming scenarios 1 and 2 as described earlier are valid ones. Can you please clarify?
> 
> I would appreciate it if someone could offer clarity on these questions.
> 
> Thanks,
> Anindya Haldar
> Oracle Marketing Cloud
>