Posted to users@qpid.apache.org by jbelch <ja...@verizon.net> on 2013/08/01 23:51:43 UTC

Qpid Java Broker - Master Failure and Recovery

In your High Availability documentation for the Java Broker, section "1.6.3.2
- Depictions of cluster operation" has a section called "Master Failure and
Recovery" which describes the sequence of events for a master failover and
the replica taking over the master's role.  One of the items states the
following:

3.  A third-party (an operator, a script or a combination of the two)
verifies that the master has truly failed and is no longer running. If it
has truly failed, the decision is made to designate the replica as primary,
allowing it to assume the role of master despite the other node being down.
This primary designation is performed using JMX.

I am able to get the BDBHAMessageStore bean and set the "DesignatedPrimary"
attribute to true for the replica server in the case the master fails.  What
else do I need to do for the replica to take over as master?  Do I need to
set anything else?  Do I need to cycle the replica or will it take over the
master role simply by setting attributes to tell it that it is assuming the
master role?

 




--
View this message in context: http://qpid.2158936.n2.nabble.com/Qpid-Java-Broker-Master-Failure-and-Recovery-tp7596360.html
Sent from the Apache Qpid users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: users-unsubscribe@qpid.apache.org
For additional commands, e-mail: users-help@qpid.apache.org


Re: Qpid Java Broker - Master Failure and Recovery

Posted by Robbie Gemmell <ro...@gmail.com>.
The helper node details in the XML are only used when initialising the
store for the first time, i.e. when creating the cluster for the first node
(by specifying itself as the helper), or joining the pre-existing cluster
(by specifying an existing node as the helper) for the rest. Once you
initialise BDB HA, the cluster information is stored *inside* the log files
it creates, and if these files still exist when the node is restarted then
that embedded cluster information is used to identify where the members of
the cluster are and communicate with them without the use of a specific
helper, thus ignoring the helper node information passed via the XML.

If you should happen to delete the store files for the node that created
the cluster (i.e. the very original master, which has XML
configuration pointing at itself, call it NodeA) and then want to restart
that node and have it rejoin the cluster (which still exists because the
other node, call it NodeB, has retained its store files) then you would
need to update the configuration of NodeA to supply the details of an
existing node as the helper, i.e. NodeB. On the other hand, if NodeA
retained its files but you deleted the NodeB files, when you restarted NodeB
you wouldn't need to edit its configuration since it already points at an
existing node of the cluster, NodeA.
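The rule described above can be sketched as a small decision function. This is only an illustrative model in plain Java, not Qpid or BDB code: the class and method names, and the host:port values, are hypothetical, and it assumes the cluster still exists on the other node.

```java
class HelperConfigSketch {

    /** Where a starting node gets its cluster membership details from. */
    enum Source { STORE_FILES, XML_HELPER }

    /** Existing store files carry the cluster information; otherwise the XML helper is used. */
    static Source membershipSource(boolean storeFilesExist) {
        return storeFilesExist ? Source.STORE_FILES : Source.XML_HELPER;
    }

    /**
     * Must the XML helper be edited before this node can rejoin a cluster
     * that still exists on the other node? (Assumption: it does still exist.)
     */
    static boolean mustEditHelper(String ownHostPort, String xmlHelper, boolean storeFilesExist) {
        if (membershipSource(storeFilesExist) == Source.STORE_FILES) {
            return false; // the XML helper is ignored, so there is nothing to change
        }
        // Without store files the helper is used. A node naming itself as helper
        // is the "creating the cluster" case, so it has to point at the other node instead.
        return xmlHelper.equals(ownHostPort);
    }

    public static void main(String[] args) {
        String nodeA = "hostA:5001"; // hypothetical addresses
        String nodeB = "hostB:5001";
        System.out.println("NodeA lost its files, helper=NodeA: " + mustEditHelper(nodeA, nodeA, false));
        System.out.println("NodeB lost its files, helper=NodeA: " + mustEditHelper(nodeB, nodeA, false));
        System.out.println("NodeA kept its files, helper=NodeA: " + mustEditHelper(nodeA, nodeA, true));
    }
}
```

Only the first case, the cluster's original creator losing its store files, calls for a configuration change.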

For more information, see the first of the pages on the Oracle site that I
linked to earlier:
http://docs.oracle.com/cd/E17277_02/html/ReplicationGuide/lifecycle.html#lifecycle-new

Robbie

On 6 August 2013 21:11, jbelch <ja...@verizon.net> wrote:

> But how does the old "MASTER" that failed know what port to connect to as
> the HelperHostPort?  Is the new HelperHostPort the new "MASTER":5001 even
> though the jconsole still displays it as the old "MASTER"?

Re: Qpid Java Broker - Master Failure and Recovery

Posted by jbelch <ja...@verizon.net>.
But how does the old "MASTER" that failed know what port to connect to as the
HelperHostPort?  Is the new HelperHostPort the new "MASTER":5001 even though
the jconsole still displays it as the old "MASTER"?





Re: Qpid Java Broker - Master Failure and Recovery

Posted by Robbie Gemmell <ro...@gmail.com>.
The host:port information in the XML configuration is only used when first
joining or creating the cluster. After that, BDB stores the cluster
information itself in the store files and references that.

You should only need to update the configuration if you blow away the store
files for the node that originally created the cluster (by virtue of having
itself as helper in the XML) and then want to bring it back up and have
it rejoin the existing cluster by restoring from the other node (since you
would now instead need to set that node as the helper).

Robbie

On 6 August 2013 20:43, jbelch <ja...@verizon.net> wrote:

> Thanks for the detailed response.  In the case where my master node fails,
> I can set the DesignatedPrimary flag to true in the Replica node.  I see the
> NodeState transition to "MASTER".   In the original configuration, the
> replica node was pointing to the master host's 5001 port for the
> HelperHostPort.  If that machine is no longer there, do we need to update
> the new "MASTER"?  The new "MASTER" still has its helper port set to the
> original master node, which may not be available, and it doesn't look like
> it updates to a local port.

Re: Qpid Java Broker - Master Failure and Recovery

Posted by jbelch <ja...@verizon.net>.
Thanks for the detailed response.  In the case where my master node fails, I
can set the DesignatedPrimary flag to true in the Replica node.  I see the
NodeState transition to "MASTER".   In the original configuration, the
replica node was pointing to the master host's 5001 port for the
HelperHostPort.  If that machine is no longer there, do we need to update
the new "MASTER"?  The new "MASTER" still has its helper port set to the
original master node, which may not be available, and it doesn't look like
it updates to a local port.





Re: Qpid Java Broker - Master Failure and Recovery

Posted by Robbie Gemmell <ro...@gmail.com>.
Hi James,

I thought you might say that, that's where I was going with that
half-finished sentence originally. I think you have been a bit confused
about the purpose and effect of setting the designatedPrimary attribute.

The designatedPrimary flag only affects the behaviour of a node when it is
the only operational node in a 2-node cluster. It doesn't influence the
nodes' behaviour when both are operational and can't be used to pick the
master in that case, though it does determine whether a node can elect
itself master when operating by itself. Being designatedPrimary allows a
node to override the need for both nodes to be present to conduct an
election, and also the need for the other node to reply with an
acknowledgement when performing commits etc. (if the defined replication
policy normally requires this). This could allow a master to continue
operating if its replica fails, or alternatively allow a former replica to
elect itself master without interaction from the node which has failed.
Without use of the designatedPrimary attribute, these behaviours are
disallowed to ensure that both nodes can't assume themselves to be master at
the same time and actually be able to persist changes to the database, i.e.
a split-brain scenario.
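To make the rule above concrete, here is a deliberately simplified model in plain Java. It is not BDB JE's election code; the class, enum and method names are hypothetical, and it only captures the two inputs discussed here (whether the other node is reachable, and whether this node is designatedPrimary).

```java
class TwoNodeElectionSketch {

    /** What a node in a 2-node cluster can do, in this simplified model. */
    enum Outcome { NORMAL_ELECTION, ACTS_ALONE_AS_MASTER, NO_MASTER }

    static Outcome outcome(boolean peerReachable, boolean designatedPrimary) {
        if (peerReachable) {
            // Both nodes are operational: an ordinary election decides the master
            // and the designatedPrimary flag plays no part in picking it.
            return Outcome.NORMAL_ELECTION;
        }
        // Operating alone: only a designated primary may elect itself and commit
        // without the other node's acknowledgement.
        return designatedPrimary ? Outcome.ACTS_ALONE_AS_MASTER : Outcome.NO_MASTER;
    }

    public static void main(String[] args) {
        boolean[] values = {true, false};
        for (boolean peerReachable : values) {
            for (boolean designatedPrimary : values) {
                System.out.println("peerReachable=" + peerReachable
                        + " designatedPrimary=" + designatedPrimary
                        + " -> " + outcome(peerReachable, designatedPrimary));
            }
        }
    }
}
```

In this model, toggling the flag while both nodes are running changes nothing, which matches the behaviour reported earlier in the thread.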

You can find more information about how all of this works in our
documentation and/or the BDB JE documentation:
http://qpid.apache.org/releases/qpid-0.22/java-broker/book/Java-Broker-High-Availability.html
http://docs.oracle.com/cd/E17277_02/html/ReplicationGuide/lifecycle.html#twonode
http://docs.oracle.com/cd/E17277_02/html/ReplicationGuide/two-node.html

As you noted in a subsequent email, there is functionality in BDB's
DbGroupAdmin utility to transfer the master role. This didn't exist at the
time the prior work on this area was done, so we haven't yet tested it
or exposed that functionality through the JMX MBean for the store, but I
imagine it would be something that gets looked at as part of the work
toward supporting N-node clusters.

Robbie

On 6 August 2013 18:08, jbelch <ja...@verizon.net> wrote:

> Both nodes in the cluster are running.  First node comes up as Master,
> second node comes up as Replica.  I was trying to swap the roles with both
> nodes running.  I thought I would be able to tell the master to become the
> replica and replica to become the master.  In our production environment, we
> will need to detect the master failure and tell the replica to become the
> master.

Re: Qpid Java Broker - Master Failure and Recovery

Posted by jbelch <ja...@verizon.net>.
Both nodes in the cluster are running.  First node comes up as Master, second
node comes up as Replica.  I was trying to swap the roles with both nodes
running.  I thought I would be able to tell the master to become the replica
and replica to become the master.  In our production environment, we will
need to detect the master failure and tell the replica to become the master.
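One possible building block for detecting the failure is a repeated TCP connection attempt against a port the master listens on. The sketch below is plain Java with hypothetical class and method names, and which port to probe is left to the caller. Note the limitation: an unreachable port does not prove the master process has stopped (a network fault looks the same), so a check like this should only feed into the operator's or script's decision, not replace verifying that the master is no longer running.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;

class MasterProbeSketch {

    /** True if a TCP connection to host:port succeeds within the timeout. */
    static boolean acceptsConnections(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    /** True only if every one of the attempts fails; a single success means "not down". */
    static boolean looksDown(String host, int port, int attempts, int timeoutMillis) {
        for (int i = 0; i < attempts; i++) {
            if (acceptsConnections(host, port, timeoutMillis)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        int port;
        // A local listener on an ephemeral port stands in for the master.
        try (ServerSocket listener = new ServerSocket(0)) {
            port = listener.getLocalPort();
            System.out.println("while listening: looksDown = " + looksDown("127.0.0.1", port, 3, 500));
        }
        System.out.println("after close:     looksDown = " + looksDown("127.0.0.1", port, 3, 500));
    }
}
```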





Re: Qpid Java Broker - Master Failure and Recovery

Posted by jbelch <ja...@verizon.net>.
There is a DbGroupAdmin class that comes with BerkeleyDB that allows for a
master transfer.  Should we use this instead of trying to set attributes on
the BDBHAMessageStore bean?





Re: Qpid Java Broker - Master Failure and Recovery

Posted by jbelch <ja...@verizon.net>.
We are using Java Broker 0.20.  Would there be any reason it would not work in
0.20 but may work in 0.22?





Re: Qpid Java Broker - Master Failure and Recovery

Posted by Robbie Gemmell <ro...@gmail.com>.
You can ignore that half-finished sentence; I meant to remove it once I
decided to ask for more info :)

On 6 August 2013 17:57, Robbie Gemmell <ro...@gmail.com> wrote:

> Hi James,
>
> Can you elaborate on the full state of the cluster (e.g. both nodes
> running, second node only running because first node went down while it was
> the master, what the designatedPrimary settings were etc.) and also how it
> got into that state (e.g. killed the first node while it was master, pulled
> the network cable, etc.). The behaviour can be fairly subtle at times
> and the below doesn't really provide enough information to fully reason
> about.
>
> Robbie
>
> When you said below that you tried 'un designating' the
>
> On 6 August 2013 17:43, jbelch <ja...@verizon.net> wrote:
>
>> Keith,
>>
>>   I can successfully set the DesignatedPrimary attribute to true on the
>> passive node and nothing changes.  It still thinks the original master is
>> the master.  The NodeState attribute never changes when I toggle the
>> DesignatedPrimary back and forth.  I also attempted to set the
>> DesignatedPrimary attribute to false on the active node and I don't see a
>> change, either.  What am I missing?  The NodeState does not change.  I
>> attempt to "removeNodeFromGroup" from my Jconsole and it even disallows
>> removing the original master because it still thinks it's the master node.
>> Any thoughts?
>>
>> James
>

Re: Qpid Java Broker - Master Failure and Recovery

Posted by Robbie Gemmell <ro...@gmail.com>.
Hi James,

Can you elaborate on the full state of the cluster (e.g. both nodes running,
second node only running because first node went down while it was the
master, what the designatedPrimary settings were etc.) and also how it got
into that state (e.g. killed the first node while it was master, pulled the
network cable, etc.). The behaviour can be fairly subtle at times and
the below doesn't really provide enough information to fully reason about.

Robbie

When you said below that you tried 'un designating' the

On 6 August 2013 17:43, jbelch <ja...@verizon.net> wrote:

> Keith,
>
>   I can successfully set the DesignatedPrimary attribute to true on the
> passive node and nothing changes.  It still thinks the original master is
> the master.  The NodeState attribute never changes when I toggle the
> DesignatedPrimary back and forth.  I also attempted to set the
> DesignatedPrimary attribute to false on the active node and I don't see a
> change, either.  What am I missing?  The NodeState does not change.  I
> attempt to "removeNodeFromGroup" from my Jconsole and it even disallows
> removing the original master because it still thinks it's the master node.
> Any thoughts?
>
> James

Re: Qpid Java Broker - Master Failure and Recovery

Posted by jbelch <ja...@verizon.net>.
Keith,

  I can successfully set the DesignatedPrimary attribute to true on the
passive node and nothing changes.  It still thinks the original master is
the master.  The NodeState attribute never changes when I toggle the
DesignatedPrimary back and forth.  I also attempted to set the
DesignatedPrimary attribute to false on the active node and I don't see a
change, either.  What am I missing?  The NodeState does not change.  I
attempted to "removeNodeFromGroup" from my JConsole and it even disallows
removing the original master because it still thinks it's the master node.
Any thoughts?

James





Re: Qpid Java Broker - Master Failure and Recovery

Posted by Keith W <ke...@gmail.com>.
Hi James

> I am able to get the BDBHAMessageStore bean and set the "DesignatedPrimary"
> attribute to true for the replica server in the case the master fails.  What
> else do I need to do for the replica to take over as master.  Do I need to
> set anything else?  Do I need to cycle the replica or will it take over the
> master role simply by setting attributes to tell it that it is assuming the
> master role?

Setting designatedPrimary to true using the MBean is all that is required.
You will see the node transition to the MASTER role and at that point
transactions will begin to commit successfully at that node. Keep in mind
that the change to the designatedPrimary flag is *not* persisted to the
config, so if the node were to be bounced, the primary designation would
no longer apply and it would not re-take the master role.
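For anyone scripting this step rather than using JConsole, the following is a minimal Java sketch of the JMX calls involved. Only the attribute names DesignatedPrimary and NodeState come from this thread. The class names, the stand-in bean and its ObjectName are hypothetical: the stand-in simply reports UNKNOWN until it is designated and MASTER afterwards, so that the example can run without a broker.

```java
import java.lang.management.ManagementFactory;
import javax.management.Attribute;
import javax.management.MBeanServer;
import javax.management.MBeanServerConnection;
import javax.management.ObjectName;

class DesignatePrimarySketch {

    /** Stand-in management interface with the two attributes named in this thread. */
    public interface StandInStoreMBean {
        boolean getDesignatedPrimary();
        void setDesignatedPrimary(boolean designatedPrimary);
        String getNodeState();
    }

    /** Stand-in for a lone replica (a simplification, not broker behaviour). */
    public static class StandInStore implements StandInStoreMBean {
        private volatile boolean designatedPrimary;

        @Override
        public boolean getDesignatedPrimary() { return designatedPrimary; }

        @Override
        public void setDesignatedPrimary(boolean designatedPrimary) {
            this.designatedPrimary = designatedPrimary;
        }

        @Override
        public String getNodeState() { return designatedPrimary ? "MASTER" : "UNKNOWN"; }
    }

    /** Sets DesignatedPrimary to true on the given bean and returns the NodeState read back. */
    static String designatePrimary(MBeanServerConnection connection, ObjectName store) throws Exception {
        connection.setAttribute(store, new Attribute("DesignatedPrimary", Boolean.TRUE));
        return String.valueOf(connection.getAttribute(store, "NodeState"));
    }

    public static void main(String[] args) throws Exception {
        // In-process demonstration against the stand-in bean.
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName name = new ObjectName("example.sketch:type=BDBHAMessageStore,name=standin");
        server.registerMBean(new StandInStore(), name);
        System.out.println("NodeState after designation: " + designatePrimary(server, name));
    }
}
```

Against a real broker, the same helper would be passed a connection obtained with JMXConnectorFactory.connect(new JMXServiceURL(url)).getMBeanServerConnection(), using the broker's own JMX service URL and the ObjectName that JConsole shows for the BDBHAMessageStore bean; neither value is given in this thread. A real node may also take a moment to change state, so a script would likely need to poll NodeState rather than read it once.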

Once you are ready to restore the master to operation, you currently need
to cycle both servers to get the system back to the business-as-usual
state.  There is a piece of work underway to improve the HA integration, so
this should become more elegant in a future release.

Hope this helps, Keith.







On 1 August 2013 22:51, jbelch <ja...@verizon.net> wrote:

> In your High Availability documentation for the Java Broker, section
> "1.6.3.2 - Depictions of cluster operation" has a section called "Master
> Failure and Recovery" which describes the sequence of events for a master
> failover and the replica taking over the master's role.  One of the items
> states the following:
>
> 3.  A third-party (an operator, a script or a combination of the two)
> verifies that the master has truly failed and is no longer running. If it
> has truly failed, the decision is made to designate the replica as
> primary, allowing it to assume the role of master despite the other node
> being down. This primary designation is performed using JMX.
>
> I am able to get the BDBHAMessageStore bean and set the "DesignatedPrimary"
> attribute to true for the replica server in the case the master fails.
> What else do I need to do for the replica to take over as master?  Do I
> need to set anything else?  Do I need to cycle the replica or will it take
> over the master role simply by setting attributes to tell it that it is
> assuming the master role?