Posted to users@qpid.apache.org by Sandy Pratt <pr...@adobe.com> on 2009/03/05 02:23:56 UTC

A few clustering/failover questions

I've been experimenting with failover on a cluster of two brokers, and I often see this log item when a broker fails:

2009-mar-04 17:11:19 debug Exception constructed: Attempted size underflow on dequeue(21): size: max=104857600, current=0; count: unlimited; type=flow_to_disk (qpid/broker/QueuePolicy.cpp:54)

What does underflow mean here?

The broker seems to have died:

[prattrs@hsvrhm5 qpidd]$ sudo /sbin/service qpidd status
qpidd dead but pid file exists
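
A status line like the one above can be checked from a script. This is a minimal sketch: only the "dead but pid file exists" wording comes from the output shown here, while the "is running" and "is stopped" patterns and the name classify_status are assumptions for illustration.

```python
def classify_status(output: str) -> str:
    """Map the text printed by `service qpidd status` to a coarse state."""
    text = output.strip().lower()
    if "dead but pid file exists" in text:
        # Seen above: the process is gone but its pid file was left behind.
        return "crashed"
    if "is running" in text:   # assumed wording for a healthy broker
        return "running"
    if "is stopped" in text:   # assumed wording for a clean shutdown
        return "stopped"
    return "unknown"
```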

The test I was running failed over to the other broker and completed after a timeout expired.

A subsequent test immediately failed over to the other broker and completed (which makes sense because qpid on the first broker was probably dead before it started).

In a general sense, what are the steps required to recover from a broker failure?  What I am looking for is step #3 below:

Assume a cluster of two brokers, A and B
1) A dies
2) clients fail over to B
3) do something to recover A without interrupting clients of B
4) A and B are again interchangeable
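
Step 2, seen from the client side, amounts to trying each broker address in turn. The sketch below only illustrates that idea: connect_with_failover, the connect callable and the addresses are hypothetical and are not part of any Qpid client API.

```python
from typing import Callable, Iterable, Tuple, TypeVar

Conn = TypeVar("Conn")
Address = Tuple[str, int]


def connect_with_failover(
    brokers: Iterable[Address],
    connect: Callable[[Address], Conn],
) -> Tuple[Address, Conn]:
    """Return the first broker that accepts a connection, trying them in order."""
    last_error = None
    for address in brokers:
        try:
            return address, connect(address)
        except OSError as exc:  # refused, timed out, unreachable, ...
            last_error = exc
    raise ConnectionError(f"no broker reachable: {last_error}")
```

With brokers listed as A then B, a dead A costs one failed attempt before the client lands on B.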

I've looked through the docs and haven't seen anything about this.  Apologies if I missed it.  I also tried simply restarting A, which doesn't seem to work.

Thanks,

Sandy

Re: A few clustering/failover questions

Posted by Carl Trieloff <cc...@redhat.com>.
Sandy Pratt wrote:
> [snip]
>
>   
>> It's that simple. The restarted node will be re-synced to the active
>> state of node B, and you can continue on. The clients (Java & C++)
>> will also be notified of cluster-membership changes, so even if node B
>> is brought back on a different IP address the client will know where
>> to fail over to.
>>
>>     
>
> Correction,  no rename of the jrnl dir is required, it will do that for 
> you automatically.... just restart node A with the same cluster-name as 
> node B.
>
> It will do the rest for you...
> Carl.
>
>
> ..
>
> Thanks for the clarification, Carl.  I wonder if my broker is in a bad state.  I'll re-initialize the installation and see what happens.  As to that, do you recommend clearing out /var/lib/qpidd to re-init?
>
> Thanks,
>
> Sandy
>   

Yes, it is best to clear the data directories.

It works as follows: only one store is needed to recover, so if all
the nodes in the cluster are killed, the best practice is to find the
store with the latest time stamp (from the last node to go down) and
then rename, move or delete the rest. Then restart the node with the
store to recover from first.

There are thoughts of doing this automatically, but that could do the
wrong thing if the machines don't have their clocks synced, so today it
is left to the user.
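
The manual selection described above could be sketched as follows. This assumes that "time stamp" means file modification time and that all candidate store directories are visible from one machine (for example copied or mounted there); newest_store and its argument are hypothetical names. As noted above, the result is only as trustworthy as the clocks of the machines that wrote the files.

```python
import os
from typing import Iterable


def newest_store(store_dirs: Iterable[str]) -> str:
    """Return the directory whose contents were modified most recently."""

    def latest_mtime(path: str) -> float:
        # Start with the directory itself, then look at every file below it.
        newest = os.path.getmtime(path)
        for root, _dirs, files in os.walk(path):
            for name in files:
                newest = max(newest, os.path.getmtime(os.path.join(root, name)))
        return newest

    return max(store_dirs, key=latest_mtime)
```

The other directories would then be renamed, moved or deleted by hand before restarting the node that owns the returned store.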

For a clean start, just delete all the data under /var/lib/qpidd.
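
A clean start along those lines might look like this sketch. It assumes the broker has been stopped first and that the script has permission to modify the directory; clear_data_dir is a hypothetical helper, and the only path taken from this thread is /var/lib/qpidd.

```python
import os
import shutil


def clear_data_dir(data_dir: str) -> int:
    """Delete everything inside data_dir but keep the directory itself."""
    removed = 0
    # Snapshot the listing first so deletions do not disturb the iteration.
    for entry in list(os.scandir(data_dir)):
        if entry.is_dir(follow_symlinks=False):
            shutil.rmtree(entry.path)
        else:
            os.remove(entry.path)
        removed += 1
    return removed
```

Usage would be clear_data_dir("/var/lib/qpidd"); the return value is the number of top-level entries that were removed.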

Carl.


RE: A few clustering/failover questions

Posted by Sandy Pratt <pr...@adobe.com>.
[snip]

> It's that simple. The restarted node will be re-synced to the active
> state of node B, and you can continue on. The clients (Java & C++)
> will also be notified of cluster-membership changes, so even if node B
> is brought back on a different IP address the client will know where
> to fail over to.
>

Correction,  no rename of the jrnl dir is required, it will do that for 
you automatically.... just restart node A with the same cluster-name as 
node B.

It will do the rest for you...
Carl.


..

Thanks for the clarification, Carl.  I wonder if my broker is in a bad state.  I'll re-initialize the installation and see what happens.  As to that, do you recommend clearing out /var/lib/qpidd to re-init?

Thanks,

Sandy

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org


Re: A few clustering/failover questions

Posted by Alan Conway <ac...@redhat.com>.
Sandy Pratt wrote:
> I was attempting to rejoin the broker to the cluster, and found some errors in the logs.  This snippet below seems to be where it starts:
> 
> 2009-mar-09 10:53:47 trace 10.59.174.186:15159(DUMPEE) RECV 10.59.174.186:15159-0x8144690(local,catchup): Frame[BEbe; channel=1; content (21 bytes) 55khagadlc-8yrlp...]
> 2009-mar-09 10:53:47 debug Exception constructed: Unexpected command start frame. (qpid/SessionState.cpp:57)
> 2009-mar-09 10:53:47 error Connection exception: framing-error: Unexpected command start frame. (qpid/SessionState.cpp:57)
> 2009-mar-09 10:53:47 error Connection 10.59.174.211:49354 closed by error: Unexpected command start frame. (qpid/SessionState.cpp:57)(501)
> 2009-mar-09 10:53:47 trace 10.59.174.186:15159(DUMPEE) RECV 10.59.174.186:15159-0x8144690(local,catchup): Frame[BEbe; channel=1; {ClusterConnectionQueuePositionBody: queue=test.queue; position=19; }]
> 2009-mar-09 10:53:47 trace 10.59.174.186:15159(DUMPEE) RECV 10.59.174.186:15159-0x8144690(local,catchup): Frame[Bbe; channel=1; {MessageTransferBody: destination=\x00qpid-dump\x00; accept-mode=1; acquire-mode=0; }]
> 2009-mar-09 10:53:47 debug Exception constructed: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:67)
> 2009-mar-09 10:53:47 error Channel exception: not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:67)
> 2009-mar-09 10:53:47 trace 10.59.174.186:15159(DUMPEE) RECV 10.59.174.186:15159-0x8144690(local,catchup): Frame[be; channel=1; header (99 bytes); properties={{MessageProperties: content-length=0; message-id=86fe821e-9123-3ef0-bb79-3ded7e79c767; content-type=text/plain; user-id=guest; }{DeliveryProperties: redelivered=1; priority=4; delivery-mode=2; timestamp=1236621013709; expiration=0; exchange=test.direct; routing-key=test.queue; }}]
> 2009-mar-09 10:53:47 debug Exception constructed: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:67)
> 
> 
> The unexpected command start frame seems to be where it started.  Before that I see a bunch of RECV lines and afterwards a bunch of "Channel 1 is not attached" errors.
> 
> Any idea what this means?

This was a bug; it has been fixed on the trunk. It was a problem with
replication to a new broker joining the cluster. Can you try a build from trunk?
If you still see problems like this, I'd like to know.



RE: A few clustering/failover questions

Posted by Sandy Pratt <pr...@adobe.com>.
I was attempting to rejoin the broker to the cluster, and found some errors in the logs.  This snippet below seems to be where it starts:

2009-mar-09 10:53:47 trace 10.59.174.186:15159(DUMPEE) RECV 10.59.174.186:15159-0x8144690(local,catchup): Frame[BEbe; channel=1; content (21 bytes) 55khagadlc-8yrlp...]
2009-mar-09 10:53:47 debug Exception constructed: Unexpected command start frame. (qpid/SessionState.cpp:57)
2009-mar-09 10:53:47 error Connection exception: framing-error: Unexpected command start frame. (qpid/SessionState.cpp:57)
2009-mar-09 10:53:47 error Connection 10.59.174.211:49354 closed by error: Unexpected command start frame. (qpid/SessionState.cpp:57)(501)
2009-mar-09 10:53:47 trace 10.59.174.186:15159(DUMPEE) RECV 10.59.174.186:15159-0x8144690(local,catchup): Frame[BEbe; channel=1; {ClusterConnectionQueuePositionBody: queue=test.queue; position=19; }]
2009-mar-09 10:53:47 trace 10.59.174.186:15159(DUMPEE) RECV 10.59.174.186:15159-0x8144690(local,catchup): Frame[Bbe; channel=1; {MessageTransferBody: destination=\x00qpid-dump\x00; accept-mode=1; acquire-mode=0; }]
2009-mar-09 10:53:47 debug Exception constructed: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:67)
2009-mar-09 10:53:47 error Channel exception: not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:67)
2009-mar-09 10:53:47 trace 10.59.174.186:15159(DUMPEE) RECV 10.59.174.186:15159-0x8144690(local,catchup): Frame[be; channel=1; header (99 bytes); properties={{MessageProperties: content-length=0; message-id=86fe821e-9123-3ef0-bb79-3ded7e79c767; content-type=text/plain; user-id=guest; }{DeliveryProperties: redelivered=1; priority=4; delivery-mode=2; timestamp=1236621013709; expiration=0; exchange=test.direct; routing-key=test.queue; }}]
2009-mar-09 10:53:47 debug Exception constructed: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:67)


The unexpected command start frame seems to be where it started.  Before that I see a bunch of RECV lines and afterwards a bunch of "Channel 1 is not attached" errors.

Any idea what this means?
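
For longer logs, the point where things first go wrong can be located mechanically. The sketch below assumes each entry sits on one line in the "date time level message" layout seen in this snippet; that layout is inferred from these lines only, and first_error and LOG_LINE are hypothetical names.

```python
import re
from typing import Iterable, Optional, Tuple

# e.g. "2009-mar-09 10:53:47 error Connection exception: ..."
LOG_LINE = re.compile(r"^(\d{4}-[a-z]{3}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")


def first_error(lines: Iterable[str]) -> Optional[Tuple[str, str]]:
    """Return (timestamp, message) of the first line logged at level 'error'."""
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if match and match.group(2) == "error":
            return match.group(1), match.group(3)
    return None
```

Lines that do not match the pattern, such as wrapped continuations, are skipped.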



Thanks,

Sandy




-----Original Message-----
From: Carl Trieloff [mailto:cctrieloff@redhat.com] 
Sent: Wednesday, March 04, 2009 7:16 PM
To: users@qpid.apache.org
Cc: Sandy Pratt
Subject: Re: A few clustering/failover questions

Carl Trieloff wrote:
>
>>
>> Assume a cluster of two brokers, A and B
>> 1) A dies
>> 2) clients fail over to B
>> 3) do something to recover A without interrupting clients of B
>> 4) A and B are again interchangeable
>>
>> I've looked through the docs and haven't seen anything about this.  
>> Apologies if I missed it.  I also tried simply restarting A, which 
>> doesn't seem to work.
>>   
>
> Sandy,
>
> qpidd supports hot joining of a member back into the cluster, so
> step 3 is simply:
>
> - rename or move the jrnl dir if you are running a durable store (if 
> no store skip this step)
> - restart node A with the same cluster-name as node B.
>
> It's that simple. The restarted node will be re-synced to the active
> state of node B, and you can continue on. The clients (Java & C++)
> will also be notified of cluster-membership changes, so even if node B
> is brought back on a different IP address the client will know where
> to fail over to.
>

Correction,  no rename of the jrnl dir is required, it will do that for 
you automatically.... just restart node A with the same cluster-name as 
node B.

It will do the rest for you...
Carl.




Re: A few clustering/failover questions

Posted by Carl Trieloff <cc...@redhat.com>.
Carl Trieloff wrote:
>
>>
>> Assume a cluster of two brokers, A and B
>> 1) A dies
>> 2) clients fail over to B
>> 3) do something to recover A without interrupting clients of B
>> 4) A and B are again interchangeable
>>
>> I've looked through the docs and haven't seen anything about this.  
>> Apologies if I missed it.  I also tried simply restarting A, which 
>> doesn't seem to work.
>>   
>
> Sandy,
>
> qpidd supports hot joining of a member back into the cluster, so
> step 3 is simply:
>
> - rename or move the jrnl dir if you are running a durable store (if 
> no store skip this step)
> - restart node A with the same cluster-name as node B.
>
> It's that simple. The restarted node will be re-synced to the active
> state of node B, and you can continue on. The clients (Java & C++)
> will also be notified of cluster-membership changes, so even if node B
> is brought back on a different IP address the client will know where
> to fail over to.
>

Correction: no rename of the jrnl dir is required; it will do that for
you automatically. Just restart node A with the same cluster-name as
node B.

It will do the rest for you...
Carl.




Re: A few clustering/failover questions

Posted by Carl Trieloff <cc...@redhat.com>.
>
> Assume a cluster of two brokers, A and B
> 1) A dies
> 2) clients fail over to B
> 3) do something to recover A without interrupting clients of B
> 4) A and B are again interchangeable
>
> I've looked through the docs and haven't seen anything about this.  Apologies if I missed it.  I also tried simply restarting A, which doesn't seem to work.
>   

Sandy,

qpidd supports hot joining of a member back into the cluster, so step
3 is simply:

- rename or move the jrnl dir if you are running a durable store (if no
store, skip this step)
- restart node A with the same cluster-name as node B.

It's that simple. The restarted node will be re-synced to the active
state of node B, and you can continue on. The clients (Java & C++) will
also be notified of cluster-membership changes, so even if node B is
brought back on a different IP address the client will know where to
fail over to.

regards
Carl.



Re: A few clustering/failover questions

Posted by Carl Trieloff <cc...@redhat.com>.
Sandy Pratt wrote:
> Gordon:
>
> I'm on RHEL5 and yum gives me the following version for the qpidd package:
>
> qpidd.i386                                                 0.4.732838-1.el5
>
> I presume that's M4?

This looks like the MRG build, which is M4 plus selected trunk patches.

Carl.



RE: A few clustering/failover questions

Posted by Sandy Pratt <pr...@adobe.com>.
Gordon:

I'm on RHEL5 and yum gives me the following version for the qpidd package:

qpidd.i386                                                 0.4.732838-1.el5

I presume that's M4?

Thanks,

Sandy

-----Original Message-----
From: Gordon Sim [mailto:gsim@redhat.com] 
Sent: Thursday, March 05, 2009 1:54 AM
To: users@qpid.apache.org
Cc: Sandy Pratt
Subject: Re: A few clustering/failover questions

Sandy Pratt wrote:
> I've been experimenting with failover on a cluster of two brokers, and
> I often see this log item when a broker fails:
> 
> 2009-mar-04 17:11:19 debug Exception constructed: Attempted size underflow on dequeue(21): size: max=104857600, current=0; count: unlimited; type=flow_to_disk (qpid/broker/QueuePolicy.cpp:54)
> 
> What does underflow mean here?

The policy maintains a running count of enqueued messages and the 
aggregate size. Underflow here means that more data was dequeued than 
was enqueued which indicates some logical error.

What version of the code are you using? M4 or latest from trunk?



Re: A few clustering/failover questions

Posted by Gordon Sim <gs...@redhat.com>.
Sandy Pratt wrote:
> I've been experimenting with failover on a cluster of two brokers, and
> I often see this log item when a broker fails:
> 
> 2009-mar-04 17:11:19 debug Exception constructed: Attempted size underflow on dequeue(21): size: max=104857600, current=0; count: unlimited; type=flow_to_disk (qpid/broker/QueuePolicy.cpp:54)
> 
> What does underflow mean here?

The policy maintains a running count of enqueued messages and the 
aggregate size. Underflow here means that more data was dequeued than 
was enqueued which indicates some logical error.
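
The bookkeeping described above can be illustrated with a small sketch. This is not the code in QueuePolicy.cpp: the class and method names are hypothetical, the limit is not enforced on enqueue, and only the wording of the error message is modelled on the log line quoted above.

```python
class SizePolicy:
    """Running totals for a queue: message count and aggregate size."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self.size = 0    # aggregate bytes currently enqueued
        self.count = 0   # number of messages currently enqueued

    def enqueued(self, nbytes: int) -> None:
        self.size += nbytes
        self.count += 1

    def dequeued(self, nbytes: int) -> None:
        # Removing more than was ever added means the totals are inconsistent.
        if nbytes > self.size or self.count == 0:
            raise ValueError(
                f"Attempted size underflow on dequeue({nbytes}): "
                f"size: max={self.max_size}, current={self.size}"
            )
        self.size -= nbytes
        self.count -= 1
```

Calling dequeued(21) on a policy whose current size is 0 raises the error, which matches the "current=0" seen in the log.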

What version of the code are you using? M4 or latest from trunk?

> The broker seems to have died:
> 
> [prattrs@hsvrhm5 qpidd]$ sudo /sbin/service qpidd status
> qpidd dead but pid file exists
> 
> The test I was running failed over to the other broker and completed
> after a timeout expired.
> 
> A subsequent test immediately failed over to the other broker and
> completed (which makes sense because qpid on the first broker was
> probably dead before it started).
> 
> In a general sense, what are the steps required to recover from a
> broker failure?  What I am looking for is step #3 below:
> 
> Assume a cluster of two brokers, A and B
> 1) A dies
> 2) clients fail over to B
> 3) do something to recover A without interrupting clients of B
> 4) A and B are again interchangeable
> 
> I've looked through the docs and haven't seen anything about this.
> Apologies if I missed it.  I also tried simply restarting A, which
> doesn't seem to work.

What errors/symptoms do you see when simply restarting A? (As Carl says, 
this _should_ be all that is required).


