Posted to users@qpid.apache.org by Mark Moseley <mo...@gmail.com> on 2010/12/29 01:00:04 UTC

Cluster failing to resurrect durable static route

Sorry in advance that this is long. I've tried to explain it as
succinctly but thoroughly as possible.

I've got a 2-node qpid test cluster at each of 2 datacenters. The two
clusters are federated with a single durable static route in each
direction, set up over SSL. Qpid is version 0.8. Corosync and openais
are stock Squeeze (1.2.1-3 and 1.1.2-2, respectively). OS is Squeeze,
32-bit, on Dell PowerEdge 1950s, kernel 2.6.36.

This is quite possibly just a conceptual problem with how I'm setting
this up, so if anyone has a 'right way' to do it, I'm all ears :)

Just a prelim: Call them cluster A with nodes A1 and A2, and cluster B
with nodes B1 and B2. One static route is defined as A1->B1 for an
exchange on cluster B (call it exchangeB), and the other route is
B1->A1 for an exchange on cluster A (call it exchangeA). After setting
this up, things seem to work pretty well. I can send from any node in
cluster A to exchangeB and it's received by any receiving node in
cluster B. Running "qpid-config ... exchanges --bindings" on the
cluster A nodes shows the route to cluster B for exchangeB, and vice
versa. That seems to be good.

The trouble I'm having is with failover. The result depends on the
order in which the nodes go down. If the node that the route was
created on (A1) is killed first:

* Kill A1, kill A2, start A2, start A1  -> The bindings on cluster B
for exchangeA get set back up automatically

Also, after I kill A1, the route seems to fail over correctly to A2,
i.e. with A1 dead and A2 still alive, looking at qpid-route on B1 or
B2 says:
Exchange 'exchangeA' (direct)
    bind [mytopic] => bridge_queue_1_f6d80145-67d2-4659-b26e-80c4da3ae85b

If I stop the cluster in this order:

* Kill A2, kill A1, start A1, start A2  -> The bindings on cluster B
for exchangeA don't get set up, i.e. on B1 or B2, qpid-route says:
Exchange 'exchangeA' (direct)
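
The two outputs above differ only in whether a "bind" line follows the
exchange header. A minimal Python sketch for checking that automatically
between failover runs, assuming the output has exactly the shape quoted
in this post (the name parse_bindings is hypothetical, and real
qpid-route output may contain lines this does not handle):

```python
import re

# Patterns match only the two line shapes quoted in this post (an assumption).
EXCHANGE_RE = re.compile(r"^Exchange '(?P<name>[^']+)' \((?P<type>[^)]+)\)$")
BIND_RE = re.compile(r"^bind \[(?P<key>[^\]]*)\]\s*=>\s*(?P<queue>\S+)$")


def parse_bindings(text):
    """Map each exchange name to a list of (routing_key, queue) bindings."""
    bindings = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        exchange = EXCHANGE_RE.match(line)
        if exchange:
            # Start a new, initially empty, binding list for this exchange.
            current = exchange.group("name")
            bindings[current] = []
            continue
        bind = BIND_RE.match(line)
        if bind and current is not None:
            bindings[current].append((bind.group("key"), bind.group("queue")))
    return bindings


def route_is_up(text, exchange):
    """True if the named exchange has at least one binding in the output."""
    return bool(parse_bindings(text).get(exchange))
```

With the first output quoted above, route_is_up(output, "exchangeA") is
true; with the second (header only, no bind line) it is false.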

Am I doing something wrong or is this a known limitation? I'd expect
that regardless of ordering, a durable route would come back up on its
own, on either node. I'd also think that if it was a limitation, it'd
happen in the other order, when A2 was the last node standing,
considering the route was created for A1.

I had tried earlier to use source routes for my routing, and they
seemed to do better at coming back after failover. But on the source
cluster's side, the non-primary node (A2) would often blow up when
cluster B was down and a node in cluster B came back online, always
saying this in A2's qpid logs (10.1.58.3 is A1, 10.1.58.4 is A2):

2010-12-28 17:19:37 info ACL Allow id:walclust@QPID action:create ObjectType:link Name:
2010-12-28 17:19:37 info Connection is a federation link
2010-12-28 17:19:39 error Channel exception: not-attached: Channel 1 is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
2010-12-28 17:19:39 critical cluster(10.1.58.4:3128 READY/error) local error 3054 did not occur on member 10.1.58.3:3369: not-attached: Channel 1 is not)
2010-12-28 17:19:39 critical Error delivering frames: local error did not occur on all cluster members : not-attached: Channel 1 is not attached (qpid/a)
2010-12-28 17:19:39 notice cluster(10.1.58.4:3128 LEFT/error) leaving cluster walclust
2010-12-28 17:19:39 notice Shut down


I'm pushing my luck with an email this long, but I'll mention one
other weirdness. I was working on another test cluster where the IPs
were 10.1.1.246 and 10.1.1.247. The qpid logs fairly consistently
referred to them as 10.1.1.118 and 10.1.1.119, almost as if the high
(8th) bit of the last octet were being cleared. Could be some localized
bizarreness (though dns and nsswitch both reported the IPs correctly)
but I thought I'd mention it. I haven't tried it out with other IPs
where the 4th octet (or any octet) is 128 or higher.

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:users-subscribe@qpid.apache.org

Re: Cluster failing to resurrect durable static route

Posted by Mark Moseley <mo...@gmail.com>.

On Tue, Dec 28, 2010 at 4:00 PM, Mark Moseley <mo...@gmail.com> wrote:
> I'm pushing my luck with an email this long, but I'll mention one
> other weirdness. I was working on another test cluster where the IPs
> were 10.1.1.246 and 10.1.1.247. In the qpid logs, they were fairly
> consistently referred to in the logs as 10.1.1.118 and 10.1.1.119,
> almost like the 8th bit was being cleared. Could be some localized
> bizarreness (though dns and nsswitch both reported the IPs correctly)
> but I thought I'd mention it. I haven't tried it out with other IPs
> where the 4th octet (or any octet) is over 128.
>

Ignore this last thing. It's probably just from having
"clear_node_high_bit: yes" in corosync.conf. Everything besides the
odd IP thing is still pertinent though.
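
For anyone who runs into the same thing, the arithmetic matches: clearing
the top bit (value 128) of the last octet turns 246 into 118 and 247 into
119. A small Python sketch of just that arithmetic follows. It assumes the
displayed address differs only in the high bit of the final octet, and it
does not model how corosync actually derives node IDs; the function name
is made up for illustration.

```python
def clear_last_octet_high_bit(ip):
    """Return the IPv4 address with bit 7 (value 128) of the last octet cleared."""
    octets = [int(part) for part in ip.split(".")]
    if len(octets) != 4 or any(o < 0 or o > 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address: %r" % ip)
    # 0x7F keeps the low seven bits and drops the high bit.
    octets[3] &= 0x7F
    return ".".join(str(o) for o in octets)


for address in ("10.1.1.246", "10.1.1.247", "10.1.58.3"):
    print(address, "->", clear_last_octet_high_bit(address))
```

Addresses whose last octet is below 128, such as 10.1.58.3 and 10.1.58.4
on the other cluster, come out unchanged, which fits the oddity only
showing up on the .246/.247 nodes.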

Re: Cluster failing to resurrect durable static route

Posted by Alan Conway <ac...@redhat.com>.

On 01/07/2011 07:21 PM, Mark Moseley wrote:
> The source-local route issue still persists with a newer
> corosync/openais. I've opened the JIRA 2993. Still haven't figured out
> how to actually assign it to you though.
>

Thanks for the JIRAs, I assigned them to myself. I guess that only committers can
assign JIRAs, which is why you didn't find the button to do it. I'll look into
these and/or give them to someone who will.

Re: Cluster failing to resurrect durable static route

Posted by Mark Moseley <mo...@gmail.com>.

On Fri, Jan 7, 2011 at 3:50 PM, Mark Moseley <mo...@gmail.com> wrote:
> In debugging this, I figured I really ought to upgrade
> corosync/openais for the heck of it. I just did that a couple of hours
> ago and now I'm going to re-test the source route case. If upgrading
> corosync/openais doesn't fix it, I'll open up another JIRA.
>
> Thanks!
>

The source-local route issue still persists with a newer
corosync/openais. I've opened the JIRA 2993. Still haven't figured out
how to actually assign it to you though.

Re: Cluster failing to resurrect durable static route

Posted by Mark Moseley <mo...@gmail.com>.

On Thu, Jan 6, 2011 at 12:55 PM, Alan Conway <ac...@redhat.com> wrote:
> On 12/28/2010 07:00 PM, Mark Moseley wrote:
>> Am I doing something wrong or is this a known limitation? I'd expect
>> that regardless of ordering, a durable route would come back up on its
>> own, on either node. I'd also think that if it was a limitation, it'd
>> happen in the other order, when A2 was the last node standing,
>> considering the route was created for A1.
>>
>
> I think you have uncovered a bug, can you create a JIRA for it and assign it
> to me  initially? Detailed instructions on how to reproduce are greatly
> appreciated.

I've created this as JIRA 2992. I wasn't quite clever enough to figure
out how to assign it to you :)   Sorry to be daft, but I can't seem to
find any link/button that looks like it'd let me do that.


>> I had tried earlier to use source routes for my routing and they
>> seemed to do better at coming back after failover but on the source
>> clusters' side, the non-primary node (A2) would often blow up when
>> cluster B was down and a node in cluster B came back online, always
>> saying this in A2's qpid logs (10.1.58.3 is A1, 10.1.58.4 is A2):
>>
>> 2010-12-28 17:19:37 info ACL Allow id:walclust@QPID action:create
>> ObjectType:link Name:
>> 2010-12-28 17:19:37 info Connection is a federation link
>> 2010-12-28 17:19:39 error Channel exception: not-attached: Channel 1
>> is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
>> 2010-12-28 17:19:39 critical cluster(10.1.58.4:3128 READY/error) local
>> error 3054 did not occur on member 10.1.58.3:3369: not-attached:
>> Channel 1 is not)
>> 2010-12-28 17:19:39 critical Error delivering frames: local error did
>> not occur on all cluster members : not-attached: Channel 1 is not
>> attached (qpid/a)
>> 2010-12-28 17:19:39 notice cluster(10.1.58.4:3128 LEFT/error) leaving
>> cluster walclust
>> 2010-12-28 17:19:39 notice Shut down
>>
>
> This also sounds like a bug, can you create a separate JIRA for it? Assign
> to me as well.

In debugging this, I figured I really ought to upgrade
corosync/openais for the heck of it. I just did that a couple of hours
ago and now I'm going to re-test the source route case. If upgrading
corosync/openais doesn't fix it, I'll open up another JIRA.

Thanks!

Re: Cluster failing to resurrect durable static route

Posted by Alan Conway <ac...@redhat.com>.

On 12/28/2010 07:00 PM, Mark Moseley wrote:
> Am I doing something wrong or is this a known limitation? I'd expect
> that regardless of ordering, a durable route would come back up on its
> own, on either node. I'd also think that if it was a limitation, it'd
> happen in the other order, when A2 was the last node standing,
> considering the route was created for A1.
>

I think you have uncovered a bug, can you create a JIRA for it and assign it to 
me  initially? Detailed instructions on how to reproduce are greatly appreciated.

> I had tried earlier to use source routes for my routing and they
> seemed to do better at coming back after failover but on the source
> clusters' side, the non-primary node (A2) would often blow up when
> cluster B was down and a node in cluster B came back online, always
> saying this in A2's qpid logs (10.1.58.3 is A1, 10.1.58.4 is A2):
>
> 2010-12-28 17:19:37 info ACL Allow id:walclust@QPID action:create
> ObjectType:link Name:
> 2010-12-28 17:19:37 info Connection is a federation link
> 2010-12-28 17:19:39 error Channel exception: not-attached: Channel 1
> is not attached (qpid/amqp_0_10/SessionHandler.cpp:39)
> 2010-12-28 17:19:39 critical cluster(10.1.58.4:3128 READY/error) local
> error 3054 did not occur on member 10.1.58.3:3369: not-attached:
> Channel 1 is not)
> 2010-12-28 17:19:39 critical Error delivering frames: local error did
> not occur on all cluster members : not-attached: Channel 1 is not
> attached (qpid/a)
> 2010-12-28 17:19:39 notice cluster(10.1.58.4:3128 LEFT/error) leaving
> cluster walclust
> 2010-12-28 17:19:39 notice Shut down
>

This also sounds like a bug, can you create a separate JIRA for it? Assign to me 
as well.
