Posted to dev@qpid.apache.org by "michael j. goulish (JIRA)" <ji...@apache.org> on 2011/02/03 17:42:29 UTC
[jira] Commented: (QPID-2992) Cluster failing to resurrect durable static route depending on order of shutdown
[ https://issues.apache.org/jira/browse/QPID-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990165#comment-12990165 ]
michael j. goulish commented on QPID-2992:
------------------------------------------
Using modifications of Ken's script, I have reproduced two
bad behaviors, including the one that Mark is reporting.
I don't think this is a bug... well, sort of. There are actually
two issues here, neither of them quite a bug.
I will submit a doc bug, and probably one enhancement request.
What's happening is this: messaging systems that include
clusters and stores are sensitive to timing issues around
events like broker introduction and shut-down.
Here are the timing issues that I know of:
1. When you shut down a cluster that is using a store,
there must be time for the last-broker-standing to
realize his status, and mark his store as "clean".
I.e. "my store is the one we should use at re-start."
If all brokers are killed too quickly, this will not
happen. The cluster will not be able to restart
because it will not find any store that has been
marked "clean".
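A defensive way to handle (1) is to never kill the brokers
back-to-back: stop all but one, wait until the survivor sees itself
as the last member, then give it a moment to mark its store clean
before stopping it. A minimal sketch; the wait_for helper is
generic, while the commented qpidd/qpid-cluster lines and the B1/B2
host names are illustrative assumptions, not a verified recipe:

```shell
# wait_for TIMEOUT CMD...: poll CMD once per second until it succeeds,
# or give up after TIMEOUT attempts.  Returns 0 on success, 1 on timeout.
wait_for() {
    timeout=$1; shift
    i=0
    while ! "$@" >/dev/null 2>&1; do
        i=$((i + 1))
        [ "$i" -ge "$timeout" ] && return 1
        sleep 1
    done
    return 0
}

# Staggered cluster shutdown (hypothetical hosts B1/B2; the probe for
# "B1 is the only member left" is an assumption, so it is commented out):
#
#   ssh B2 'qpidd --quit'
#   wait_for 30 ssh B1 'qpid-cluster | grep -q B1'
#   sleep 5     # give the last-broker-standing time to mark its store "clean"
#   ssh B1 'qpidd --quit'
```

The helper itself has no qpid dependency, so it can be dropped into
any of the test scripts as-is.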
2. When you make a topology change, e.g. adding a route
from one cluster to another to create a federation of
clusters, and then shut the cluster down soon afterwards,
you may do so before that topology change has had a chance
to propagate across the cluster.
This can cause a problem on re-start that depends on the
order in which the brokers are killed. If you *first* kill
the broker that knew about the topology change before he
manages to communicate that knowledge to the other broker,
that's bad, because the other broker will be the
last-man-standing, and it will be *his* store that gets marked as
"clean"! So his store will be re-used at startup, and the
cluster will have lost knowledge of the topology change.
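A defensive way around (2) is to poll the *other* broker after
creating the route, and only begin any shutdown once the binding is
visible there. A sketch assuming host B2 and the bridge_queue naming
from Mark's output below; the qpid-config probe mirrors the
"-a user/test@host" form used in the report:

```shell
# route_propagated BROKER PATTERN: succeed once PATTERN appears in the
# broker's exchange bindings.  Returns non-zero if the broker is
# unreachable or the binding has not propagated yet.
route_propagated() {
    qpid-config -a "$1" exchanges --bindings 2>/dev/null | grep -q "$2"
}

# After "qpid-route route add ... -d" on B1, something like:
#
#   until route_propagated user/test@B2 bridge_queue; do sleep 1; done
#   # only now is it safe to start shutting brokers down
```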
By altering the timing of events in Ken's script, I was able
to:
A. get no failures in 200 runs. (original script, plus explicit
wait-loops for brokers.)
B. get 100% failure because of no clean store. (kill both brokers
in B cluster too close together.)
C. get the failure that Mark reported, about 7% of the time.
(place B1 under load, then kill it too soon after
route-creation.)
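The "explicit wait-loops for brokers" that made experiment A reliable
can be as simple as polling the broker's AMQP port before issuing any
qpid-config or qpid-route commands. A sketch, assuming bash's
/dev/tcp redirection is available and the broker listens on the
default port 5672:

```shell
# wait_for_broker HOST PORT [TRIES]: poll once per second until a TCP
# connection to HOST:PORT succeeds, or give up after TRIES attempts.
wait_for_broker() {
    host=$1 port=$2 tries=${3:-30}
    i=0
    while ! (exec 3<>"/dev/tcp/$host/$port") 2>/dev/null; do
        i=$((i + 1))
        [ "$i" -ge "$tries" ] && return 1
        sleep 1
    done
    return 0
}

# e.g. after starting qpidd on B1 (host name and port are assumptions):
#
#   wait_for_broker B1 5672 || { echo "B1 never came up" >&2; exit 1; }
#   qpid-config -a user/test@B1 add exchange direct bosmyex1 --durable
```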
So, here's what I will propose...
I. A bit of documentation (I will take a first sketch-whack at it,
then give it to doc professionals) to centralize description of
this type of problem -- the two I have mentioned above, plus
whatever anyone else thinks up that is similar.
This will include best-practices on how to avoid this type of
problem.
II. A request for enhancement wherever there is no very good way
to avoid one of these multi-broker race conditions.
III. I'll come back and update this Jira with the numbers of
any resultant Jiras that I open.
> Cluster failing to resurrect durable static route depending on order of shutdown
> --------------------------------------------------------------------------------
>
> Key: QPID-2992
> URL: https://issues.apache.org/jira/browse/QPID-2992
> Project: Qpid
> Issue Type: Bug
> Components: C++ Broker, C++ Clustering
> Affects Versions: 0.8
> Environment: Debian Linux Squeeze, 32-bit, kernel 2.6.36.2, Dell Poweredge 1950s. Corosync==1.3.0, Openais==1.1.4
> Reporter: Mark Moseley
> Assignee: michael j. goulish
> Attachments: cluster-fed.sh, error
>
>
> I've got a 2-node qpid test cluster at each of 2 datacenters, which are federated together with a single durable static route between each. Qpid is version 0.8. Corosync and openais are stock Squeeze (1.2.1-3 and 1.1.2-2, respectively). OS is Squeeze, 32-bit, on Dell Poweredge 1950s, kernel 2.6.36. The static route is durable and is set up over SSL (but I can replicate as well with non-SSL). I've tried to normalize the hostnames below to make things clearer; hopefully I didn't mess anything up.
> Given two clusters, cluster A (consisting of hosts A1 and A2) and cluster B (with B1 and B2), I've got a static exchange route from A1 to B1, as well as another from B1 to A1. Federation is working correctly, so I can send a message on A2 and have it successfully retrieved on B2. The exchange local to cluster A is walmyex1; the local exchange for B is bosmyex1.
> If I shut down the cluster in this order: B2, then B1, and start back up with B1, B2, the static route fails to get recreated. That is, on A1/A2, looking at the bindings, exchange 'bosmyex1' does not get re-bound to cluster B; the only output for it in "qpid-config exchanges --bindings" is just:
> <snip>
> Exchange 'bosmyex1' (direct)
> </snip>
> If however I shut the cluster down in this order: B1, then B2, and start B2, then B1, the static route gets re-bound. The output then is:
> <snip>
> Exchange 'bosmyex1' (direct)
> bind [unix.boston.cust] => bridge_queue_1_8870523d-2286-408e-b5b5-50d53db2fa61
> </snip>
> and I can message over the federated link with no further modification. Prior to a few minutes ago, I was seeing this with the Squeeze stock openais==1.1.2 and corosync==1.2.1. In debugging this, I've upgraded both to the latest versions with no change.
> I can replicate this every time I try. These are just test clusters, so I don't have any other activity going on on them, or any other exchanges/queues. My steps:
> On all boxes in cluster A and B:
> * Kill the qpidd if it's running and delete all existing store files, i.e. contents of /var/lib/qpid/
> On host A1 in cluster A (I'm leaving out the -a user/test@host stuff):
> * Start up qpid
> * qpid-config add exchange direct bosmyex1 --durable
> * qpid-config add exchange direct walmyex1 --durable
> * qpid-config add queue walmyq1 --durable
> * qpid-config bind walmyex1 walmyq1 unix.waltham.cust
> On host B1 in cluster B:
> * qpid-config add exchange direct bosmyex1 --durable
> * qpid-config add exchange direct walmyex1 --durable
> * qpid-config add queue bosmyq1 --durable
> * qpid-config bind bosmyex1 bosmyq1 unix.boston.cust
> On cluster A:
> * Start other member of cluster, A2
> * qpid-route route add amqps://user/pass@HOSTA1:5671 amqps://user/pass@HOSTB1:5671 walmyex1 unix.waltham.cust -d
> On cluster B:
> * Start other member of cluster, B2
> * qpid-route route add amqps://user/pass@HOSTB1:5671 amqps://user/pass@HOSTA1:5671 bosmyex1 unix.boston.cust -d
> On either cluster:
> * Check "qpid-config exchanges --bindings" to make sure bindings are correct for remote exchanges
> * To see correct behaviour, stop cluster in the order B1->B2, or A1->A2, start cluster back up, check bindings.
> * To see broken behaviour, stop cluster in the order B2->B1, or A2->A1, start cluster back up, check bindings.
> This is a test cluster, so I'm free to do anything with it, debugging-wise, that would be useful.
--
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project: http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org