Posted to dev@qpid.apache.org by "michael j. goulish (JIRA)" <ji...@apache.org> on 2011/02/03 17:42:29 UTC

[jira] Commented: (QPID-2992) Cluster failing to resurrect durable static route depending on order of shutdown

    [ https://issues.apache.org/jira/browse/QPID-2992?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12990165#comment-12990165 ] 

michael j. goulish commented on QPID-2992:
------------------------------------------

    Using modifications of Ken's script, I have reproduced two
    bad behaviors, including the one that Mark is reporting.

    I don't think this is a bug -- well, sort of.  It's really two
    issues.  I will submit a doc bug, and probably one enhancement
    request.

    What's happening is this:  messaging systems that include
    clusters and stores are sensitive to timing issues around
    events like broker introduction and shut-down.

    Here are the timing issues that I know of:

    1. When you shut down a cluster that is using a store,
       there must be time for the last-broker-standing to
       realize his status, and mark his store as "clean".
       I.e. "my store is the one we should use at re-start."
       If all brokers are killed too quickly, this will not
       happen.  The cluster will not be able to restart
       because it will not find any store that has been
       marked "clean".


    2. When you make a topology change, i.e. adding a route
       from one cluster to another to create a federation-of-
       clusters, and you then shut down the cluster soon afterwards,
       you may shut it down before that topology change has had a
       chance to propagate across the cluster.

       This can cause a problem on re-start that depends on the
       order in which the brokers are killed.  If you kill the
       broker that knew about the topology change *first*, before
       he manages to communicate that knowledge to the other
       broker, that's bad: the other broker will be the last man
       standing, and it will be *his* store that gets marked
       "clean"!  So his store will be re-used at startup, and the
       cluster will have lost knowledge of the topology change.
       (A sketch of how to guard against both of these races
       follows this list.)
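
    For illustration, here is a rough sketch of the ordering a
    start/stop script could enforce to avoid both of the races
    above.  It is untested; the hostnames, ports, credentials and
    the expected binding key are placeholders, and it assumes
    "qpidd --quit" (an init script or a plain kill would do as
    well) is how a broker gets stopped on each host:

        #!/bin/bash
        # Hypothetical cluster members and the binding we expect to see.
        BROKERS="user/pass@B1:5672 user/pass@B2:5672"
        BINDING="unix.boston.cust"

        # Race 2: after "qpid-route route add ...", wait until the new
        # binding has propagated to every member of the cluster.
        for b in $BROKERS; do
            until qpid-config -a "$b" exchanges --bindings | grep -q "$BINDING"; do
                sleep 1
            done
        done

        # Race 1: stop the brokers one at a time, and give the survivor
        # time to notice that it is the last broker standing so it can
        # mark its store "clean" before it, too, is stopped.
        ssh B2 'qpidd --quit'
        sleep 5
        ssh B1 'qpidd --quit'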


    By altering the timing of events in Ken's script, I was able
    to:

    A. get no failures in 200 runs.  (original script, plus explicit
       wait-loops for brokers -- see the sketch after this list.)

    B. get 100% failure because of no clean store.  (kill both brokers
       in B cluster too close together.)

    C. get the failure that Mark reported, about 7% of the time.
       (place B1 under load, then kill it too soon after route
        creation.)
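
    The "explicit wait-loops" in (A) are nothing fancy -- the script
    just polls each broker until it answers a management request
    before moving on.  Something along these lines (broker addresses
    and credentials are placeholders):

        # Wait until a broker is up and answering management requests.
        wait_for_broker() {
            local broker="$1"     # e.g. user/pass@B1:5672
            local tries=0
            until qpid-config -a "$broker" exchanges > /dev/null 2>&1; do
                sleep 1
                tries=$((tries + 1))
                [ "$tries" -gt 60 ] && { echo "$broker never came up" >&2; return 1; }
            done
        }

        wait_for_broker user/pass@B1:5672
        wait_for_broker user/pass@B2:5672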


    So, here's what I will propose...

    I. A bit of documentation (I will take a first sketch-whack at it,
       then hand it to the doc professionals) to centralize the
       description of this type of problem -- the two I have mentioned
       above, plus whatever anyone else thinks up that is similar.

       This will include best-practices on how to avoid this type of
       problem.


    II. An enhancement request wherever there is no good way to
        avoid one of these multi-broker race conditions.


    III. I'll come back and update this Jira with the numbers of
         any resultant Jiras that I open.



> Cluster failing to resurrect durable static route depending on order of shutdown
> --------------------------------------------------------------------------------
>
>                 Key: QPID-2992
>                 URL: https://issues.apache.org/jira/browse/QPID-2992
>             Project: Qpid
>          Issue Type: Bug
>          Components: C++ Broker, C++ Clustering
>    Affects Versions: 0.8
>         Environment: Debian Linux Squeeze, 32-bit, kernel 2.6.36.2, Dell Poweredge 1950s. Corosync==1.3.0, Openais==1.1.4
>            Reporter: Mark Moseley
>            Assignee: michael j. goulish
>         Attachments: cluster-fed.sh, error
>
>
> I've got a 2-node qpid test cluster at each of 2 datacenters, which are federated together with a single durable static route between each. Qpid is version 0.8. Corosync and openais are stock Squeeze (1.2.1-3 and 1.1.2-2, respectively). OS is Squeeze, 32-bit, on Dell Poweredge 1950s, kernel 2.6.36. The static route is durable and is set up over SSL (but I can replicate as well with non-SSL). I've tried to normalize the hostnames below to make things clearer; hopefully I didn't mess anything up.
> Given two clusters, cluster A (consisting of hosts A1 and A2) and cluster B (with B1 and B2), I've got a static exchange route from A1 to B1, as well as another from B1 to A1. Federation is working correctly, so I can send a message on A2 and have it successfully retrieved on B2. The exchange local to cluster A is walmyex1; the local exchange for B is bosmyex1.
> If I shut down the cluster in this order: B2, then B1, and start back up with B1, B2, the static route fails to get recreated. That is, on A1/A2, looking at the bindings, exchange 'bosmyex1' does not get re-bound to cluster B; the only output for it in "qpid-config exchanges --bindings" is just:
> <snip>
> Exchange 'bosmyex1' (direct)
> </snip>
> If however I shut the cluster down in this order: B1, then B2, and start B2, then B1, the static route gets re-bound. The output then is:
> <snip>
> Exchange 'bosmyex1' (direct)
>     bind [unix.boston.cust] => bridge_queue_1_8870523d-2286-408e-b5b5-50d53db2fa61
> </snip>
> and I can message over the federated link with no further modification. Prior to a few minutes ago, I was seeing this with the Squeeze stock openais==1.1.2 and corosync==1.2.1. In debugging this, I've upgraded both to the latest versions with no change.
> I can replicate this every time I try. These are just test clusters, so I don't have any other activity going on on them, or any other exchanges/queues. My steps:
> On all boxes in cluster A and B:
> * Kill the qpidd if it's running and delete all existing store files, i.e. contents of /var/lib/qpid/
> On host A1 in cluster A (I'm leaving out the -a user/test@host stuff):
> * Start up qpid
> * qpid-config add exchange direct bosmyex1 --durable
> * qpid-config add exchange direct walmyex1 --durable
> * qpid-config add queue walmyq1 --durable
> * qpid-config bind walmyex1 walmyq1 unix.waltham.cust
> On host B1 in cluster B:
> * qpid-config add exchange direct bosmyex1 --durable
> * qpid-config add exchange direct walmyex1 --durable
> * qpid-config add queue bosmyq1 --durable
> * qpid-config bind bosmyex1 bosmyq1 unix.boston.cust
> On cluster A:
> * Start other member of cluster, A2
> * qpid-route route add amqps://user/pass@HOSTA1:5671 amqps://user/pass@HOSTB1:5671 walmyex1 unix.waltham.cust -d
> On cluster B:
> * Start other member of cluster, B2
> * qpid-route route add amqps://user/pass@HOSTB1:5671 amqps://user/pass@HOSTA1:5671 bosmyex1 unix.boston.cust -d
> On either cluster:
> * Check "qpid-config exchanges --bindings" to make sure bindings are correct for remote exchanges
> * To see correct behaviour, stop cluster in the order B1->B2, or A1->A2, start cluster back up, check bindings.
> * To see broken behaviour, stop cluster in the order B2->B1, or A2->A1, start cluster back up, check bindings.
> This is a test cluster, so I'm free to do anything with it, debugging-wise, that would be useful. 

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org