Posted to dev@qpid.apache.org by "Alan Conway (JIRA)" <qp...@incubator.apache.org> on 2009/10/22 21:37:59 UTC

[jira] Created: (QPID-2157) Persistent cluster restart

Persistent cluster restart
--------------------------

                 Key: QPID-2157
                 URL: https://issues.apache.org/jira/browse/QPID-2157
             Project: Qpid
          Issue Type: Bug
          Components: C++ Broker
    Affects Versions: 0.5
            Reporter: Alan Conway
            Assignee: Alan Conway


Currently, when restarting a persistent cluster, the first broker to start loads from its store and all other brokers move their store aside and update from the cluster.  If some brokers failed and have out-of-date stores, we assume manual intervention to ensure that the correct broker is started first.

The goal is to have the brokers automatically compare their stores, allowing all brokers with clean stores to load from store and all other brokers to update from the cluster.
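
As a rough illustration of that goal (not Qpid code), the decision could be sketched as below. The StoreState enum, the plan_restart function and the action strings are all hypothetical names made up for this example, and the case where no broker has a clean store is deliberately left out.

```python
from enum import Enum


class StoreState(Enum):
    """Hypothetical classification of a broker's store at restart."""
    CLEAN = "clean"  # assumed to be up to date
    DIRTY = "dirty"  # assumed to be out of date, e.g. the broker failed


def plan_restart(stores):
    """Map each broker name to the action it would take on restart.

    Brokers with a clean store load from it; all others discard
    their store and update from the cluster.
    """
    return {
        broker: "load-from-store"
        if state is StoreState.CLEAN
        else "update-from-cluster"
        for broker, state in stores.items()
    }


plan = plan_restart(
    {"a": StoreState.CLEAN, "b": StoreState.DIRTY, "c": StoreState.CLEAN}
)
print(plan)
```

Here brokers "a" and "c" would load from their stores while "b" would update from the cluster, with no need to start a particular broker first.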

A design note for this issue is at http://cwiki.apache.org/confluence/display/qpid/Persistent+Cluster+Restart+Design+Note

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org


[jira] Resolved: (QPID-2157) Persistent cluster restart

Posted by "Alan Conway (JIRA)" <qp...@incubator.apache.org>.
     [ https://issues.apache.org/jira/browse/QPID-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway resolved QPID-2157.
-------------------------------

    Resolution: Fixed

Fixed by commits from 883842 to 884226.
Manual recovery from a complete cluster failure is not addressed yet; see QPID-2220.


[jira] Commented: (QPID-2157) Persistent cluster restart

Posted by "Carl Trieloff (JIRA)" <qp...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/QPID-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769208#action_12769208 ] 

Carl Trieloff commented on QPID-2157:
-------------------------------------



In the reqs, I don't understand this statement: "Non-persistent members are allowed but must also have --cluster-store-count."

Carl.


[jira] Commented: (QPID-2157) Persistent cluster restart

Posted by "Alan Conway (JIRA)" <qp...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/QPID-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12777147#action_12777147 ] 

Alan Conway commented on QPID-2157:
-----------------------------------

I've updated the design page; it's simpler now and should answer some of the questions above.


[jira] Commented: (QPID-2157) Persistent cluster restart

Posted by "Ken Giusti (JIRA)" <qp...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/QPID-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12769257#action_12769257 ] 

Ken Giusti commented on QPID-2157:
----------------------------------

Looks good. Minor questions/notes (remember, I'm a cluster noob, so feel free to reject if common knowledge makes these obvious) - thx.

Section Tests:, last bullet item:

>start 2 clusters, shut down. Attempt to re-start members as a single cluster. Unrelated stores detected and cluster re-start exits with error.

Section Design:

? Should we always assume that --cluster-store-count also indicates the minimum number of cluster members that should be waited for before the cluster selects persistent state? For example, if I have 5 persistent members, must all 5 be available before a store is selected and propagated? Would it be useful to recover state if, say, 3 persistent members become ready, all have matching clean stores, and we time out waiting for the remaining two? Should we provide an optional --cluster-store-min? If not, what is the behavior on timeout if fewer than --cluster-store-count members are available?
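
To make the question concrete, here is a minimal sketch of the behaviour being asked about. It is not how the broker works: ready_to_select and its parameters are hypothetical, and store_min stands in for the proposed --cluster-store-min option, which does not exist.

```python
def ready_to_select(ready, store_count, timed_out, store_min=None):
    """Decide whether the cluster may select a store to recover from.

    ready       -- persistent members that have joined so far
    store_count -- the value given as --cluster-store-count
    timed_out   -- True once the wait for more members has expired
    store_min   -- hypothetical lower bound (the proposed
                   --cluster-store-min); None means no lower bound
    """
    if ready >= store_count:
        # Every expected persistent member is present.
        return True
    if timed_out and store_min is not None and ready >= store_min:
        # Proposed relaxation: proceed with a smaller quorum after a timeout.
        return True
    return False


# 3 of 5 members are ready and the wait has timed out.
print(ready_to_select(3, 5, timed_out=True))               # strict rule
print(ready_to_select(3, 5, timed_out=True, store_min=3))  # with a minimum
```

Under the strict rule the cluster keeps waiting (or fails) in the 3-of-5 case; with a minimum of 3 it would go ahead once the timeout expires.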

Section Startup:

>If no member has a clean store: member with highest frame-sequence loads store and provides updates.

? Would it be safer to require manual intervention in this case as the default behaviour?
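
For reference, the rule quoted above and the alternative default suggested here could be sketched as follows. pick_loader, the require_manual flag and the member names are hypothetical; this only illustrates the selection rule, not the broker's actual logic.

```python
def pick_loader(frame_sequences, require_manual=False):
    """Pick the member that loads its store when no store is clean.

    frame_sequences -- dict of member name -> stored frame-sequence number
    require_manual  -- hypothetical switch for the suggested default of
                       refusing to choose automatically
    """
    if not frame_sequences:
        raise ValueError("no members to choose from")
    if require_manual:
        raise RuntimeError("no clean store: manual intervention required")
    # The highest frame-sequence wins; on a tie, max() returns the first
    # such member in the dict's iteration order.
    return max(frame_sequences, key=frame_sequences.get)


loader = pick_loader({"a": 10, "b": 42, "c": 7})
print(loader)
```

With the sequence numbers shown, member "b" would load its store and provide updates to the others.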

? The following statement (under stored state):

Orderly shutdown means qpid-cluster -k, or any shutdown for last [persistent] member cluster.

implies that cluster members can shut down at different times - some later than others. The tests seem to imply this, too, and that any later state is recovered. If true, then:

>If some members have a clean store:
>    * compare stored state, if any mismatch then manual intervention required


Shouldn't a mismatch on just the frame seq # still be allowed,  selecting the "oldest" sequence number as the store to recover?

thx,

-K    
