Posted to dev@qpid.apache.org by "Alan Conway (JIRA)" <qp...@incubator.apache.org> on 2009/11/25 21:42:39 UTC

[jira] Created: (QPID-2220) Assisign manual recovery from a complete persistent cluster crash.

Assisign manual recovery from a complete persistent cluster crash.
------------------------------------------------------------------

                 Key: QPID-2220
                 URL: https://issues.apache.org/jira/browse/QPID-2220
             Project: Qpid
          Issue Type: Improvement
          Components: C++ Broker
    Affects Versions: 0.5
            Reporter: Alan Conway
            Assignee: Alan Conway


If every member of a persistent cluster crashes then manual intervention is required to identify which store is most up-to-date, so it can be used to recover.
We need to provide tools to assist in this identification.

The cluster can save a config-change counter with each config change. In recovery, the broker with the highest config-change counter has the best store. However if the last brokers in the cluster crash so close together that none can record a config-change we need an additional decider.

The store at http://qpidcomponents.org/download.html#persistence maintains a global counter called the RecordIdentifier (RID) that is incremented for each enqueue and dequeue. If the cluster stores  (config-change,RID) pairs then in recovery we can use actual-RID - RID at config-change as a tiebreaker.

Is it reasonable to provide access to this counter in the generic MessageStore API? Stores that don't implement it can simply return 0, and the cluster must fall back to relying on config-change counts.
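
The comparison described above can be sketched as follows. This is only an illustration under stated assumptions: the StoreStatus struct and the functions newerThan and bestStore are hypothetical names, not part of Qpid. Stores are ranked by config-change counter first, and by the number of store changes since that config change second.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-broker summary gathered during manual recovery.
struct StoreStatus {
    std::uint64_t configChange;  // config-change counter last recorded by the broker
    std::uint64_t ridAtConfig;   // RID saved together with that config change
    std::uint64_t actualRid;     // RID found in the store at recovery time
};

// Number of store changes made after the last recorded config change.
inline std::uint64_t changesSinceConfig(const StoreStatus& s) {
    return s.actualRid - s.ridAtConfig;
}

// True if store a is more up to date than store b.
inline bool newerThan(const StoreStatus& a, const StoreStatus& b) {
    if (a.configChange != b.configChange)
        return a.configChange > b.configChange;
    // Same membership: fall back to the RID tiebreaker.
    return changesSinceConfig(a) > changesSinceConfig(b);
}

// Index of the most up-to-date store in a non-empty list.
inline std::size_t bestStore(const std::vector<StoreStatus>& stores) {
    assert(!stores.empty());
    std::size_t best = 0;
    for (std::size_t i = 1; i < stores.size(); ++i)
        if (newerThan(stores[i], stores[best])) best = i;
    return best;
}
```

If a store does not implement the counter and always reports 0, the tiebreaker is 0 for every broker and the ranking depends on the config-change counter alone.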

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
Apache Qpid - AMQP Messaging Implementation
Project:      http://qpid.apache.org
Use/Interact: mailto:dev-subscribe@qpid.apache.org


[jira] Commented: (QPID-2220) Assisting manual recovery from a complete persistent cluster crash.

Posted by "Alan Conway (JIRA)" <qp...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/QPID-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782961#action_12782961 ] 

Alan Conway commented on QPID-2220:
-----------------------------------

To clarify the situation: the problem is recovering from a total cluster failure with no clean stores. We want to identify the store that is the most up to date, i.e. the last one modified with respect to cluster order. We can do a pretty good job just in cluster code by recording config changes.

Now if 2 or more brokers were killed at the same configuration, we'd like a more fine-grained comparison.

Using the Persistence ID works for the Red Hat store because it is a monotonically increasing value that gets incremented for (almost) every change to the store (currently it is not incremented for deleting queues, exchanges or bindings). So if we record the PID value N with the config change and in recovery we find the store is at PID M, then we know there were M-N changes to that store since the config change. That's a number we can meaningfully compare for brokers that died at the same membership.

Factors that make this work:
 - value that increases with each change to the db.
 - at runtime we can query the current value to save at each config change  
 - in recovery we can find the value associated with the database

Is that something we could have as an optional API on a MessageStore, or should we put it on a separate plugin that can optionally be provided by the store plugin?
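
A minimal sketch of the M-N bookkeeping, assuming the counter is a 64-bit unsigned value that may wrap around to 0. ConfigRecord, recordConfigChange and changesSince are hypothetical names used only for illustration.

```cpp
#include <cstdint>

// Hypothetical record written by the cluster code at each config change.
struct ConfigRecord {
    std::uint64_t configChange;  // membership-change counter
    std::uint64_t pidAtConfig;   // PID value N sampled from the store at that point
};

// At runtime: on each config change, sample the store's current PID and save it.
inline ConfigRecord recordConfigChange(const ConfigRecord& prev, std::uint64_t currentPid) {
    return ConfigRecord{prev.configChange + 1, currentPid};
}

// In recovery: the store is found at PID M, so M - N changes happened since
// the config change. Unsigned subtraction is modulo 2^64, so the result is
// still correct if the counter wrapped around once in between.
inline std::uint64_t changesSince(const ConfigRecord& rec, std::uint64_t recoveredPid) {
    return recoveredPid - rec.pidAtConfig;
}
```

The difference is only comparable between brokers whose records carry the same config-change counter.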

> Assisting manual recovery from a complete persistent cluster crash.
> -------------------------------------------------------------------
>
>                 Key: QPID-2220
>                 URL: https://issues.apache.org/jira/browse/QPID-2220
>             Project: Qpid
>          Issue Type: Improvement
>          Components: C++ Broker
>    Affects Versions: 0.5
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>
> If every member of a persistent cluster crashes then manual intervention is required to identify which store is most up-to-date, so it can be used to recover. We need to provide tools to assist in this identification.
> The cluster can save a config-change counter with each config change (cluster membership change). In recovery, the broker with the highest config-change counter has the best store. 
> However if the last brokers in the cluster crash so close together that none can record a config-change we need an additional decider.
> The store at http://qpidcomponents.org/download.html#persistence maintains a global Persistence ID, a 64 bit value that is incremented for each enqueue, dequeue. If the cluster stores  (config-change,PID) pairs then in recovery we can use actual-PID - config-change PID as a tiebreaker.
> Proposed change to MessageStore API:
>   /** Returns a monotonically increasing value reflecting changes to the store.
>   * The value can wrap-around to 0.
>   * Stores need not implement this function, they can simply return 0.
>   */
>   uint64_t getChangeCounter();
> The default implementation just returns 0  and the cluster must fall back to relying on config-change counts.



[jira] Commented: (QPID-2220) Assisting manual recovery from a complete persistent cluster crash.

Posted by "Carl Trieloff (JIRA)" <qp...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/QPID-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782655#action_12782655 ] 

Carl Trieloff commented on QPID-2220:
-------------------------------------


One note is we should use PID, as that will also include adding bindings, exchanges and queues and all enqueues and dequeues.

Carl.



[jira] Updated: (QPID-2220) Assisting manual recovery from a complete persistent cluster crash.

Posted by "Alan Conway (JIRA)" <qp...@incubator.apache.org>.
     [ https://issues.apache.org/jira/browse/QPID-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway updated QPID-2220:
------------------------------

    Summary: Assisting manual recovery from a complete persistent cluster crash.  (was: Assisign manual recovery from a complete persistent cluster crash.)

> Assisting manual recovery from a complete persistent cluster crash.
> -------------------------------------------------------------------
>
>                 Key: QPID-2220
>                 URL: https://issues.apache.org/jira/browse/QPID-2220
>             Project: Qpid
>          Issue Type: Improvement
>          Components: C++ Broker
>    Affects Versions: 0.5
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>
> If every member of a persistent cluster crashes then manual intervention is required to identify which store is most up-to-date, so it can be used to recover. We need to provide tools to assist in this identification.
> The cluster can save a config-change counter with each config change (cluster membership change). In recovery, the broker with the highest config-change counter has the best store. However if the last brokers in the cluster crash so close together that none can record a config-change we need an additional decider.
> The store at http://qpidcomponents.org/download.html#persistence maintains a global Record Identifier (RID), a 64 bit value that is incremented for each enqueue and dequeue. If the cluster stores  (config-change,RID) pairs then in recovery we can use actual-RID - RID at config-change as a tiebreaker.
> Proposed change to MessageStore API:
>   /** Returns a monotonically increasing value reflecting the number of changes to the store.
>   * The value can wrap-around to 0.
>   * Stores need not implement this function, they can simply return 0.
>   */
>   uint64_t getChangeCounter();
> The default implementation just returns 0  and the cluster must fall back to relying on config-change counts.



[jira] Commented: (QPID-2220) Assisign manual recovery from a complete persistent cluster crash.

Posted by "Carl Trieloff (JIRA)" <qp...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/QPID-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782613#action_12782613 ] 

Carl Trieloff commented on QPID-2220:
-------------------------------------



Alan,

The PID can be used, which maps to the RID. PID is a count for all store operations, so making init take an initial ID for a new store joining, and adding getID, may do it.

I think we may already have the former.
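
One possible reading of this suggestion, sketched under assumptions: the class PidStore and the signatures of init and getId below are hypothetical and are not the real store API. A store joining an existing cluster starts from an ID supplied by the cluster, so that its counter stays comparable with the other members.

```cpp
#include <cstdint>

// Hypothetical store; init() and getId() are illustrative names only.
class PidStore {
  public:
    PidStore() : pid(0) {}

    // A store joining an existing cluster starts from the supplied ID
    // instead of 0.
    void init(std::uint64_t initialId) { pid = initialId; }

    // Every store operation (enqueue, dequeue, ...) advances the PID.
    void recordOperation() { ++pid; }

    // Current PID, which the cluster can sample at each config change.
    std::uint64_t getId() const { return pid; }

  private:
    std::uint64_t pid;
};
```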

Carl.

> Assisign manual recovery from a complete persistent cluster crash.
> ------------------------------------------------------------------
>
>                 Key: QPID-2220
>                 URL: https://issues.apache.org/jira/browse/QPID-2220
>             Project: Qpid
>          Issue Type: Improvement
>          Components: C++ Broker
>    Affects Versions: 0.5
>            Reporter: Alan Conway
>            Assignee: Alan Conway
>
> If every member of a persistent cluster crashes then manual intervention is required to identify which store is most up-to-date, so it can be used to recover. We need to provide tools to assist in this identification.
> The cluster can save a config-change counter with each config change (cluster membership change). In recovery, the broker with the highest config-change counter has the best store. However if the last brokers in the cluster crash so close together that none can record a config-change we need an additional decider.
> The store at http://qpidcomponents.org/download.html#persistence maintains a global Record Identifier (RID), a 64 bit value that is incremented for each enqueue and dequeue. If the cluster stores  (config-change,RID) pairs then in recovery we can use actual-RID - RID at config-change as a tiebreaker.
> Proposed change to MessageStore API:
>   /** Returns a monotonically increasing value reflecting the number of changes to the store.
>   * The value can wrap-around to 0.
>   * Stores need not implement this function, they can simply return 0.
>   */
>   uint64_t getChangeCounter();
> The default implementation just returns 0  and the cluster must fall back to relying on config-change counts.



[jira] Updated: (QPID-2220) Assisign manual recovery from a complete persistent cluster crash.

Posted by "Alan Conway (JIRA)" <qp...@incubator.apache.org>.
     [ https://issues.apache.org/jira/browse/QPID-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway updated QPID-2220:
------------------------------

    Description: 
If every member of a persistent cluster crashes then manual intervention is required to identify which store is most up-to-date, so it can be used to recover. We need to provide tools to assist in this identification.

The cluster can save a config-change counter with each config change (cluster membership change). In recovery, the broker with the highest config-change counter has the best store. However if the last brokers in the cluster crash so close together that none can record a config-change we need an additional decider.

The store at http://qpidcomponents.org/download.html#persistence maintains a global Record Identifier (RID), a 64 bit value that is incremented for each enqueue and dequeue. If the cluster stores  (config-change,RID) pairs then in recovery we can use actual-RID - RID at config-change as a tiebreaker.

Proposed change to MessageStore API:
  /** Returns a monotonically increasing value reflecting the number of changes to the store.
  * The value can wrap-around to 0.
  * Stores need not implement this function, they can simply return 0.
  */
  uint64_t getChangeCounter();

The default implementation just returns 0  and the cluster must fall back to relying on config-change counts.
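
A sketch of how the proposed function might sit on the interface. The MessageStore class here is a simplified stand-in showing only the proposed function, not the real broker interface, and CountingStore is a hypothetical store used for illustration.

```cpp
#include <cstdint>

// Simplified stand-in for the broker's MessageStore interface.
class MessageStore {
  public:
    virtual ~MessageStore() = default;

    /** Returns a monotonically increasing value reflecting the number of changes to the store.
     * The value can wrap-around to 0.
     * Stores need not implement this function, they can simply return 0.
     */
    virtual std::uint64_t getChangeCounter() { return 0; }
};

// Hypothetical store that counts every enqueue and dequeue.
class CountingStore : public MessageStore {
  public:
    void enqueue() { ++changes; }
    void dequeue() { ++changes; }
    std::uint64_t getChangeCounter() override { return changes; }

  private:
    std::uint64_t changes = 0;
};
```

With a default in the base class, existing stores compile unchanged and the cluster sees 0 from any store that does not override the function.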

  was:
If every member of a persistent cluster crashes then manual intervention is required to identify which store is most up-to-date, so it can be used to recover.
We need to provide tools to assist in this identification.

The cluster can save a config-change counter with each config change. In recovery, the broker with the highest config-change counter has the best store. However if the last brokers in the cluster crash so close together that none can record a config-change we need an additional decider.

The store at http://qpidcomponents.org/download.html#persistence maintains a global counter called the RecordIdentifier (RID) that is incremented for each enqueue and dequeue. If the cluster stores  (config-change,RID) pairs then in recovery we can use actual-RID - RID at config-change as a tiebreaker.

Is it reasonable to provide access to this counter in the generic MessageStore API? Stores that don't implement it can simply return 0, and the cluster must fall back to relying on config-change counts.




[jira] Commented: (QPID-2220) Assisting manual recovery from a complete persistent cluster crash.

Posted by "Alan Conway (JIRA)" <qp...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/QPID-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850901#action_12850901 ] 

Alan Conway commented on QPID-2220:
-----------------------------------

As of r916475, the last survivor in a cluster automatically marks its store as clean, so the only way we can end up with no clean store is if N>1 members fail so close together that none of them receives a config-change showing them to be the last member.  For that we need a counter from the store as described above.



[jira] Commented: (QPID-2220) Assisting manual recovery from a complete persistent cluster crash.

Posted by "Carl Trieloff (JIRA)" <qp...@incubator.apache.org>.
    [ https://issues.apache.org/jira/browse/QPID-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12782616#action_12782616 ] 

Carl Trieloff commented on QPID-2220:
-------------------------------------

I would change recover to return the current PID on recover & not void. That should solve your issue.

I.e. I get the highest PID, and if mine is then not the latest, I re-init the store. Question: if I do an update, does the PID get synced? If not, this will not work.
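
A rough sketch of that flow, under assumptions: RecoverableStore, reinit and recoverOrReinit are hypothetical names, and the real recover signature is not shown in this thread. It compares raw PIDs and ignores wrap-around.

```cpp
#include <cstdint>

// Hypothetical store whose recover() returns the current PID instead of void.
class RecoverableStore {
  public:
    explicit RecoverableStore(std::uint64_t persistedPid) : pid(persistedPid) {}

    // Recover the store contents and report the PID found on disk.
    std::uint64_t recover() { return pid; }

    // Discard the local contents so the broker can take an update from the cluster.
    void reinit() { pid = 0; }

  private:
    std::uint64_t pid;
};

// Returns true if the local store is the latest and was kept,
// false if it was behind the highest PID in the cluster and was re-initialised.
inline bool recoverOrReinit(RecoverableStore& local, std::uint64_t highestPidInCluster) {
    const std::uint64_t mine = local.recover();
    if (mine < highestPidInCluster) {
        local.reinit();
        return false;
    }
    return true;
}
```

Whether the PID of a re-initialised store is brought in line by the subsequent update is the open question raised in the comment; the sketch does not answer it.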

Carl.



[jira] Updated: (QPID-2220) Assisting manual recovery from a complete persistent cluster crash.

Posted by "Alan Conway (JIRA)" <qp...@incubator.apache.org>.
     [ https://issues.apache.org/jira/browse/QPID-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alan Conway updated QPID-2220:
------------------------------

    Description: 
If every member of a persistent cluster crashes then manual intervention is required to identify which store is most up-to-date, so it can be used to recover. We need to provide tools to assist in this identification.

The cluster can save a config-change counter with each config change (cluster membership change). In recovery, the broker with the highest config-change counter has the best store. 

However if the last brokers in the cluster crash so close together that none can record a config-change we need an additional decider.
The store at http://qpidcomponents.org/download.html#persistence maintains a global Persistence ID, a 64 bit value that is incremented for each enqueue, dequeue. If the cluster stores  (config-change,PID) pairs then in recovery we can use actual-PID - config-change PID as a tiebreaker.

Proposed change to MessageStore API:
  /** Returns a monotonically increasing value reflecting changes to the store.
  * The value can wrap-around to 0.
  * Stores need not implement this function, they can simply return 0.
  */
  uint64_t getChangeCounter();

The default implementation just returns 0  and the cluster must fall back to relying on config-change counts.

  was:
If every member of a persistent cluster crashes then manual intervention is required to identify which store is most up-to-date, so it can be used to recover. We need to provide tools to assist in this identification.

The cluster can save a config-change counter with each config change (cluster membership change). In recovery, the broker with the highest config-change counter has the best store. However if the last brokers in the cluster crash so close together that none can record a config-change we need an additional decider.

The store at http://qpidcomponents.org/download.html#persistence maintains a global Record Identifier (RID), a 64 bit value that is incremented for each enqueue and dequeue. If the cluster stores  (config-change,RID) pairs then in recovery we can use actual-RID - RID at config-change as a tiebreaker.

Proposed change to MessageStore API:
  /** Returns a monotonically increasing value reflecting the number of changes to the store.
  * The value can wrap-around to 0.
  * Stores need not implement this function, they can simply return 0.
  */
  uint64_t getChangeCounter();

The default implementation just returns 0  and the cluster must fall back to relying on config-change counts.

