Posted to issues@activemq.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/10/20 20:22:27 UTC
[jira] [Commented] (ARTEMIS-256) Orchestrate fail-back deterministically

    [ https://issues.apache.org/jira/browse/ARTEMIS-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965506#comment-14965506 ] 

ASF GitHub Bot commented on ARTEMIS-256:
----------------------------------------

GitHub user jbertram opened a pull request:

    https://github.com/apache/activemq-artemis/pull/204

    ARTEMIS-256 orchestrate failback deterministically

    The failback process needs to be deterministic rather than relying on various
    incarnations of Thread.sleep() at crucial points. Important aspects of this
    change include:
    
    1) Make the initial replication synchronization process block at the very
    last step and wait for a response from the replica to ensure the replica has
    all the necessary data. This is a critical piece of knowledge during the
    failback process because it allows the soon-to-become-backup server to know
    for sure when it can shut itself down and allow the soon-to-become-live
    server to take over. Also, introduce a new configuration element called
    "initial-replication-sync-timeout" to control how long this blocking will occur.
    
    2) Set the state of the server as 'LIVE' only after the server is fully
    started. This is necessary because once the soon-to-be-backup server shuts
    down it needs to know that the soon-to-be-live server has started fully before
    it restarts itself as the new backup. If the soon-to-be-backup server restarts
    before the soon-to-be-live server is fully started, it won't actually become a
    backup server but will instead become a live server, which will break the
    failback process.
    
    3) Wait to receive the announcement of a backup server before failing-back.
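
For illustration only, the blocking step in 1) can be sketched as a latch that the
replica's response releases, with a bounded wait standing in for
"initial-replication-sync-timeout". This is a minimal sketch, not the code in this
pull request: the class InitialSyncGate, its method names, and the 5000 ms value
are all hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch of a bounded, deterministic wait for the replica's
 * "initial synchronization done" response. Not the Artemis implementation.
 */
public class InitialSyncGate {
    // Released exactly once, when the replica confirms it has all the data.
    private final CountDownLatch replicaAck = new CountDownLatch(1);
    // Stand-in for the "initial-replication-sync-timeout" setting (assumed to be in ms).
    private final long timeoutMillis;

    public InitialSyncGate(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called by whatever handles the replica's response (hypothetical hook). */
    public void onReplicaSyncDone() {
        replicaAck.countDown();
    }

    /**
     * Blocks until the replica has confirmed or the timeout elapses.
     * Returns true only if the confirmation arrived in time.
     */
    public boolean awaitReplicaSync() {
        try {
            return replicaAck.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller and report "not confirmed".
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        InitialSyncGate gate = new InitialSyncGate(5_000);
        // Simulate the replica's response arriving on another thread.
        new Thread(gate::onReplicaSyncDone).start();
        if (gate.awaitReplicaSync()) {
            System.out.println("replica confirmed sync; safe to shut down for failback");
        } else {
            System.out.println("no confirmation before timeout; do not fail back");
        }
    }
}
```

Unlike a fixed Thread.sleep(), the wait ends as soon as the response arrives, and a
false return tells the caller that synchronization was never confirmed.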

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jbertram/activemq-artemis ARTEMIS-256

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/activemq-artemis/pull/204.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #204
    
----
commit 908776eff3e5410b851aaaa1f62f7db764187acc
Author: jbertram <jb...@apache.org>
Date:   2015-10-14T17:07:17Z

    ARTEMIS-256 orchestrate failback deterministically
    
    The failback process needs to be deterministic rather than relying on various
    incarnations of Thread.sleep() at crucial points. Important aspects of this
    change include:
    
    1) Make the initial replication synchronization process block at the very
    last step and wait for a response from the replica to ensure the replica has
    all the necessary data. This is a critical piece of knowledge during the
    failback process because it allows the soon-to-become-backup server to know
    for sure when it can shut itself down and allow the soon-to-become-live
    server to take over. Also, introduce a new configuration element called
    "initial-replication-sync-timeout" to control how long this blocking will occur.
    
    2) Set the state of the server as 'LIVE' only after the server is fully
    started. This is necessary because once the soon-to-be-backup server shuts
    down it needs to know that the soon-to-be-live server has started fully before
    it restarts itself as the new backup. If the soon-to-be-backup server restarts
    before the soon-to-be-live server is fully started, it won't actually become a
    backup server but will instead become a live server, which will break the
    failback process.
    
    3) Wait to receive the announcement of a backup server before failing-back.

----


> Orchestrate fail-back deterministically
> ---------------------------------------
>
>                 Key: ARTEMIS-256
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-256
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>    Affects Versions: 1.1.0
>            Reporter: Justin Bertram
>            Assignee: Justin Bertram
>             Fix For: 1.1.1
>
>
> Currently, fail-back using replication relies on a simple delay (i.e. failback-delay) to determine when the live broker has synchronized all its data to the backup broker (which will become the new live broker once fail-back completes). However, if failback-delay is too short, synchronization won't complete and both the live and backup brokers will be nonfunctional.


