Posted to issues@activemq.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/10/20 20:22:27 UTC
[jira] [Commented] (ARTEMIS-256) Orchestrate fail-back deterministically
[ https://issues.apache.org/jira/browse/ARTEMIS-256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14965506#comment-14965506 ]
ASF GitHub Bot commented on ARTEMIS-256:
----------------------------------------
GitHub user jbertram opened a pull request:
https://github.com/apache/activemq-artemis/pull/204
ARTEMIS-256 orchestrate failback deterministically
The failback process needs to be deterministic rather than relying on various
incarnations of Thread.sleep() at crucial points. Important aspects of this
change include:
1) Make the initial replication synchronization process block at the very
last step and wait for a response from the replica to ensure the replica has
all the necessary data. This is a critical piece of knowledge during the
failback process because it allows the soon-to-become-backup server to know
for sure when it can shut itself down and allow the soon-to-become-live
server to take over. Also, introduce a new configuration element called
"initial-replication-sync-timeout" to control how long this blocking will occur.
2) Set the state of the server as 'LIVE' only after the server is fully
started. This is necessary because once the soon-to-be-backup server shuts
down it needs to know that the soon-to-be-live server has started fully before
it restarts itself as the new backup. If the soon-to-be-backup server restarts
before the soon-to-be-live is fully started then it won't actually become a
backup server but instead will become a live server which will break the
failback process.
3) Wait to receive the announcement of a backup server before failing-back.
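To illustrate point 1, here is a minimal sketch of waiting on an explicit
acknowledgement with a timeout instead of calling Thread.sleep(). It is not
code from this pull request: the class SyncGate and its methods are
hypothetical names, and the timeout field only stands in for the value that
"initial-replication-sync-timeout" would supply.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical sketch: block until the replica confirms that the initial
 * synchronization is complete, or give up after a configurable timeout.
 * None of these names are taken from the Artemis code base.
 */
public class SyncGate {
    // Released exactly once, when the replica's response arrives.
    private final CountDownLatch replicaAck = new CountDownLatch(1);
    // Stand-in for the configured initial-replication-sync-timeout value.
    private final long timeoutMillis;

    public SyncGate(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called by an (assumed) response handler when the replica answers. */
    public void onReplicaResponse() {
        replicaAck.countDown();
    }

    /** Returns true if the replica confirmed in time, false on timeout or interrupt. */
    public boolean awaitReplicaInSync() {
        try {
            return replicaAck.await(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            // Preserve the interrupt for callers and treat it as "not confirmed".
            Thread.currentThread().interrupt();
            return false;
        }
    }

    public static void main(String[] args) {
        SyncGate gate = new SyncGate(2000);
        // Simulate the replica answering from another thread.
        new Thread(gate::onReplicaResponse).start();
        if (gate.awaitReplicaInSync()) {
            System.out.println("replica confirmed in sync");
        } else {
            System.out.println("replica not confirmed before timeout");
        }
    }
}
```

The caller learns from the return value whether the replica actually
confirmed, which a fixed sleep cannot tell it; the timeout only bounds how
long the wait can last.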
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/jbertram/activemq-artemis ARTEMIS-256
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/activemq-artemis/pull/204.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #204
----
commit 908776eff3e5410b851aaaa1f62f7db764187acc
Author: jbertram <jb...@apache.org>
Date: 2015-10-14T17:07:17Z
ARTEMIS-256 orchestrate failback deterministically
The failback process needs to be deterministic rather than relying on various
incarnations of Thread.sleep() at crucial points. Important aspects of this
change include:
1) Make the initial replication synchronization process block at the very
last step and wait for a response from the replica to ensure the replica has
all the necessary data. This is a critical piece of knowledge during the
failback process because it allows the soon-to-become-backup server to know
for sure when it can shut itself down and allow the soon-to-become-live
server to take over. Also, introduce a new configuration element called
"initial-replication-sync-timeout" to control how long this blocking will occur.
2) Set the state of the server as 'LIVE' only after the server is fully
started. This is necessary because once the soon-to-be-backup server shuts
down it needs to know that the soon-to-be-live server has started fully before
it restarts itself as the new backup. If the soon-to-be-backup server restarts
before the soon-to-be-live is fully started then it won't actually become a
backup server but instead will become a live server which will break the
failback process.
3) Wait to receive the announcement of a backup server before failing-back.
----
> Orchestrate fail-back deterministically
> ---------------------------------------
>
> Key: ARTEMIS-256
> URL: https://issues.apache.org/jira/browse/ARTEMIS-256
> Project: ActiveMQ Artemis
> Issue Type: Bug
> Affects Versions: 1.1.0
> Reporter: Justin Bertram
> Assignee: Justin Bertram
> Fix For: 1.1.1
>
>
> Currently fail-back using replication relies on a simple delay (i.e. failback-delay) to determine when the live broker has synchronized all its data to the backup broker (which will become the new live broker once fail-back completes). However, if failback-delay is too short then synchronization won't complete and both the live and backup brokers will be nonfunctional.