Posted to issues@activemq.apache.org by "Francesco Nigro (Jira)" <ji...@apache.org> on 2021/06/14 16:35:00 UTC
[jira] [Commented] (ARTEMIS-3345) Shared-Nothing Replication Master loses its Node ID on failed fail-back

    [ https://issues.apache.org/jira/browse/ARTEMIS-3345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17363061#comment-17363061 ] 

Francesco Nigro commented on ARTEMIS-3345:
------------------------------------------

https://issues.apache.org/jira/browse/ARTEMIS-2716 introduced an additional parameter on server data file rotation that preserves the previous lock file, if any, in order to keep the Node ID in the fail-back case. We still need to check how it interacts with colocation.

> Shared-Nothing Replication Master loses its Node ID on failed fail-back
> -----------------------------------------------------------------------
>
>                 Key: ARTEMIS-3345
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3345
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>          Components: Broker
>    Affects Versions: 2.17.0
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Major
>
> A failing-back master forgets its Node ID and, on broker restart, having a different Node ID, can become live without searching for any existing live with its previous Node ID.
> This happens because of the following sequence in {{SharedNothingBackupActivation}}:
>  # {{SharedNothingBackupActivation::init}} calls {{activeMQServer.resetNodeManager}}, which re-creates a {{NodeManager}} with {{replicatingBackup == true}}
>  # {{SharedNothingBackupActivation::run}} then executes:
> {code:java}
>          // move all data away:
>          activeMQServer.getNodeManager().stop();
>          activeMQServer.moveServerData(replicaPolicy.getMaxSavedReplicatedJournalsSize());
>          activeMQServer.getNodeManager().start();
> {code}
> The server data rotation simply cleans up everything on the data path, including the lock file.
> {{NodeManager::start}}, because {{replicatingBackup == true}}, skips setting up a new lock file (so no lock file exists at this point)
>  # this broker sets an in-memory Node ID after a successful sync with the live, using {{NodeManager::setNodeID}}
>  # *if* this broker fails over (or fails back, given that it is a master), {{activeMQServer.getNodeManager().stopBackup()}} sets up the lock file with the previously set Node ID, see
> {code:java}
>    @Override
>    public void stopBackup() throws NodeManagerException {
>       if (replicatedBackup && getNodeId() != null) {
>          try {
>             setUpServerLockFile();
>          } catch (IOException e) {
>             throw new NodeManagerException(e);
>          }
>       }
>       super.stopBackup();
>    }
> {code}
> This process shows that if anything goes wrong before the Node ID is written to durable storage, either because the broker was unable to become live (no majority, or the existing live is still alive) or because of a restart with unlucky timing, the broker won't have any lock file and simply forgets its original Node ID.
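
A minimal, self-contained sketch of the sequence described above, for illustration only. {{ToyNodeManager}}, {{simulate}} and the way the {{server.lock}} file is used here are hypothetical stand-ins, not the Artemis classes. The restart is also simplified: the broker is assumed to start as a live that generates a fresh Node ID when no lock file is found.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.UUID;

public class FailBackSketch {

   /** Toy stand-in for a file-based node manager; NOT the Artemis implementation. */
   static final class ToyNodeManager {
      private final Path lockFile;
      private final boolean replicatedBackup;
      private String nodeId; // in-memory Node ID

      ToyNodeManager(Path dataDir, boolean replicatedBackup) {
         this.lockFile = dataDir.resolve("server.lock"); // illustrative file name
         this.replicatedBackup = replicatedBackup;
      }

      void start() throws IOException {
         if (Files.exists(lockFile)) {
            // Reuse the Node ID persisted by a previous run.
            nodeId = new String(Files.readAllBytes(lockFile), StandardCharsets.UTF_8);
         } else if (!replicatedBackup) {
            // A live with no lock file generates and persists a new Node ID.
            nodeId = UUID.randomUUID().toString();
            writeLockFile();
         }
         // A replicated backup with no lock file creates nothing on disk.
      }

      void setNodeId(String id) { nodeId = id; }

      String getNodeId() { return nodeId; }

      void stopBackup() throws IOException {
         // Only at this point does the Node ID reach durable storage.
         if (replicatedBackup && nodeId != null) {
            writeLockFile();
         }
      }

      private void writeLockFile() throws IOException {
         Files.write(lockFile, nodeId.getBytes(StandardCharsets.UTF_8));
      }
   }

   /** Returns {original Node ID, Node ID seen after the restart}. */
   public static String[] simulate(boolean reachesStopBackup) {
      try {
         Path dataDir = Files.createTempDirectory("failback-sketch");

         // First life of the master: it is live and owns a persisted Node ID.
         ToyNodeManager live = new ToyNodeManager(dataDir, false);
         live.start();
         String original = live.getNodeId();

         // Fail-back attempt: data rotation removes everything, lock file included.
         Files.deleteIfExists(dataDir.resolve("server.lock"));
         ToyNodeManager backup = new ToyNodeManager(dataDir, true);
         backup.start();             // skips creating a lock file
         backup.setNodeId(original); // in-memory only, after syncing with the live
         if (reachesStopBackup) {
            backup.stopBackup();     // persists the Node ID
         }

         // Restart: the broker comes back and looks for its lock file.
         ToyNodeManager restarted = new ToyNodeManager(dataDir, false);
         restarted.start();
         return new String[] {original, restarted.getNodeId()};
      } catch (IOException e) {
         throw new UncheckedIOException(e);
      }
   }

   public static void main(String[] args) {
      String[] lost = simulate(false);
      System.out.println("fail-back interrupted, same Node ID? " + lost[0].equals(lost[1]));
      String[] kept = simulate(true);
      System.out.println("fail-back completed, same Node ID? " + kept[0].equals(kept[1]));
   }
}
```

Running {{main}} reports a different Node ID when the fail-back is interrupted before {{stopBackup()}} and the same Node ID when it completes, which is the window this issue is about.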



--
This message was sent by Atlassian Jira
(v8.3.4#803005)