Posted to issues@activemq.apache.org by "Gary Tully (Jira)" <ji...@apache.org> on 2021/06/28 12:02:00 UTC

[jira] [Commented] (ARTEMIS-3340) Replicated Journal quorum-based logical timestamp/version

    [ https://issues.apache.org/jira/browse/ARTEMIS-3340?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17370561#comment-17370561 ] 

Gary Tully commented on ARTEMIS-3340:
-------------------------------------

Thinking about naming: we have a replicated broker, a reliable distributed lock, some shared persistent state, and some coordinated local state transitions.

 

The shared persistent state, a monotonically increasing value, captures a coordinated local state transition. What should this value be called?

 

This is what we can have:

All state transitions are guarded by the distributed lock, which is used in an [advisory|https://unix.stackexchange.com/questions/147392/what-is-advisory-locking-on-files-that-unix-systems-typically-employs] manner to protect the shared persistent state and the local state (the journal, etc.).

 

The local state transitions:

For the lock owner (live), transitioning from UN_REPLICATED to REPLICATED, 0->1; and on failure to replicate while staying active, from REPLICATED to UN_REPLICATED, 1->2.

For a backup, finding a lock owner and transitioning from REPLICATING to INSYNC_REPLICA, x->1 (the 1 is propagated in the replication stream).

 

For a backup to take over, it gets the lock and verifies that the next state transition is 1->2. If the live spent some time UN_REPLICATED, then moving to 2 is not possible (the shared persistent state will already be at 2) and the backup is stale.

 

The value (1) here is part of the shared persistent state, and is shared with a replica to bring it in-sync.

It is used to validate an uncoordinated state transition, i.e. to make a unilateral decision about the next step.
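As a quick illustration, the transitions above can be modeled in a few lines of Java. This is a hedged sketch, not Artemis code: the AtomicLong merely stands in for the shared persistent state in the coordination service, and the method names are illustrative.

```java
import java.util.concurrent.atomic.AtomicLong;

public class SequenceCheck {
    // Stands in for the shared persistent state, guarded in practice
    // by the (advisory) distributed lock.
    static final AtomicLong shared = new AtomicLong(0);

    // Lock owner (live): UN_REPLICATED -> REPLICATED, 0 -> 1
    public static long startReplicating() { return shared.incrementAndGet(); }

    // Lock owner (live): replication fails but the broker stays active,
    // REPLICATED -> UN_REPLICATED, 1 -> 2
    public static long loseReplica() { return shared.incrementAndGet(); }

    // A backup that was in-sync at inSyncValue may take over only if the
    // next transition inSyncValue -> inSyncValue + 1 is still possible,
    // i.e. the shared value has not moved past it.
    public static boolean backupCanActivate(long inSyncValue) {
        return shared.get() == inSyncValue;
    }
}
```

If the live spent time UN_REPLICATED after the backup fell out of sync, the shared value will already be past the backup's in-sync value, so backupCanActivate returns false and the backup knows it is stale.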

 

It is a:

shared_state_transition_sequence_number

 

thoughts?

 

Along with that, it would make sense to store the broker identity, the IP address, etc.: any information that would help an operator identify the node that must be started to make the correct next state transition.

> Replicated Journal quorum-based logical timestamp/version
> ---------------------------------------------------------
>
>                 Key: ARTEMIS-3340
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3340
>             Project: ActiveMQ Artemis
>          Issue Type: Improvement
>            Reporter: Francesco Nigro
>            Priority: Major
>
> Shared-nothing replication can cause journal misalignment even without any split-brain event.
> There are several ways this can happen.
> Below are some scenarios that do not involve network partitions or drastic outages.
> Scenario 1:
>  # Master/primary starts as live, clients connect to it
>  # Backup becomes an in-sync replica
>  # User stops the live and the backup fails over
>  # *Backup serves clients alone, modifying its journal*
>  # User stops the backup
>  # User starts the master/primary: it becomes live with a journal misaligned from the most up-to-date one, i.e. the one on the stopped backup
> Scenario 2:
>  # Master/primary starts as live, clients connect to it
>  # Backup becomes an in-sync replica
>  # Connection glitch between backup -> live
>  # Backup starts trying to fail over (for {{vote-retries * vote-retry-wait}} milliseconds)
>  # *Live serves clients alone, modifying its journal*
>  # User stops the live
>  # Backup succeeds in failing over: it becomes live with a journal misaligned from the most up-to-date one, i.e. the one on the stopped live
> The main cause of this issue is that we allow *a single broker to serve clients*, despite being configured with HA, generating the journal misalignment.
>  The quorum service (classic or pluggable) just takes care of the mutually exclusive presence of a broker in the live role (vs a NodeID), without considering live-role ordering based on the most up-to-date journal.
> A possible solution is to use https://issues.apache.org/jira/browse/ARTEMIS-2716 with a quorum "logical timestamp/version" marking the age/ownership changes of the journal, in order to force the live to always have the most up-to-date journal. This means that such a value has to be saved locally and exchanged during the initial replica sync, involving both journal data and a core message protocol change (valid just for the replication channel, without impacting clients).
> In case of quorum service restart/outage, an admin must use a command/configuration to let a broker ignore the age of its journal and just force it to start.
> In addition, new journal CLI commands should be implemented to inspect the age of a (local) broker journal, or to query/force the quorum journal version, for troubleshooting reasons.
> It's very important to capture every possible event that causes the journal age/ownership to change,
>  e.g. possible scenario 2 (again):
>  # live broker starts because it matches the most up-to-date journal version, increasing it (locally and remotely) when it becomes fully alive
>  # backup finds it and trusts that, given it's live, it already has the most up-to-date journal for a specific NodeID
>  # live broker sends its journal files to the backup, along with its local journal version
>  # backup is now ready to fail over at any moment
>  # network glitch happens
>  # backup tries to become live for vote-retries times
>  # live detects the replication disconnection and *increments the journal version* (on the quorum and locally)
>  # live serves clients alone, modifying its journal
>  # an outage/stop causes the live to die
>  # backup detects that the *quorum journal version no longer matches its own local journal version*: it stops trying to become live
> The key parts related to journal age/version are:
>  * only the broker that is live can change the quorum (and local) journal version (with a monotonic increment)
>  * every ownership-change event must cause the journal age/version to change, e.g. starting as live, losing its backup, etc.
> Re the RI implementation using Apache Curator: this could use a separate [DistributedAtomicLong|https://curator.apache.org/apidocs/org/apache/curator/framework/recipes/atomic/DistributedAtomicLong.html] to manage the journal version.
> Although tempting, it's not a good idea to just use the data field on {{InterProcessSemaphoreV2}}, because:
> * there's no API to query it if no lease has been acquired (or created) yet
> * we need to "age" the journal independently from the lock acquisition/release process, e.g. a live that drops its replica needs to increment the journal version
> Although tempting, it's not a good idea to just use the last-alive broker connector identity instead of a journal version, because of the ABA problem (see https://en.wikipedia.org/wiki/ABA_problem).
> This versioning mechanism isn't without drawbacks: quorum journal versioning requires storing a local copy of the version in order to allow the broker to query and compare it with the quorum one on restart; having 2 separate, non-atomic operations means that there must be a way to reconcile/fix them in case of misalignment. As said above, this could be done with admin operations.
> Journal versioning changes the way the roles behave, but they still retain their key characteristics:
> - a backup should try to start as live if it has the most up-to-date journal and there is no other live around; otherwise, it can just rotate its journal and be available to replicate some live
> - a primary tries to fail back to any existing live with the most up-to-date journal, or awaits one to appear, without becoming live if it doesn't have the most up-to-date journal
> This would ensure that if both brokers are up and running and the backup allows the primary to fail back, the primary eventually becomes live and the backup replicates it, preserving the desired broker roles.
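The restart rules at the end of the quoted description can be sketched as a small decision function. This is a hedged illustration, not Artemis code: the enum, names, and long-based version are assumptions for the sake of the example.

```java
public class RoleDecision {
    public enum Role { LIVE, BACKUP, WAIT }

    // localVersion: the journal version this broker saved locally
    // quorumVersion: the journal version held in the shared quorum state
    // otherLivePresent: whether another broker already holds the live role
    public static Role decide(long localVersion, long quorumVersion,
                              boolean otherLivePresent) {
        if (otherLivePresent) {
            return Role.BACKUP;   // replicate the existing live
        }
        if (localVersion == quorumVersion) {
            return Role.LIVE;     // most up-to-date journal: safe to activate
        }
        return Role.WAIT;         // stale journal: await the up-to-date node
    }

    public static void main(String[] args) {
        // End of scenario 2: the backup's local version (1) no longer matches
        // the quorum version (2) after the live aged the journal alone.
        System.out.println(decide(1, 2, false)); // WAIT
        System.out.println(decide(2, 2, false)); // LIVE
        System.out.println(decide(2, 2, true));  // BACKUP
    }
}
```

The WAIT outcome is what prevents the stale broker from serving clients; per the description, only an explicit admin command/configuration should override it.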



--
This message was sent by Atlassian Jira
(v8.3.4#803005)