Posted to issues@activemq.apache.org by "Francesco Nigro (Jira)" <ji...@apache.org> on 2021/08/19 14:23:00 UTC

[jira] [Updated] (ARTEMIS-3430) Activation Sequence Auto-Repair

     [ https://issues.apache.org/jira/browse/ARTEMIS-3430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Francesco Nigro updated ARTEMIS-3430:
-------------------------------------
    Description: 
This can be seen as either a bug or an improvement over the existing self-heal behaviour of the activation sequence introduced by https://issues.apache.org/jira/browse/ARTEMIS-3340.

In short, the existing protocol to increase the activation sequence while un-replicated is:
# remote i -> -(i + 1), i.e. remote CLAIM
# local i -> (i + 1), i.e. local commit
# remote -(i + 1) -> (i + 1), i.e. remote COMMIT
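
The three steps can be sketched as a toy model. This is only an illustration, not Artemis code: the State class, the increase_unreplicated function and the stop_after parameter (used to mimic a crash between steps) are hypothetical names, and the check that refuses to start from an already claimed sequence is an assumption.

```python
from dataclasses import dataclass


@dataclass
class State:
    remote: int  # coordinated activation sequence; a negative value means "claimed"
    local: int   # activation sequence stored by the broker itself


def increase_unreplicated(state: State, stop_after: int = 3) -> State:
    """Run the three steps; stop_after < 3 mimics a crash after that step."""
    i = state.remote
    if i < 0:
        # Assumption: a claimed sequence must be settled before a new increase.
        raise ValueError("coordinated sequence is already claimed")
    state.remote = -(i + 1)   # 1. remote CLAIM
    if stop_after == 1:
        return state
    state.local = i + 1       # 2. local commit
    if stop_after == 2:
        return state
    state.remote = i + 1      # 3. remote COMMIT
    return state
```

With State(3, 3) as the starting point, a crash after step 1 leaves remote at -4 and local at 3, while a crash after step 2 leaves remote at -4 and local at 4. These are the two partial-failure states that matter for repair.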

This protocol has been designed to let witness brokers detect whether their data is no longer up-to-date and, when a partial failure happens while increasing the activation sequence, to save them from throwing that data away if it is still valuable.

In the current version, self-repair is allowed only if the live broker has performed step 2 but not step 3, i.e. the local activation sequence is updated, but the coordinated one isn't committed.
If the failing broker is restarted, it can "fix" the coordinated sequence and move on to become live again, but if step 2 fails (or just never happens), the coordinated activation sequence can only be fixed through admin intervention, after inspecting the local activation sequences.
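
The current rule can be read as a small decision function. This is a minimal sketch under the same toy encoding (a negative coordinated value means "claimed"); self_repair is a hypothetical name and the error raised stands in for "admin intervention required".

```python
def self_repair(remote: int, local: int) -> int:
    """Return the coordinated sequence after a restarted broker tries to repair it."""
    if remote >= 0:
        return remote  # nothing is claimed, nothing to repair
    if local == -remote:
        # Step 2 happened, step 3 did not: the broker finishes the COMMIT itself.
        return local
    # Step 2 failed or never happened: no local sequence matches the claim.
    raise RuntimeError("claimed sequence cannot be repaired without admin intervention")
```

A broker restarting with local sequence 4 against a coordinated value of -4 can commit it, while a broker with local sequence 3 cannot.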

Other brokers cannot "fix" the sequence because the local sequence of the failed broker is unknown: just rolling back the claimed one could make the failed broker believe that it has up-to-date data too, causing journal misalignments.

The solution can be to fix the claimed sequence and then forbid any broker from running un-replicated with it, by further increasing it *after* the repair: this would forcibly age brokers with "goodish" data, but would allow other brokers to auto-repair without admin intervention.
The sole drawback of this strategy is that a further failure of the repairing broker, while it is further increasing the sequence, gives it full and exclusive responsibility for the auto-repair, because no other broker can have a high-enough local sequence.
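
The proposed strategy might look like the following sketch. It is an interpretation of the description, not the actual fix: auto_repair and crash_before_commit are hypothetical names, and which brokers are eligible to start the repair is deliberately left out because the description does not state it.

```python
from typing import Tuple


def auto_repair(remote: int, crash_before_commit: bool = False) -> Tuple[int, int]:
    """Return (coordinated sequence, repairing broker's local sequence)."""
    if remote >= 0:
        raise ValueError("nothing to repair: sequence is not claimed")
    repaired = -remote
    remote = repaired            # settle the dangling claim
    remote = -(repaired + 1)     # CLAIM the next value right away
    local = repaired + 1         # local commit on the repairing broker
    if crash_before_commit:
        return remote, local     # mimic the repairing broker failing here
    remote = repaired + 1        # COMMIT: nobody can run un-replicated with `repaired`
    return remote, local
```

Starting from a coordinated value of -4, a full repair ends at 5, so a failed broker coming back with local sequence 3 or 4 sees a higher coordinated value and knows its data is stale. If the repairing broker fails before the final commit, the coordinated value stays at -5 and only that broker holds local sequence 5, which is the drawback described above.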

  was:
This can be seen both as a bug or an improvement over the existing self-heal behaviour of activation sequence introduced by https://issues.apache.org/jira/browse/ARTEMIS-3340.

In short, the existing protocol to increase activation sequence while un-replicated is:
# remote i -> -(i + 1) ie remote CLAIM 
# local i -> (i + 1) ie local commit
# remote -(i + 1) -> (i + 1) ie remote COMMIT

This protocol has been designed to allow witness brokers to acknowledge if their data is no longer up-to-date and to save them to throw away it, if it still have some value (because of a failure to commit sequence).

In the current version, self-repairing is allowed only if the live broker has performed 2. but not 3. ie local activation sequence is updated, but coordinated one isn't committed.
If the failing broker is restarted it can "fix" the coordinated sequence and move on to become live again, but if 2. fail (or just never happen), the coordinated activation sequence cannot be fixed if not with some admin intervention, after inspecting local activation sequences.

The reason why other brokers cannot "fix" the sequence is because the local sequence of the failed broker is unknown and just roll-backing the claimed one can makes the failed broker to believe to have up-to-date data too, causing journal misalignments.

The solution to this can be to fix claimed sequence, forbidding any broker to run un-replicated with it by further increasing it *after* repaired: this would age forcibly brokers with "goodish" data, but will allow others brokers to auto-repair without admin intervention.
The sole drawback of this strategy is that a further fail of the repairing broker while further increasing sequence will give it full and exclusive responsibility to auto-repair, because no other brokers can have an high-enough local sequence.


> Activation Sequence Auto-Repair
> -------------------------------
>
>                 Key: ARTEMIS-3430
>                 URL: https://issues.apache.org/jira/browse/ARTEMIS-3430
>             Project: ActiveMQ Artemis
>          Issue Type: Bug
>            Reporter: Francesco Nigro
>            Assignee: Francesco Nigro
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)