Posted to issues@ignite.apache.org by "Anton Vinogradov (Jira)" <ji...@apache.org> on 2021/02/24 08:16:00 UTC

[jira] [Comment Edited] (IGNITE-13873) Multi-cell transaction changes may not be visible (for some time) after the Cellular switch

    [ https://issues.apache.org/jira/browse/IGNITE-13873?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17289141#comment-17289141 ] 

Anton Vinogradov edited comment on IGNITE-13873 at 2/24/21, 8:15 AM:
---------------------------------------------------------------------

Fixed the issue and also simplified the Cellular switch code/logic.

1) Got rid of the useless "remote counters update on rollback on tx recovery" (PartitionCountersNeighborcastRequest/Response/Future).
Previously, each rollback (its gaps) was propagated by these messages to the other backups to solve the counter de-sync issue that happens because of partially prepared transactions.
"Partially prepared" means that some (not all) backups have the transaction, so when we take these gaps into account as updates, we must propagate them to the other nodes to avoid counter de-sync.
This approach may cause a lot of message sending/handling :(

This was, actually, a dirty hack to fit the "new" counters.
The strategy is now changed to ignore such updates (keeping them as gaps).
So now, partially prepared transactions and transactions that failed on preparation (not prepared at all) are handled in the same way.
Gaps are closed during counter finalization on node left, for partitions whose primary failed.
This allows us to avoid the "gaps tail" propagation we had before (when rolled-back transactions with counters higher than those of committed transactions required propagating the gaps tail to the other nodes).
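
To make the "keep as gaps" idea concrete, here is a minimal sketch (a simplified model with made-up names, NOT the real PartitionUpdateCounter): rolled-back ranges are simply never applied and stay as gaps, and finalization jumps over whatever remains instead of broadcasting it:

{code:java}
import java.util.Map;
import java.util.TreeMap;

/**
 * Sketch of a gap-aware update counter. A rollback simply never calls
 * update() for its range, so the range stays a gap and nothing has to
 * be sent to the other backups.
 */
class GapAwareCounter {
    /** All updates up to this counter are applied (no gaps below). */
    private long lwm;

    /** Ranges applied out of order: start -> size. */
    private final TreeMap<Long, Long> outOfOrder = new TreeMap<>();

    /** Applies the update range [start, start + size). */
    synchronized void update(long start, long size) {
        outOfOrder.put(start, size);

        // Advance the low-water mark through now-contiguous ranges.
        while (!outOfOrder.isEmpty() && outOfOrder.firstKey() == lwm)
            lwm += outOfOrder.pollFirstEntry().getValue();
    }

    /**
     * Counter finalization (the failed primary has left the topology):
     * the remaining gaps belong to transactions that will never commit,
     * so just jump over them.
     */
    synchronized void finalizeGaps() {
        Map.Entry<Long, Long> last = outOfOrder.lastEntry();

        if (last != null)
            lwm = last.getKey() + last.getValue();

        outOfOrder.clear();
    }

    synchronized long get() {
        return lwm;
    }
}
{code}

The key point is that both operations are purely local, so no PartitionCountersNeighborcastRequest-style messages are needed.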

2) Simplified the PME-free switch code.
Got rid of all latch-based synchronization during the switch, which was used to ... fit the dirty hack to ... fit the "new" counters (to wait until all gaps tails are propagated).
Now counters are synchronized on the latest committed tx counters by design, so there is no need for latches/sync.
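
Just to illustrate the difference (the names here are hypothetical, not the actual exchange code):

{code:java}
// Before (simplified): the switch had to wait on a distributed latch
// until every node confirmed that its rollback gaps were propagated:
//
//     gapsPropagatedLatch.await();
//     completeSwitchFuture();
//
// After (simplified): recovery alone drives the switch; counters
// converge to the latest committed tx counters with no extra sync.
recoveryFuture.listen(fut -> completeSwitchFuture());
{code}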

3) Fixed the Cellular switch code for transactions that affect more than one cell (see the issue description for details).
Got rid of the 'primaryNodes' computation.
Now each node just waits for the recovery of every tx that has a failed node among its primaries.
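
A sketch of the new waiting rule (built around {{IgniteInternalTx#transactionNodes()}}; the surrounding class and helper names are made up for illustration):

{code:java}
import java.util.ArrayList;
import java.util.Collection;
import java.util.UUID;

import org.apache.ignite.internal.IgniteInternalFuture;
import org.apache.ignite.internal.processors.cache.transactions.IgniteInternalTx;

class CellularSwitchSketch {
    /** Futures the Cellular switch on this node should wait for. */
    static Collection<IgniteInternalFuture<IgniteInternalTx>> txsToRecover(
        Iterable<IgniteInternalTx> activeTxs, UUID failedPrimary) {
        Collection<IgniteInternalFuture<IgniteInternalTx>> futs = new ArrayList<>();

        for (IgniteInternalTx tx : activeTxs) {
            // transactionNodes() maps each primary node id to its backups;
            // we wait for every tx whose primaries include the failed node.
            if (tx.transactionNodes().containsKey(failedPrimary))
                futs.add(tx.finishFuture());
        }

        return futs;
    }
}
{code}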

Also, I've checked the speedup using the [IEP-56 framework|https://cwiki.apache.org/confluence/display/IGNITE/IEP-56%3A+Distributed+environment+tests] on 48 real servers.
On the slide, you may see the difference between 2.9 and the PR.
The numbers represent the worst (max) latency observed on node failure, for the broken/alive cell.

!master (as 2.9) vs improved (as dev) .png|width=100%!

[~alex_pl], [~ascherbakov], Could you please review the changes?


> Multi-cell transaction changes may not be visible (for some time) after the Cellular switch
> --------------------------------------------------------------------------------------------
>
>                 Key: IGNITE-13873
>                 URL: https://issues.apache.org/jira/browse/IGNITE-13873
>             Project: Ignite
>          Issue Type: Improvement
>            Reporter: Anton Vinogradov (Obsolete, actual is "av")
>            Assignee: Anton Vinogradov
>            Priority: Critical
>              Labels: iep-45
>             Fix For: 2.11
>
>         Attachments: master (as 2.9) vs improved (as dev) .png
>
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> Transactions spanning several cells may be recovered after stale data has already been read.
> For example:
> We have 2 cells; the first contains the partitions for k1, the second for k2.
> A tx with put(k1,v1) and put(k2,v2) is started and prepared.
> Then a node from the first cell, which is the primary for k1, fails.
> Currently, the second cell (with k2) may finish the Cellular switch before the tx is recovered, so a stale data read is possible.
> Primaries from {{tx.transactionNodes()}} should be taken into account instead of the current logic, which awaits all transactions located on nodes that are backups of the failed primary.
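> A hypothetical test-style sketch of the race (helper names are made up; this is not an actual reproducer):
> {code:java}
> // Cell 1 owns k1, cell 2 owns k2.
> tx.put(k1, v1);
> tx.put(k2, v2);
> tx.prepare();                // prepared on both cells
>
> killPrimaryFor(k1);          // the primary for k1 in cell 1 fails
>
> // Cell 2 may finish its Cellular switch before the tx is recovered
> // (committed), so during that window a read on cell 2 still returns
> // the old value of k2 - the tx changes are temporarily not visible.
> Object maybeStale = get(k2); // not necessarily v2 yet
> {code}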


