Posted to issues@ignite.apache.org by "Mikhail Petrov (Jira)" <ji...@apache.org> on 2022/11/22 13:23:00 UTC

[jira] [Updated] (IGNITE-17737) Cluster snapshots may be inconsistent under load.

     [ https://issues.apache.org/jira/browse/IGNITE-17737?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mikhail Petrov updated IGNITE-17737:
------------------------------------
    Release Note: Fixed snapshot inconsistency when a snapshot was taken under cache workload.

> Cluster snapshots may be inconsistent under load. 
> --------------------------------------------------
>
>                 Key: IGNITE-17737
>                 URL: https://issues.apache.org/jira/browse/IGNITE-17737
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Nikita Amelchev
>            Assignee: Mikhail Petrov
>            Priority: Major
>              Labels: ise
>         Attachments: SnapshotTest.java
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Cluster snapshots may be inconsistent under load. 
> Reproducer:
> One thread performs a transactional load: cache#put into a transactional cache.
> Another thread takes periodic snapshots and checks them.
> Reproducer attached (flaky; please re-run several times).
> Example of a fail:
> {noformat}
> [2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] The check procedure has failed, conflict partitions has been found: [counterConflicts=1, hashConflicts=1]
> [2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] Update counter conflicts:
> [2022-09-21T19:35:51,158][WARN ][async-runnable-runner-1][] Conflict partition: PartitionKeyV2 [grpId=1544803905, grpName=default, partId=432]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Partition instances: [PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest2, updateCntr=21, partitionState=OWNING, size=19, partHash=1245894112], PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest0, updateCntr=22, partitionState=OWNING, size=20, partHash=1705601802], PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest1, updateCntr=21, partitionState=OWNING, size=19, partHash=1245894112]]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] 
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Hash conflicts:
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Conflict partition: PartitionKeyV2 [grpId=1544803905, grpName=default, partId=432]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] Partition instances: [PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest2, updateCntr=21, partitionState=OWNING, size=19, partHash=1245894112], PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest0, updateCntr=22, partitionState=OWNING, size=20, partHash=1705601802], PartitionHashRecordV2 [isPrimary=false, consistentId=snapshot.SnapshotTest1, updateCntr=21, partitionState=OWNING, size=19, partHash=1245894112]]
> [2022-09-21T19:35:51,159][WARN ][async-runnable-runner-1][] 
> {noformat}
> The following sequence of steps leads to this behaviour:
> 1. Client node starts a cache key update operation on the current topology version.
> 2. Simultaneously, a snapshot operation is started. It causes a PME (free switch) that increments the current minor topology version.
> 3. The node that is primary for the key being updated completes the PME locally, starts the snapshot partition copy procedure, and proceeds with the update request, ignoring the fact that it was initiated on a stale topology (see IGNITE-9558). Therefore, the primary node will not include the updated key in the snapshot.
> 4. Backup nodes have not yet completed the PME, so the snapshot has not been started on them.
> 5. Backup nodes receive requests to update the key. Since the update operation was mapped to an already completed topology version, the backup nodes apply the update, ignoring the fact that the PME related to the snapshot operation is in progress.
> 6. Backup nodes complete the PME and finish the snapshot procedure.
> 7. As a result, the snapshot from the backup nodes includes the updated key, while the snapshot from the primary node does not.
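
The ordering problem in the steps above can be sketched with a stdlib-only toy model (plain Java, not Ignite code; the `Node`, `applyUpdate`, and `takeSnapshot` names are invented for illustration). The primary node takes its local snapshot copy before applying the stale-topology update, while the backup applies the update first and snapshots afterwards, so the two snapshot copies and their update counters diverge:

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the snapshot/PME race described in steps 1-7. Not Ignite code. */
public class SnapshotRaceSketch {
    /** A node holding one partition, its update counter, and a local snapshot copy. */
    static class Node {
        final Map<String, String> partition = new HashMap<>();
        long updateCntr;
        Map<String, String> snapshot; // copy taken when the node locally completes PME
        long snapshotCntr;

        void applyUpdate(String key, String val) {
            partition.put(key, val);
            updateCntr++;
        }

        void takeSnapshot() {
            snapshot = new HashMap<>(partition);
            snapshotCntr = updateCntr;
        }
    }

    public static void main(String[] args) {
        Node primary = new Node();
        Node backup = new Node();

        // Steps 1-3: the primary completes the snapshot PME locally and copies its
        // partition, then applies the update that was mapped to the stale topology.
        primary.takeSnapshot();
        primary.applyUpdate("key", "value");

        // Steps 4-6: the backup has not completed the PME yet, so it applies the
        // update first and takes its snapshot copy afterwards.
        backup.applyUpdate("key", "value");
        backup.takeSnapshot();

        // Step 7: the snapshot copies diverge -- a counter and hash conflict,
        // matching the "Conflict partition" warnings in the log above.
        System.out.println("primary snapshot cntr=" + primary.snapshotCntr
            + ", backup snapshot cntr=" + backup.snapshotCntr
            + ", conflict=" + !primary.snapshot.equals(backup.snapshot));
    }
}
```

In the real system the two orderings happen on different nodes concurrently; the sketch only serializes them to show why the per-node snapshot contents can disagree even though every node eventually applies the same update.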



--
This message was sent by Atlassian Jira
(v8.20.10#820010)