Posted to issues@ignite.apache.org by "Roman Puchkovskiy (Jira)" <ji...@apache.org> on 2022/12/30 14:12:00 UTC
[jira] [Assigned] (IGNITE-18428) After a RAFT snapshot install timed out, subsequent installs consistently failed
[ https://issues.apache.org/jira/browse/IGNITE-18428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Roman Puchkovskiy reassigned IGNITE-18428:
------------------------------------------
Assignee: Roman Puchkovskiy
> After a RAFT snapshot install timed out, subsequent installs consistently failed
> --------------------------------------------------------------------------------
>
> Key: IGNITE-18428
> URL: https://issues.apache.org/jira/browse/IGNITE-18428
> Project: Ignite
> Issue Type: Bug
> Reporter: Roman Puchkovskiy
> Assignee: Roman Puchkovskiy
> Priority: Major
> Labels: ignite-3
> Fix For: 3.0.0-beta2
>
> Attachments: test.log.txt
>
>
> If a RAFT snapshot installation takes longer than the corresponding timeout (10 seconds in this case), a retry is attempted. If the retry finds an ongoing snapshot copier, it tries to cancel it, so that on the next retry the installation will start over.
> In one run of a test, the initial attempt to install a snapshot failed, but then all subsequent attempts only tried to cancel the installation and none of them actually started another copier, so the retries looped indefinitely.
> Normally, {{onSnapshotLoadDone()}} is invoked even if the snapshot load has failed, in order to clean everything up and make the next install attempt possible. This cleanup includes nullifying the contents of {{downloadingSnapshot}} in {{SnapshotExecutorImpl}}. But this time, according to the log, {{onSnapshotLoadDone()}} was never invoked, so the old snapshot remained 'downloading' forever.
> This could have something to do with the fact that {{IncomingSnapshotCopier}} does not set its status to error (with {{setError()}}) on cancellation, as {{LocalSnapshotCopier}} does.
> There could also be a race condition.
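
The suspected failure mode can be illustrated with a toy model in Java. This is only a sketch under stated assumptions, not the actual Ignite or JRaft code: ToyCopier, ToyExecutor, reportsErrorOnCancel and copiersStartedAfter are hypothetical names invented for illustration, and only onSnapshotLoadDone and downloadingSnapshot echo names from the description. The model assumes that the copier never finishes on its own (as when the install times out) and that the cleanup callback runs on cancellation only if the copier reports an error.

```java
import java.util.concurrent.atomic.AtomicReference;

final class SnapshotRetrySketch {
    /** Hypothetical stand-in for a snapshot copier; not a real Ignite/JRaft class. */
    static final class ToyCopier {
        private final boolean reportsErrorOnCancel;
        private final Runnable onDone;

        ToyCopier(boolean reportsErrorOnCancel, Runnable onDone) {
            this.reportsErrorOnCancel = reportsErrorOnCancel;
            this.onDone = onDone;
        }

        /** Cancels the copy; the cleanup callback fires only if an error is reported. */
        void cancel() {
            if (reportsErrorOnCancel) {
                onDone.run();
            }
        }
    }

    /** Hypothetical stand-in for the snapshot executor. */
    static final class ToyExecutor {
        private final boolean copierReportsErrorOnCancel;
        private final AtomicReference<ToyCopier> downloadingSnapshot = new AtomicReference<>();
        private int copiersStarted;

        ToyExecutor(boolean copierReportsErrorOnCancel) {
            this.copierReportsErrorOnCancel = copierReportsErrorOnCancel;
        }

        /** One install attempt: cancel an ongoing copier if present, otherwise start a new one. */
        boolean installSnapshot() {
            ToyCopier ongoing = downloadingSnapshot.get();
            if (ongoing != null) {
                ongoing.cancel();
                return false; // this attempt only cancels; the next one is expected to start over
            }
            downloadingSnapshot.set(new ToyCopier(copierReportsErrorOnCancel, this::onSnapshotLoadDone));
            copiersStarted++;
            return true;
        }

        /** Cleanup that makes the next install attempt possible. */
        void onSnapshotLoadDone() {
            downloadingSnapshot.set(null);
        }

        int copiersStarted() {
            return copiersStarted;
        }
    }

    /** Runs the given number of install attempts and returns how many copiers were started. */
    static int copiersStartedAfter(int attempts, boolean copierReportsErrorOnCancel) {
        ToyExecutor executor = new ToyExecutor(copierReportsErrorOnCancel);
        for (int i = 0; i < attempts; i++) {
            executor.installSnapshot();
        }
        return executor.copiersStarted();
    }

    public static void main(String[] args) {
        System.out.println(copiersStartedAfter(5, false)); // prints 1
        System.out.println(copiersStartedAfter(5, true));  // prints 3
    }
}
```

In this model, a copier that stays silent on cancellation leaves downloadingSnapshot set, so every later attempt only cancels and no new copier is ever started. A copier that reports the error triggers the cleanup, and every other attempt starts a fresh copier. Whether the real code behaves this way, or whether a race is involved, still has to be confirmed against the attached log.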
--
This message was sent by Atlassian Jira
(v8.20.10#820010)