Posted to issues@ignite.apache.org by "Roman Puchkovskiy (Jira)" <ji...@apache.org> on 2022/12/30 14:12:00 UTC

[jira] [Assigned] (IGNITE-18428) After a RAFT snapshot install timed out, subsequent installs consistently failed

     [ https://issues.apache.org/jira/browse/IGNITE-18428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Puchkovskiy reassigned IGNITE-18428:
------------------------------------------

    Assignee: Roman Puchkovskiy

> After a RAFT snapshot install timed out, subsequent installs consistently failed
> --------------------------------------------------------------------------------
>
>                 Key: IGNITE-18428
>                 URL: https://issues.apache.org/jira/browse/IGNITE-18428
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Roman Puchkovskiy
>            Assignee: Roman Puchkovskiy
>            Priority: Major
>              Labels: ignite-3
>             Fix For: 3.0.0-beta2
>
>         Attachments: test.log.txt
>
>
> If a RAFT snapshot installation takes longer than the corresponding timeout (10 seconds in this case), a retry is attempted. If the retry finds an ongoing snapshot copier, it tries to cancel it, so that the next retry starts the installation over.
> In one run of a test, the initial attempt to install a snapshot failed, but every subsequent attempt only tried to cancel the ongoing installation and none of them actually started another copier, so the retries looped forever.
> Normally, {{onSnapshotLoadDone()}} is invoked even if the snapshot load has failed, to clean everything up and make the next install attempt possible. This cleanup includes nullifying {{downloadingSnapshot}} in {{SnapshotExecutorImpl}}. But this time, according to the log, {{onSnapshotLoadDone()}} was never invoked, so the old snapshot remained 'downloading' forever.
> This could have something to do with the fact that {{IncomingSnapshotCopier}} does not set its status to error (with {{setError()}}) on cancellation, as {{LocalSnapshotCopier}} does.
> Also, there could be a race.
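
The failure mode described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual Ignite or JRaft code: the classes SnapshotRetrySketch, Copier and Installer, the reportsOnCancel flag, and the method names are all hypothetical and exist only for illustration. The model assumes that an install attempt which finds a copier in progress merely cancels it, and that the 'downloading' slot is cleared only by the copier's completion callback.

```java
import java.util.concurrent.atomic.AtomicReference;

/** Toy model of the retry logic. All names are hypothetical, not Ignite's real classes. */
final class SnapshotRetrySketch {

    /** A copier that may or may not invoke the completion callback when cancelled. */
    static final class Copier {
        private final boolean reportsOnCancel;
        private final Runnable onLoadDone;

        Copier(boolean reportsOnCancel, Runnable onLoadDone) {
            this.reportsOnCancel = reportsOnCancel;
            this.onLoadDone = onLoadDone;
        }

        void cancel() {
            // Assumption: cleanup happens only if cancellation is reported as a finished load.
            if (reportsOnCancel) {
                onLoadDone.run();
            }
        }
    }

    /** Holds the 'downloading' slot and performs install attempts. */
    static final class Installer {
        private final AtomicReference<Copier> downloading = new AtomicReference<>();
        private final boolean copierReportsOnCancel;
        private int copiersStarted;

        Installer(boolean copierReportsOnCancel) {
            this.copierReportsOnCancel = copierReportsOnCancel;
        }

        /** Returns true if this attempt started a new copier, false if it only cancelled. */
        boolean installAttempt() {
            Copier current = downloading.get();
            if (current != null) {
                // An install is already in progress: cancel it and let the next retry start over.
                current.cancel();
                return false;
            }
            // The completion callback is the only place where the slot is cleared.
            downloading.set(new Copier(copierReportsOnCancel, () -> downloading.set(null)));
            copiersStarted++;
            return true;
        }

        int copiersStarted() {
            return copiersStarted;
        }
    }

    /** Runs the given number of install attempts and returns how many copiers were started. */
    static int copiersStartedAfter(boolean reportsOnCancel, int attempts) {
        Installer installer = new Installer(reportsOnCancel);
        for (int i = 0; i < attempts; i++) {
            installer.installAttempt();
        }
        return installer.copiersStarted();
    }

    public static void main(String[] args) {
        System.out.println(copiersStartedAfter(true, 5));
        System.out.println(copiersStartedAfter(false, 5));
    }
}
```

In this model, when cancellation triggers the callback, five attempts start three copiers (start, cancel, start, cancel, start). When it does not, only the first attempt starts a copier and every later attempt just cancels the stale one, which matches the stuck behaviour in the report. Whether the real code fails for this reason or because of a race is not established here.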



--
This message was sent by Atlassian Jira
(v8.20.10#820010)