Posted to user@flink.apache.org by Sivaprasanna <si...@gmail.com> on 2020/07/30 12:45:12 UTC

Unable to recover from checkpoint

Hello,

We recently ran into an unexpected scenario. Our stateful streaming
pipeline uses RocksDB as the state backend and has incremental checkpointing
enabled. We have RETAIN_ON_CANCELLATION enabled, so previous
cancellations and restarts had left behind a lot of unattended checkpoint
directories amounting to almost 1 TB. Today we manually cleared these
directories and left the currently running job's checkpoint directory
untouched. A few hours later, the job failed due to an unrelated error,
but when it attempted to restore from the latest successful checkpoint, it
failed with java.io.FileNotFoundException: File does not exist:
/path/to/an/older/checkpoint/45a55300adab66d7cc49ff5e50ee5b62/shared/f7ace888-059b-4256-966c-51c1549aa6e4

So I have a few questions:
- Are we not supposed to clear these older checkpoint directories which
were created by previous runs of the pipeline?
- Does the /shared directory under the current checkpoint directory not
have all the necessary files to recover?
- What is the recommended procedure for clearing remnant checkpoint
directories? By remnant, I mean directories from previous runs of the job
that were cancelled and then manually restarted from the latest checkpoint
(let's say chk-123). The new job is running fine and has made further
checkpoints. Can we delete chk-123?
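For reference, here is our rough mental model of how an incremental
checkpoint could end up pointing into an older run's directory, sketched
as toy Python. This is only an illustration of our understanding, not
Flink's actual code; all names and paths in it are made up.

```python
# Toy model (not Flink code) of incremental checkpointing with a shared/
# directory. Each checkpoint uploads only SST files it has not seen
# before; unchanged files are referenced at the path where they were
# first uploaded -- which may be under an earlier run's directory.

class IncrementalCheckpointer:
    def __init__(self):
        self.uploaded = {}     # sst file name -> path it was uploaded to
        self.checkpoints = {}  # checkpoint id -> {sst name: path}

    def checkpoint(self, chk_id, current_ssts, base_dir):
        manifest = {}
        for sst in current_ssts:
            if sst in self.uploaded:
                # Unchanged file: reuse the copy uploaded earlier,
                # wherever it lives -- possibly an older job's directory.
                manifest[sst] = self.uploaded[sst]
            else:
                path = f"{base_dir}/shared/{sst}"
                self.uploaded[sst] = path
                manifest[sst] = path
        self.checkpoints[chk_id] = manifest
        return manifest


ckp = IncrementalCheckpointer()
# Run A uploads sst-1 and sst-2 under its own checkpoint directory.
ckp.checkpoint("chk-1", ["sst-1", "sst-2"], "/ckpts/jobA")
# After a restart (new job directory), only sst-3 is new; sst-1 is reused.
m = ckp.checkpoint("chk-2", ["sst-1", "sst-3"], "/ckpts/jobB")
print(m)
# chk-2 still points at jobA's shared directory for sst-1, so deleting
# jobA's directory would make chk-2 unrecoverable.
```

If this model is right, the shared directory of the current job does not
necessarily hold every file the latest checkpoint needs, which would
explain the FileNotFoundException above.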

Thanks,
Sivaprasanna

Re: Unable to recover from checkpoint

Posted by Congxian Qiu <qc...@gmail.com>.
Hi Sivaprasanna,
    With RocksDBStateBackend incremental checkpoints, the latest checkpoint
may still reference files uploaded for previous checkpoints (the files in
the shared directory), so deleting files that belong to a previous
checkpoint can lead to a FileNotFoundException. Currently, the only way to
determine which files belong to a specific checkpoint is to parse the
checkpoint metadata manually. FLINK-17571 tracks exposing the list of
files that belong to a specific checkpoint.
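Until then, one rough way to approximate parsing the metadata manually is
to scan the serialized _metadata file for path-like strings. This is only
a heuristic sketch (the helper below is hypothetical, not a Flink API),
and it assumes file references appear as readable text inside the
metadata, which may not hold for every version or serializer:

```python
import re

def referenced_paths(metadata_bytes, prefix=b"/"):
    """Heuristically extract path-like strings from a serialized
    checkpoint _metadata file. Relies on file references being embedded
    as readable text -- an assumption, not a contract. Always verify the
    result against a test restore before deleting anything."""
    # Printable runs that look like paths: a leading slash or word
    # character followed by at least four path-ish characters.
    candidates = re.findall(rb"[/\w][\w./-]{4,}", metadata_bytes)
    return sorted({c.decode() for c in candidates if c.startswith(prefix)})

# Hypothetical usage against a real checkpoint (path made up):
# with open("/ckpts/jobA/chk-123/_metadata", "rb") as f:
#     for p in referenced_paths(f.read()):
#         print(p)

# Small self-contained demo with fabricated bytes:
demo = b"\x00\x12/ckpts/jobA/shared/sst-000042\x00\x07junk"
print(referenced_paths(demo))
```

Cross-checking any list produced this way against a successful test
restore is still necessary before deleting directories.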

Best,
Congxian

