Posted to dev@flink.apache.org by "ChangjiGuo (Jira)" <ji...@apache.org> on 2022/08/11 10:29:00 UTC

[jira] [Created] (FLINK-28927) When the checkpoint times out, the uploaded shared files are not cleaned up

ChangjiGuo created FLINK-28927:
----------------------------------

Summary: When the checkpoint times out, the uploaded shared files are not cleaned up
Key: FLINK-28927
URL: https://issues.apache.org/jira/browse/FLINK-28927
Project: Flink
Issue Type: Bug
Components: Runtime / State Backends
Environment: Flink-1.11
Reporter: ChangjiGuo

If a checkpoint times out, the task cancels the AsyncCheckpointRunnable thread and does some cleanup work, including the following:
* Cancel all AsyncSnapshotCallable threads.
* If a thread has already finished, all of its state objects are cleaned up.
* If a thread has not finished, it is interrupted.
* Close snapshotCloseableRegistry.
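
The cancel-or-discard behaviour in the list above can be sketched with plain java.util.concurrent types. This is only an illustration of the semantics, not Flink's code: CancelSketch, StateObject and cancelAndDiscard are hypothetical names, and the real logic lives in the classes named above.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.FutureTask;
import java.util.concurrent.atomic.AtomicBoolean;

final class CancelSketch {

    /** Hypothetical stand-in for a snapshot result that owns files. */
    static final class StateObject {
        final AtomicBoolean discarded = new AtomicBoolean(false);

        void discard() {
            discarded.set(true);
        }
    }

    /**
     * Cancels the task. Returns true if the task was cancelled before it
     * finished (a running worker thread is interrupted). Returns false if
     * the task had already finished; in that case its result is discarded.
     */
    static boolean cancelAndDiscard(FutureTask<StateObject> task) {
        if (task.cancel(true) || task.isCancelled()) {
            return true;
        }
        try {
            // The task is already done, so get() returns without blocking.
            StateObject result = task.get();
            if (result != null) {
                result.discard();
            }
        } catch (ExecutionException e) {
            // The task failed, so there is no result to discard.
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return false;
    }
}
```

Note that in the interrupted case nothing is discarded by this path, which is where the leftover files described below come from.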

In my case, the thread was interrupted while waiting for the file uploads to complete, but the files that had already been uploaded were not cleaned up.

RocksDBStateUploader.java
{code:java}
FutureUtils.waitForAll(futures.values()).get();
{code}

This call waits for all files to be uploaded. When the waiting thread is interrupted, the files that have already been uploaded are not cleaned up. The leftover files fall into two groups:
* Files that finished uploading before the thread was canceled.
* Files for which outputStream.closeAndGetHandle() was called while snapshotCloseableRegistry had not yet been closed.
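
One possible direction for a fix, sketched with plain CompletableFuture rather than Flink's FutureUtils: if the wait is interrupted or fails, register a discard action on every upload future, so that both groups of files above are removed. UploadCleanupSketch, UploadedFile and waitForAllOrCleanUp are hypothetical names used for illustration; this is not necessarily the fix I applied.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

final class UploadCleanupSketch {

    /** Hypothetical stand-in for the handle of an uploaded shared file. */
    static final class UploadedFile {
        final String name;
        volatile boolean deleted;

        UploadedFile(String name) {
            this.name = name;
        }

        void discard() {
            deleted = true; // a real handle would delete the remote file here
        }
    }

    /**
     * Waits for all uploads. If the wait is interrupted or an upload fails,
     * every upload that has completed, or completes later, is discarded
     * before the exception is rethrown.
     */
    static List<UploadedFile> waitForAllOrCleanUp(
            Collection<CompletableFuture<UploadedFile>> uploads) throws Exception {
        try {
            CompletableFuture.allOf(uploads.toArray(new CompletableFuture[0])).get();
        } catch (InterruptedException | ExecutionException e) {
            for (CompletableFuture<UploadedFile> upload : uploads) {
                // Runs immediately for uploads that already finished, and on
                // completion for uploads that finish after the interruption.
                // Failed or cancelled uploads produce no handle, so the
                // action is skipped for them.
                upload.thenAccept(UploadedFile::discard);
            }
            throw e;
        }
        List<UploadedFile> result = new ArrayList<>();
        for (CompletableFuture<UploadedFile> upload : uploads) {
            result.add(upload.join()); // all futures completed normally here
        }
        return result;
    }
}
```

Registering the action on the futures, instead of only iterating over the ones that are already done, is what covers the second group: an upload whose closeAndGetHandle() call returns after the interruption is still discarded.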

How to reproduce?
Shorten the checkpoint timeout so that the checkpoint fails, then check whether any files remain in the shared directory.

I tested on Flink-1.11, but I looked at the code on the latest branch and the same problem may exist there. I tried a fix and it has worked well so far.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)