Posted to issues@flink.apache.org by "Stephan Ewen (Jira)" <ji...@apache.org> on 2020/05/25 18:06:00 UTC
[jira] [Commented] (FLINK-15012) Checkpoint directory not cleaned up

    [ https://issues.apache.org/jira/browse/FLINK-15012?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17116169#comment-17116169 ] 

Stephan Ewen commented on FLINK-15012:
--------------------------------------

I think there is a big difference between the working/temp directories and the checkpoint directories.

The working/temp directories can be cleaned up after processes shut down, because no data in them will ever be needed.
The checkpoint directories may contain retained checkpoints or savepoints that are still relevant. I think we should not ever try to delete these with things like "shutdown hooks".

I understand that job cancellation should remove the job's empty parent checkpoint directories. That makes sense. And [~yunta] proposed an issue to fix this.
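
To illustrate the "remove only empty directories" behavior described above, here is a minimal sketch in plain Java. It is not Flink code and not the fix proposed in the linked issue: the class name {{EmptyCheckpointDirCleanup}} and both method names are hypothetical, and it assumes a local file system with the {{shared}} / {{taskowned}} layout shown in the issue description. It relies on {{Files.delete}} being non-recursive, so a directory that still holds a retained checkpoint or savepoint is left untouched.

```java
import java.io.IOException;
import java.nio.file.DirectoryNotEmptyException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper (not part of Flink): removes checkpoint directories only when empty. */
public final class EmptyCheckpointDirCleanup {

    /** Deletes dir only if it is an existing, empty directory; returns true if it was removed. */
    public static boolean deleteIfEmpty(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false; // missing, or a regular file: never touch it
        }
        try {
            Files.delete(dir); // non-recursive, so a non-empty directory is refused
            return true;
        } catch (DirectoryNotEmptyException e) {
            return false; // something is still in there, e.g. a retained checkpoint
        }
    }

    /** Tries shared/ and taskowned/ first, then the job's parent directory itself. */
    public static boolean cleanupAfterCancel(Path jobDir) throws IOException {
        deleteIfEmpty(jobDir.resolve("shared"));
        deleteIfEmpty(jobDir.resolve("taskowned"));
        return deleteIfEmpty(jobDir);
    }

    public static void main(String[] args) throws IOException {
        // Demo on a throwaway temp directory that mimics the layout after cancellation.
        Path jobDir = Files.createTempDirectory("job");
        Files.createDirectory(jobDir.resolve("shared"));
        Files.createDirectory(jobDir.resolve("taskowned"));
        System.out.println("removed job dir: " + cleanupAfterCancel(jobDir));
    }
}
```

The point of the sketch is that nothing is ever deleted recursively: if a {{chk-N}} directory or a savepoint is still present, the parent directory simply stays in place.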

I would question whether we should try to do anything about the {{stop-cluster.sh}} behavior. This is a forceful wipe of the cluster rather than a proper shutdown, so left-over data is to be expected. And, in my mind, the caution not to accidentally delete a still-needed checkpoint is more important than making the "hard stop" as nice as possible (cleanup-wise).


> Checkpoint directory not cleaned up
> -----------------------------------
>
>                 Key: FLINK-15012
>                 URL: https://issues.apache.org/jira/browse/FLINK-15012
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.1
>            Reporter: Nico Kruber
>            Assignee: Yun Tang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.12.0
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I started a Flink cluster with 2 TMs using {{start-cluster.sh}} and the following config (in addition to the default {{flink-conf.yaml}})
> {code:java}
> state.checkpoints.dir: file:///path/to/checkpoints/
> state.backend: rocksdb {code}
> After submitting a job with checkpoints enabled (every 5s), checkpoints show up, e.g.
> {code:java}
> bb969f842bbc0ecc3b41b7fbe23b047b/
> ├── chk-2
> │   ├── 238969e1-6949-4b12-98e7-1411c186527c
> │   ├── 2702b226-9cfc-4327-979d-e5508ab2e3d5
> │   ├── 4c51cb24-6f71-4d20-9d4c-65ed6e826949
> │   ├── e706d574-c5b2-467a-8640-1885ca252e80
> │   └── _metadata
> ├── shared
> └── taskowned {code}
> If I shut down the cluster via {{stop-cluster.sh}}, these files will remain on disk and not be cleaned up.
> In contrast, if I cancel the job, at least {{chk-2}} will be deleted, but the (empty) directories are still left behind.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)