Posted to dev@flink.apache.org by "Zhanghao Chen (Jira)" <ji...@apache.org> on 2022/12/27 13:41:00 UTC

[jira] [Created] (FLINK-30513) HA storage dir leaks on cluster termination

Zhanghao Chen created FLINK-30513:
-------------------------------------

             Summary: HA storage dir leaks on cluster termination 
                 Key: FLINK-30513
                 URL: https://issues.apache.org/jira/browse/FLINK-30513
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
    Affects Versions: 1.16.0, 1.15.0
            Reporter: Zhanghao Chen
         Attachments: image-2022-12-27-21-32-17-510.png

*Problem*

We found that the HA storage dir leaks on cluster termination for a Flink job with HA enabled. The following picture shows the HA storage dir (here on HDFS) of the cluster czh-flink-test-offline (running in application mode) after cancelling the job with flink-cancel. We are left with an empty dir, and too many empty dirs will greatly hurt the stability of the HDFS NameNode!

!image-2022-12-27-21-32-17-510.png|width=582,height=158!

Furthermore, in case the user chooses to retain the checkpoints on job termination, the completedCheckpoints files are leaked as well. Note that we no longer need the completedCheckpoints files, as we'll recover retained CPs directly from the CP data dir.

*Root Cause*

When we run AbstractHaServices#closeAndCleanupAllData(), we clean up the blob store but do not clean up the HA storage dir.

*Proposal*

Clean up the HA storage dir after cleaning up the blob store in AbstractHaServices#closeAndCleanupAllData().
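
The ordering proposed above could look roughly like the following sketch. This is not Flink's actual code: HaCleanupSketch and deleteRecursively are hypothetical names, the local filesystem (java.nio.file) stands in for HDFS, and the layout of a per-cluster HA dir containing a "blob" subdirectory is an assumption made for illustration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical stand-in for the cleanup logic; not Flink's AbstractHaServices.
final class HaCleanupSketch {

    // Deletes a directory tree, children first; a missing path is a no-op.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        List<Path> deepestFirst;
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse path order puts children before their parent directories.
            deepestFirst = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (Path p : deepestFirst) {
            Files.delete(p);
        }
    }

    // Assumed layout: <storageDir>/<clusterId>/blob holds the blob store data.
    static void closeAndCleanupAllData(Path clusterHaDir) throws IOException {
        // Step 1: the blob store cleanup that already happens, per the issue.
        deleteRecursively(clusterHaDir.resolve("blob"));
        // Step 2: the proposed addition, removing the per-cluster HA dir itself,
        // including any leftover completedCheckpoint files.
        deleteRecursively(clusterHaDir);
    }

    public static void main(String[] args) throws IOException {
        Path storageDir = Files.createTempDirectory("ha-storage");
        Path clusterDir = storageDir.resolve("czh-flink-test-offline");
        Files.createDirectories(clusterDir.resolve("blob"));
        Files.createFile(clusterDir.resolve("completedCheckpoint0001"));

        closeAndCleanupAllData(clusterDir);
        System.out.println("cluster dir still exists: " + Files.exists(clusterDir));
        Files.delete(storageDir);
    }
}
```

In this sketch only the per-cluster dir is removed and the parent storage dir is left alone, since other clusters may share it. A real fix would go through Flink's own file system abstraction and would need to decide how to handle failures in the second step.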



--
This message was sent by Atlassian Jira
(v8.20.10#820010)