Posted to dev@flink.apache.org by "ming li (Jira)" <ji...@apache.org> on 2021/08/31 14:18:00 UTC

[jira] [Created] (FLINK-24086) Do not re-register SharedStateRegistry to reduce the recovery time of the job

ming li created FLINK-24086:
-------------------------------

             Summary: Do not re-register SharedStateRegistry to reduce the recovery time of the job
                 Key: FLINK-24086
                 URL: https://issues.apache.org/jira/browse/FLINK-24086
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Checkpointing
            Reporter: ming li


At present, the {{CompletedCheckpointStore}} is recovered only when the {{JobManager}} starts, so it seems unnecessary to re-register the {{SharedStateRegistry}} every time a task restarts.


We ran into this in our production environment, where we discard part of the data and state so that only the failed task is restarted. We found that registering the {{SharedStateRegistry}} can take several seconds (thousands of tasks and dozens of TB of state). When a large number of tasks fail at the same time, the total can reach several minutes (number of failed tasks * several seconds).
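
As a rough illustration of the estimate above, the following sketch multiplies a per-registration cost by the number of failed tasks. The figures (60 failed tasks, 3 seconds each) and the helper class are assumptions made up for the example. They are not measurements from the job described here and not part of Flink.

```java
import java.time.Duration;

// Hypothetical helper, not part of Flink: estimates the cost of serial re-registration.
final class ReRegistrationCostEstimate {

    // Total time if every failed task triggers one full re-registration, one after another.
    static Duration totalCost(int failedTasks, Duration perRegistration) {
        return perRegistration.multipliedBy(failedTasks);
    }

    public static void main(String[] args) {
        // Assumed figures for illustration only: 60 failed tasks, 3 seconds per registration.
        Duration total = totalCost(60, Duration.ofSeconds(3));
        System.out.println(total.getSeconds() + " seconds = " + total.toMinutes() + " minutes");
    }
}
```

With these assumed inputs the total is 180 seconds, i.e. 3 minutes, which matches the "several minutes" order of magnitude.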

 

Therefore, if the {{SharedStateRegistry}} can be reused, the time for task recovery can be reduced.
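
The difference between rebuilding and reusing the registry can be sketched with a toy model. Everything below ({{ToySharedRegistry}}, {{ToyRestartStrategies}}, the handle names) is hypothetical and only mimics the idea of reference-counted shared state. It is not Flink's actual {{SharedStateRegistry}} API and not a proposed patch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy stand-in for a shared state registry; NOT Flink's SharedStateRegistry API.
final class ToySharedRegistry {
    private final Map<String, Integer> refCounts = new HashMap<>();
    private long registerCalls = 0;

    // Adds one reference to the shared state handle identified by handleKey.
    void register(String handleKey) {
        refCounts.merge(handleKey, 1, Integer::sum);
        registerCalls++;
    }

    long registerCalls() {
        return registerCalls;
    }

    int references(String handleKey) {
        return refCounts.getOrDefault(handleKey, 0);
    }
}

final class ToyRestartStrategies {

    // Rebuild: create a fresh registry and re-register every handle of every retained checkpoint.
    static ToySharedRegistry rebuild(List<List<String>> retainedCheckpoints) {
        ToySharedRegistry fresh = new ToySharedRegistry();
        for (List<String> handles : retainedCheckpoints) {
            for (String handle : handles) {
                fresh.register(handle);
            }
        }
        return fresh;
    }

    // Reuse: keep the existing instance, so no handle is registered again.
    static ToySharedRegistry reuse(ToySharedRegistry existing) {
        return existing;
    }

    public static void main(String[] args) {
        // Two assumed checkpoints that share the handle "sst-2".
        List<List<String>> checkpoints = List.of(
                List.of("sst-1", "sst-2"),
                List.of("sst-2", "sst-3"));

        ToySharedRegistry registry = rebuild(checkpoints); // done once, e.g. at startup
        long before = registry.registerCalls();

        registry = reuse(registry); // simulated task restart
        System.out.println("extra register calls on restart: "
                + (registry.registerCalls() - before));
    }
}
```

In this model the rebuild path costs one register call per handle per retained checkpoint on every restart, while the reuse path costs none. Whether reuse is safe in Flink depends on the registry contents still matching the {{CompletedCheckpointStore}}, which is the assumption this issue rests on.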



--
This message was sent by Atlassian Jira
(v8.3.4#803005)