Posted to issues@flink.apache.org by "Flink Jira Bot (Jira)" <ji...@apache.org> on 2021/10/29 22:40:01 UTC

[jira] [Updated] (FLINK-20912) Increase Log and Metric: Time consumed by Checkpoint Restore

     [ https://issues.apache.org/jira/browse/FLINK-20912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-20912:
-----------------------------------
    Labels: auto-deprioritized-major stale-minor  (was: auto-deprioritized-major)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue has been marked as Minor but is unassigned, and neither it nor its Sub-Tasks have been updated for 180 days. I have gone ahead and marked it "stale-minor". If this ticket is still Minor, please either assign yourself or give an update. Afterwards, please remove the label, or in 7 days the issue will be deprioritized.


> Increase Log and Metric: Time consumed by Checkpoint Restore
> ------------------------------------------------------------
>
>                 Key: FLINK-20912
>                 URL: https://issues.apache.org/jira/browse/FLINK-20912
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Checkpointing, Runtime / State Backends
>    Affects Versions: 1.12.1, 1.13.0
>            Reporter: future
>            Priority: Minor
>              Labels: auto-deprioritized-major, stale-minor
>
> In a production environment, jobs with strict SLAs need to restart quickly after a failover. Checkpoint restore is an important part of task startup, so when a Flink task starts slowly, logs and metrics covering the restore should be available to help with troubleshooting.
> For example, ByteDance shared at FFA 2020 that they parallelized OperatorState restore. Without such metrics, there are two problems:
> 1. The problem is hard to locate. If a task starts slowly, it is unclear whether a slow checkpoint restore is the root cause.
> 2. The benefit of an optimization cannot be quantified. How much faster did restore become?
> I believe that many companies have already added such metrics to their internal Flink versions.
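
As an illustration of the request above, here is a minimal sketch of timing a restore step, logging the duration, and keeping the last value around so a metric gauge could report it. This is not Flink code and not the approach any company actually uses: the class `RestoreTimer` and its methods are hypothetical names, and the restore itself is passed in as a plain `Runnable` standing in for whatever the state backend does.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

/** Hypothetical helper for illustration only; not part of Flink. */
final class RestoreTimer {
    private static final Logger LOG = Logger.getLogger(RestoreTimer.class.getName());

    // Duration of the most recent restore in milliseconds; -1 until one has run.
    private final AtomicLong lastRestoreMillis = new AtomicLong(-1L);

    /**
     * Runs the given restore action and records its wall-clock duration.
     * The duration is recorded and logged even if the action throws,
     * so slow failed restores are visible too.
     */
    void timeRestore(String operatorName, Runnable restoreAction) {
        long start = System.nanoTime();
        try {
            restoreAction.run();
        } finally {
            long elapsed = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
            lastRestoreMillis.set(elapsed);
            LOG.info(() -> "State restore for " + operatorName + " took " + elapsed + " ms");
        }
    }

    /** Value that a metric gauge could expose. */
    long getLastRestoreMillis() {
        return lastRestoreMillis.get();
    }
}
```

In a real job, the value returned by `getLastRestoreMillis()` could be reported through a gauge registered on the operator's metric group. Comparing that value before and after an optimization such as parallelized restore would address the second point in the description.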



--
This message was sent by Atlassian Jira
(v8.3.4#803005)