Posted to dev@flink.apache.org by "Woods, Jessica Hui" <je...@campus.tu-berlin.de> on 2020/02/18 13:16:23 UTC

checkpoint total restart time

Hi,

I am working with Apache Flink and would like to know how to estimate the total time an application spends in recovery, including the input stream "catch-up" after checkpoint recovery. Specifically, I am interested in the time needed to restore the state plus the time spent in the catch-up phase. Since the application's source tasks are reset to an earlier input position after recovery, the catch-up covers both the data the application had already processed between that position and the failure and the data that accumulated while the application was down.
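To make the quantity concrete, here is a rough back-of-the-envelope sketch of the estimate I have in mind. It is not based on anything in the Flink codebase: the function name, all parameter names, and the assumption of constant input and processing rates are my own simplifications for illustration.

```python
def estimate_recovery_seconds(restore_s, downtime_s, since_checkpoint_s,
                              input_rate, max_rate):
    """Rough total recovery time in seconds under a constant-rate model.

    All parameters are hypothetical and would have to be measured:
      restore_s          -- time to restore state from the checkpoint
      downtime_s         -- time from the failure until the restore begins
      since_checkpoint_s -- time between the last checkpoint and the failure
      input_rate         -- records/s arriving at the sources
      max_rate           -- records/s the job can process at full speed
    """
    if max_rate <= input_rate:
        raise ValueError("the job only catches up if max_rate > input_rate")

    # Records to work through after the restart: everything since the
    # restored input position, plus what arrived while the job was not running.
    backlog = input_rate * (since_checkpoint_s + downtime_s + restore_s)

    # While catching up, new records keep arriving, so the backlog
    # shrinks at (max_rate - input_rate) records per second.
    catch_up_s = backlog / (max_rate - input_rate)

    return downtime_s + restore_s + catch_up_s


# Example with made-up numbers: 30 s restore, 10 s downtime, failure 60 s
# after the last checkpoint, 1000 records/s in, 1500 records/s capacity.
print(estimate_recovery_seconds(30, 10, 60, 1000, 1500))  # 240.0
```

In practice the rates are unlikely to be constant, so this would only be a first approximation.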

My question is, what important considerations should I take into account when estimating this time and which portions of the Apache Flink codebase would be most helpful?

Thanks!