Posted to user@flink.apache.org by Dongwon Kim <ea...@gmail.com> on 2022/11/22 08:42:35 UTC
[Flink K8s operator] HA metadata not available to restore from last state
Hi,
While using the last-state upgrade mode with flink-k8s-operator-1.2.0 and
flink-1.14.3, we occasionally run into the following error:
Status:
> Cluster Info:
> Flink - Revision: 98997ea @ 2022-01-08T23:23:54+01:00
> Flink - Version: 1.14.3
> Error: HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
> Job Manager Deployment Status: ERROR
> Job Status:
> Job Id: e8dd04ea4b03f1817a4a4b9e5282f433
> Job Name: flinktest
> Savepoint Info:
> Last Periodic Savepoint Timestamp: 0
> Savepoint History:
> Trigger Id:
> Trigger Timestamp: 0
> Trigger Type: UNKNOWN
> Start Time: 1668660381400
> State: RECONCILING
> Update Time: 1668994910151
> Reconciliation Status:
> Last Reconciled Spec: ...
> Reconciliation Timestamp: 1668660371853
> State: DEPLOYED
> Task Manager:
> Label Selector: component=taskmanager,app=flinktest
> Replicas: 1
> Events:
> Type     Reason            Age                 From                  Message
> ----     ------            ----                ----                  -------
> Normal   JobStatusChanged  30m                 Job                   Job status changed from RUNNING to RESTARTING
> Normal   JobStatusChanged  29m                 Job                   Job status changed from RESTARTING to CREATED
> Normal   JobStatusChanged  28m                 Job                   Job status changed from CREATED to RESTARTING
> Warning  Missing           26m                 JobManagerDeployment  Missing JobManager deployment
> Warning  RestoreFailed     9s (x106 over 26m)  JobManagerDeployment  HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
> Normal   Submit            9s (x106 over 26m)  JobManagerDeployment  Starting deployment
We're happy with the last-state mode most of the time, but we occasionally
run into this error.
We found that the problem is not easy to reproduce; we tried killing JMs and
TMs and even shutting down the nodes on which the JMs and TMs are running.
We also checked that the file size is not zero.
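For reference, a minimal FlinkDeployment using the last-state upgrade mode with Kubernetes HA looks roughly like the sketch below; the image, jar URI and storage paths are placeholders rather than our actual values:
apiVersion: flink.apache.org/v1beta1
kind: FlinkDeployment
metadata:
  name: flinktest
spec:
  image: flink:1.14.3                                              # placeholder image
  flinkVersion: v1_14
  serviceAccount: flink
  flinkConfiguration:
    # Kubernetes HA is what last-state mode relies on to find the latest checkpoint
    high-availability: kubernetes
    high-availability.storageDir: s3://my-bucket/flink-ha          # placeholder
    state.checkpoints.dir: s3://my-bucket/flink-checkpoints        # placeholder
    execution.checkpointing.interval: "60s"
  jobManager:
    resource:
      memory: "2048m"
      cpu: 1
  taskManager:
    resource:
      memory: "2048m"
      cpu: 1
  job:
    jarURI: local:///opt/flink/usrlib/flinktest.jar                # placeholder
    parallelism: 2
    upgradeMode: last-state
    state: running
Last-state mode restores purely from this HA metadata, so once the configmaps are gone the operator cannot resume the job automatically, which is exactly the error above.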
Thanks,
Dongwon
Re: [Flink K8s operator] HA metadata not available to restore from last state
Posted by Dongwon Kim <ea...@gmail.com>.
Hi Gyula :-)
Okay, we're gonna upgrade to 1.15 and see what happens.
Thanks a lot for the quick feedback and the detailed explanation!
Best,
Dongwon
Re: [Flink K8s operator] HA metadata not available to restore from last state
Posted by Gyula Fóra <gy...@gmail.com>.
Hi Dongwon!
This error mostly occurs when using Flink 1.14 and the Flink cluster goes
into a terminal state. If a Flink job ends up FAILED or FINISHED (for example
because it exhausted its restart strategy), the Flink 1.14 cluster shuts
itself down and removes the HA metadata.
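For illustration, the restart strategy that gets exhausted here is simply whatever is configured for the job, for example a fixed-delay setup in flinkConfiguration (the values below are examples only):
    restart-strategy: fixed-delay
    restart-strategy.fixed-delay.attempts: "10"
    restart-strategy.fixed-delay.delay: "30s"
Once the configured attempts are used up, the job transitions to a terminal FAILED state and, on 1.14, the shutdown and HA metadata cleanup described above kick in.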
In these cases the operator only sees that the cluster has completely
disappeared and that there is no HA metadata, so it throws the error you
mentioned. It does not know what happened and has no way to recover the
checkpoint information.
This is fixed in Flink 1.15, where the JobManager does not shut down even
after terminal FAILED/FINISHED states. This allows the operator to observe
the terminal state and recover the job even if the HA metadata was removed.
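Roughly speaking, this behaviour change is tied to the new 1.15 options below; the operator sets them automatically when HA is enabled, so they are shown only for context (double-check the exact names against the 1.15 docs):
    # keep the JobManager running after the application reaches a terminal state (1.15+)
    execution.shutdown-on-application-finish: "false"
    # register jobs that fail during submission so their final status can still be observed (1.15+)
    execution.submit-failed-job-on-application-error: "true"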
To summarize, this is mostly caused by Flink 1.14 behaviour that the operator
cannot control. Upgrading to 1.15 makes this much more robust and should
eliminate most of these cases.
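And if you hit this again before upgrading, the manual restore the error message asks for essentially means: find the latest completed checkpoint in your checkpoint storage, delete the errored FlinkDeployment (verifying the HA configmaps are really gone), then recreate it with the job pointed at that checkpoint. The job section of the recreated spec would look roughly like this, with the checkpoint path being a placeholder you look up yourself:
  job:
    jarURI: local:///opt/flink/usrlib/flinktest.jar
    parallelism: 2
    upgradeMode: last-state
    # placeholder: latest completed checkpoint, e.g. found under state.checkpoints.dir
    initialSavepointPath: s3://my-bucket/flink-checkpoints/<job-id>/chk-1234
    state: running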
Cheers,
Gyula