Posted to user@flink.apache.org by Dongwon Kim <ea...@gmail.com> on 2022/11/22 08:42:35 UTC

[Flink K8s operator] HA metadata not available to restore from last state

Hi,

While using the last-state upgrade mode with flink-k8s-operator-1.2.0 and
flink-1.14.3, we occasionally run into the following error:

Status:
>   Cluster Info:
>     Flink - Revision:             98997ea @ 2022-01-08T23:23:54+01:00
>     Flink - Version:              1.14.3
>   Error:                          HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
>   Job Manager Deployment Status:  ERROR
>   Job Status:
>     Job Id:    e8dd04ea4b03f1817a4a4b9e5282f433
>     Job Name:  flinktest
>     Savepoint Info:
>       Last Periodic Savepoint Timestamp:  0
>       Savepoint History:
>       Trigger Id:
>       Trigger Timestamp:  0
>       Trigger Type:       UNKNOWN
>     Start Time:           1668660381400
>     State:                RECONCILING
>     Update Time:          1668994910151
>   Reconciliation Status:
>     Last Reconciled Spec:  ...
>     Reconciliation Timestamp:  1668660371853
>     State:                     DEPLOYED
>   Task Manager:
>     Label Selector:  component=taskmanager,app=flinktest
>     Replicas:        1
> Events:
>   Type     Reason            Age                 From                  Message
>   ----     ------            ----                ----                  -------
>   Normal   JobStatusChanged  30m                 Job                   Job status changed from RUNNING to RESTARTING
>   Normal   JobStatusChanged  29m                 Job                   Job status changed from RESTARTING to CREATED
>   Normal   JobStatusChanged  28m                 Job                   Job status changed from CREATED to RESTARTING
>   Warning  Missing           26m                 JobManagerDeployment  Missing JobManager deployment
>   Warning  RestoreFailed     9s (x106 over 26m)  JobManagerDeployment  HA metadata not available to restore from last state. It is possible that the job has finished or terminally failed, or the configmaps have been deleted. Manual restore required.
>   Normal   Submit            9s (x106 over 26m)  JobManagerDeployment  Starting deployment
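
For reference, the Start Time, Update Time and Reconciliation Timestamp
values in the status above look like Unix epoch milliseconds. Assuming that
is what they are, a small Python sketch can turn them into readable UTC
times. The helper name from_epoch_ms is made up for this example.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def from_epoch_ms(ms):
    """Convert epoch milliseconds to an aware UTC datetime (integer-exact)."""
    return EPOCH + timedelta(milliseconds=ms)


# Values copied from the status output; the epoch-millisecond
# interpretation is an assumption.
reconciled = from_epoch_ms(1668660371853)  # Reconciliation Timestamp
started = from_epoch_ms(1668660381400)     # Job Status -> Start Time
updated = from_epoch_ms(1668994910151)     # Job Status -> Update Time

print("reconciled:", reconciled.isoformat())
print("started:   ", started.isoformat())
print("updated:   ", updated.isoformat())
print("start -> last update:", updated - started)
```

Read this way, the job started on 2022-11-17 and its status was last
updated on 2022-11-21, a little under four days later.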


We're happy with the last-state mode most of the time, but we hit this
error occasionally.

We found that the problem is not easy to reproduce: killing JMs and TMs,
and even shutting down the nodes they run on, did not trigger it.

We also checked that the file size is not zero.

Thanks,

Dongwon

Re: [Flink K8s operator] HA metadata not available to restore from last state

Posted by Dongwon Kim <ea...@gmail.com>.

Hi Gyula :-)

Okay, we're gonna upgrade to 1.15 and see what happens.

Thanks a lot for the quick feedback and the detailed explanation!

Best,

Dongwon


On Tue, Nov 22, 2022 at 5:57 PM Gyula Fóra <gy...@gmail.com> wrote:

> Hi Dongwon!
>
> This error mostly occurs when using Flink 1.14 and the Flink cluster goes
> into a terminal state. If a Flink job is FAILED/FINISHED (such as it
> exhausted the retry strategy), in Flink 1.14 the cluster shuts itself down
> and removes the HA metadata.
>
> In these cases the operator will only see that the cluster completely
> disappeared and there is no HA metadata and it will throw the error you
> mentioned. It does not know what happened and doesn't have any way to
> recover checkpoint information.
>
> This is fixed in Flink 1.15 where even after terminal FAILED/FINISHED
> states, the jobmanager would not shut down. This allows the operator to
> observe this terminal state and actually recover the job even if the HA
> metadata was removed.
>
> To summarize, this is mostly caused by Flink 1.14 behaviour that the
> operator cannot control. Upgrading to 1.15 allows much more robustness and
> should eliminate most of these cases.
>
> Cheers,
> Gyula

Re: [Flink K8s operator] HA metadata not available to restore from last state

Posted by Gyula Fóra <gy...@gmail.com>.

Hi Dongwon!

This error mostly occurs when using Flink 1.14 and the Flink cluster goes
into a terminal state. If a Flink job is FAILED/FINISHED (for example,
because it exhausted its restart strategy), in Flink 1.14 the cluster shuts
itself down and removes the HA metadata.

In these cases the operator only sees that the cluster has completely
disappeared and that there is no HA metadata, so it reports the error you
mentioned. It does not know what happened and has no way to recover the
checkpoint information.

This is fixed in Flink 1.15, where the jobmanager does not shut down even
after the job reaches a terminal FAILED/FINISHED state. This allows the
operator to observe the terminal state and actually recover the job even
if the HA metadata was removed.
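
The two cases above can be pictured with a toy Python model. This is only
a sketch of the behaviour described in this message, not the operator's
actual code; Observation, Action and decide are names made up for the
illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

TERMINAL_STATES = {"FAILED", "FINISHED"}


class Action(Enum):
    NOTHING = "nothing to restore"
    RESTORE_FROM_HA = "redeploy and restore from HA metadata"
    RESTORE_FROM_OBSERVED = "recover using the observed terminal state"
    MANUAL_RESTORE = "manual restore required"


@dataclass
class Observation:
    """What a (hypothetical) operator can see during reconciliation."""
    jobmanager_present: bool
    ha_metadata_present: bool
    job_state: Optional[str] = None  # only readable while a JobManager exists


def decide(obs):
    if obs.jobmanager_present:
        # Flink 1.15 case: the JobManager stays up after FAILED/FINISHED,
        # so the terminal state can still be observed.
        if obs.job_state in TERMINAL_STATES:
            return Action.RESTORE_FROM_OBSERVED
        return Action.NOTHING
    if obs.ha_metadata_present:
        # JobManager is gone but the HA configmaps still exist.
        return Action.RESTORE_FROM_HA
    # Flink 1.14 case: the cluster shut itself down and removed the HA
    # metadata, so nothing is left to restore from.
    return Action.MANUAL_RESTORE


flink_1_14 = Observation(jobmanager_present=False, ha_metadata_present=False)
flink_1_15 = Observation(jobmanager_present=True, ha_metadata_present=False,
                         job_state="FAILED")
print(decide(flink_1_14).name)
print(decide(flink_1_15).name)
```

The last branch is the situation in the reported status: no JobManager
deployment, no HA metadata, and therefore the "Manual restore required"
error.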

To summarize, this is mostly caused by Flink 1.14 behaviour that the
operator cannot control. Upgrading to 1.15 allows much more robustness and
should eliminate most of these cases.
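
As a rough aid for the upgrade, a FlinkDeployment spec can be checked for
this combination before it is applied. The sketch below assumes the spec
is already loaded as a Python dict; spec.flinkVersion, spec.job.upgradeMode
and spec.flinkConfiguration are FlinkDeployment fields, while the function
name and the warning texts are made up for the example.

```python
def last_state_warnings(spec):
    """Return warnings for a FlinkDeployment spec that uses last-state upgrades."""
    warnings = []
    if spec.get("job", {}).get("upgradeMode") != "last-state":
        return warnings  # the issue in this thread only concerns last-state

    conf = spec.get("flinkConfiguration", {})
    if not conf.get("high-availability"):
        warnings.append(
            "last-state relies on HA metadata, but 'high-availability' is not set"
        )
    if spec.get("flinkVersion") == "v1_14":
        warnings.append(
            "Flink 1.14 removes HA metadata when the job reaches FAILED/FINISHED; "
            "consider upgrading to 1.15"
        )
    return warnings


# Hypothetical spec resembling the setup in this thread.
spec = {
    "flinkVersion": "v1_14",
    "flinkConfiguration": {
        "high-availability":
            "org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory",
    },
    "job": {"upgradeMode": "last-state"},
}

for warning in last_state_warnings(spec):
    print("WARNING:", warning)
```

Changing flinkVersion to v1_15 in this example makes the check return no
warnings.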

Cheers,
Gyula
