Posted to user@flink.apache.org by "ChangZhuo Chen (陳昌倬)" <cz...@czchen.org> on 2022/04/29 10:15:41 UTC

flink operator sometimes cannot start jobmanager after upgrading

Hi,

We found that the Flink operator [0] sometimes cannot start the JobManager
after upgrading a FlinkDeployment. We need to recreate the FlinkDeployment
to fix the problem. Has anyone else seen this issue?

The following is a redacted log from the Flink operator. After the status
becomes MISSING, it stays in MISSING for at least 15 minutes.


    2022-04-29 09:41:15,141 o.a.f.c.d.a.c.ApplicationClusterDeployer [INFO ][namespace/flink-deployment-name] Submitting application in 'Application Mode'.
    2022-04-29 09:41:15,145 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][namespace/flink-deployment-name] The derived from fraction jvm overhead memory (2.400gb (2576980416 bytes)) is greater than its max value 1024.000mb (1073741824 bytes), max value will be used instead
    2022-04-29 09:41:15,146 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][namespace/flink-deployment-name] The derived from fraction jvm overhead memory (5.200gb (5583457568 bytes)) is greater than its max value 1024.000mb (1073741824 bytes), max value will be used instead
    2022-04-29 09:41:15,146 o.a.f.r.u.c.m.ProcessMemoryUtils [INFO ][namespace/flink-deployment-name] The derived from fraction network memory (5.050gb (5422396292 bytes)) is greater than its max value 4.000gb (4294967296 bytes), max value will be used instead
    2022-04-29 09:41:15,237 o.a.f.k.u.KubernetesUtils      [INFO ][namespace/flink-deployment-name] Kubernetes deployment requires a fixed port. Configuration high-availability.jobmanager.port will be set to 6123
    2022-04-29 09:41:15,508 o.a.f.k.KubernetesClusterDescriptor [WARN ][namespace/flink-deployment-name] Please note that Flink client operations(e.g. cancel, list, stop, savepoint, etc.) won't work from outside the Kubernetes cluster since 'kubernetes.rest-service.exposed.type' has been set to ClusterIP.
    2022-04-29 09:41:15,508 o.a.f.k.KubernetesClusterDescriptor [INFO ][namespace/flink-deployment-name] Create flink application cluster flink-deployment-name successfully, JobManager Web Interface: http://flink-deployment-name.namespace:8081
    2022-04-29 09:41:15,510 o.a.f.k.o.s.FlinkService       [INFO ][namespace/flink-deployment-name] Application cluster successfully deployed
    2022-04-29 09:41:15,583 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:41:15,684 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:41:15,686 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: DEPLOYING
    2022-04-29 09:41:15,792 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager is being deployed
    2022-04-29 09:41:15,792 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:41:20,795 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:41:20,797 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: DEPLOYING
    2022-04-29 09:41:20,896 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager is being deployed
    2022-04-29 09:41:20,897 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:41:25,899 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:41:25,901 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: DEPLOYING
    2022-04-29 09:41:25,997 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager is being deployed
    2022-04-29 09:41:25,998 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:41:29,518 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:41:29,520 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: DEPLOYING
    2022-04-29 09:41:30,631 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager is being deployed
    2022-04-29 09:41:30,631 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:41:35,639 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:41:35,640 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: DEPLOYING
    2022-04-29 09:41:35,756 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager is being deployed
    2022-04-29 09:41:35,756 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:41:40,759 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:41:40,760 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: DEPLOYING
    2022-04-29 09:41:40,864 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager is being deployed
    2022-04-29 09:41:40,864 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:41:45,867 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:41:45,868 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: DEPLOYING
    2022-04-29 09:41:45,870 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager deployment port is ready, waiting for the Flink REST API...
    2022-04-29 09:41:45,870 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:41:55,901 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:41:55,902 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: DEPLOYED_NOT_READY
    2022-04-29 09:41:55,902 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager deployment is ready
    2022-04-29 09:41:55,902 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing job status
    2022-04-29 09:41:56,294 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] No job found on cluster yet
    2022-04-29 09:41:56,294 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:41:58,443 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:41:58,445 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing job status
    2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver        [ERROR][namespace/flink-deployment-name] Exception while listing jobs
    2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: READY
    2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager deployment does not exist
    2022-04-29 09:42:10,490 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:42:25,521 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:42:25,522 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: MISSING
    2022-04-29 09:42:25,522 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager deployment does not exist
    2022-04-29 09:42:25,522 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    2022-04-29 09:42:40,526 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 09:42:40,527 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: MISSING
    2022-04-29 09:42:40,527 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager deployment does not exist
    2022-04-29 09:42:40,527 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
    ...

    2022-04-29 10:00:55,862 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Starting reconciliation
    2022-04-29 10:00:55,863 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] Observing JobManager deployment. Previous status: MISSING
    2022-04-29 10:00:55,863 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager deployment does not exist
    2022-04-29 10:00:55,863 o.a.f.k.o.c.FlinkDeploymentController [INFO ][namespace/flink-deployment-name] Reconciliation successfully completed
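
To put a number on how long the deployment was reported as missing, a log in the layout above can be scanned with a short script. This is only a minimal sketch: the names LINE_RE, MISSING_MSG and missing_window are made up for illustration, and the regular expression assumes the exact line layout shown in this excerpt.

```python
import re
from datetime import datetime

# Assumed layout, taken from the excerpt above:
# "<date> <time>,<millis> <logger> [LEVEL][namespace/name] <message>"
LINE_RE = re.compile(
    r"^\s*(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3})\s+"
    r"(?P<logger>\S+)\s+\[(?P<level>\w+)\s*\]\[(?P<resource>[^\]]+)\]\s+"
    r"(?P<msg>.*)$"
)

# Message the JobObserver logs while the JobManager deployment is absent.
MISSING_MSG = "JobManager deployment does not exist"


def missing_window(lines):
    """Return (first_seen, last_seen) for the 'does not exist' message, or None."""
    seen = []
    for line in lines:
        m = LINE_RE.match(line)
        if m and m.group("msg").strip() == MISSING_MSG:
            seen.append(datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S,%f"))
    if not seen:
        return None
    return seen[0], seen[-1]


# Three lines copied from the log above.
sample = """\
    2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver        [ERROR][namespace/flink-deployment-name] Exception while listing jobs
    2022-04-29 09:42:10,489 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager deployment does not exist
    2022-04-29 10:00:55,863 o.a.f.k.o.o.JobObserver        [INFO ][namespace/flink-deployment-name] JobManager deployment does not exist
""".splitlines()

first, last = missing_window(sample)
print(last - first)  # 0:18:45.374000
```

For the excerpt above this gives a window of a bit under 19 minutes between the first and the last "does not exist" message.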


[0] https://github.com/apache/flink-kubernetes-operator


-- 
ChangZhuo Chen (陳昌倬) czchen@{czchen,debian}.org
http://czchen.info/
Key fingerprint = BA04 346D C2E1 FE63 C790  8793 CC65 B0CD EC27 5D5B

Re: flink operator sometimes cannot start jobmanager after upgrading

Posted by Yang Wang <da...@gmail.com>.
I am afraid we do not handle the scenario in which the JobManager deployment
is deleted externally.

Best,
Yang

On Mon, May 2, 2022 at 4:52 PM Őrhidi Mátyás <ma...@gmail.com> wrote:

> I filed a Jira for tracking this issue:
> https://issues.apache.org/jira/browse/FLINK-27468
>
> On Mon, May 2, 2022 at 10:31 AM Őrhidi Mátyás <ma...@gmail.com>
> wrote:
>
>> This can be reproduced simply by deleting the Kubernetes deployment. The
>> operator cannot recover from this state automatically; defining a
>> restartNonce on the deployment should recover the state.
>>
>> Regards,
>> Matyas
>>
>> On Mon, May 2, 2022 at 10:00 AM Márton Balassi <ba...@gmail.com>
>> wrote:
>>
>>> Hi ChangZhuo,
>>>
>>> Thanks for reporting this. I think I have just run into this myself too.
>>> I will try to reproduce it, but I do not fully understand it yet. If anyone
>>> has a way to reproduce it, that would be more than welcome. :-)

Re: flink operator sometimes cannot start jobmanager after upgrading

Posted by Őrhidi Mátyás <ma...@gmail.com>.
I filed a Jira for tracking this issue:
https://issues.apache.org/jira/browse/FLINK-27468

On Mon, May 2, 2022 at 10:31 AM Őrhidi Mátyás <ma...@gmail.com>
wrote:

> This can be reproduced simply by deleting the Kubernetes deployment. The
> operator cannot recover from this state automatically; defining a
> restartNonce on the deployment should recover the state.
>
> Regards,
> Matyas
>
> On Mon, May 2, 2022 at 10:00 AM Márton Balassi <ba...@gmail.com>
> wrote:
>
>> Hi ChangZhuo,
>>
>> Thanks for reporting this. I think I have just run into this myself too.
>> I will try to reproduce it, but I do not fully understand it yet. If anyone
>> has a way to reproduce it, that would be more than welcome. :-)

Re: flink operator sometimes cannot start jobmanager after upgrading

Posted by Őrhidi Mátyás <ma...@gmail.com>.
This can be reproduced simply by deleting the Kubernetes deployment. The
operator cannot recover from this state automatically, but defining a
restartNonce on the FlinkDeployment should recover it.
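
A minimal sketch of that workaround, assuming the FlinkDeployment spec exposes
restartNonce as an integer field at spec.restartNonce and that changing it to
any value other than the current one triggers a restart. The helper name
restart_nonce_patch is hypothetical, and the resource name and namespace are
the redacted placeholders from the log. The script only builds a JSON merge
patch and prints a kubectl command; it does not touch the cluster.

```python
import json
from typing import Optional


def restart_nonce_patch(current_nonce: Optional[int]) -> str:
    """Return a JSON merge patch that sets spec.restartNonce to a new value.

    Assumption: any value different from the current one is enough, so the
    nonce is simply incremented (or set to 1 if it was never defined).
    """
    new_nonce = 1 if current_nonce is None else current_nonce + 1
    return json.dumps({"spec": {"restartNonce": new_nonce}})


if __name__ == "__main__":
    # "flink-deployment-name" and "namespace" are placeholders taken from
    # the redacted log, not real resource names.
    patch = restart_nonce_patch(None)
    print(
        "kubectl patch flinkdeployment flink-deployment-name "
        f"-n namespace --type merge -p '{patch}'"
    )
```

Editing the FlinkDeployment manifest and re-applying it should have the same
effect; the patch form is shown only because it changes a single field.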

Regards,
Matyas


Re: flink operator sometimes cannot start jobmanager after upgrading

Posted by Márton Balassi <ba...@gmail.com>.
Hi ChangZhuo,

Thanks for reporting this. I think I have just run into it myself, too.
I will try to reproduce it, but I do not fully understand it yet. If anyone
has a way to reproduce it, that would be more than welcome. :-)
