Posted to dev@flink.apache.org by Xingcan Cui <xi...@gmail.com> on 2022/11/16 15:19:08 UTC

Argo CD health check for FlinkDeployment

Hi all,

We are exploring Argo CD to manage `FlinkDeployment` resources but noticed
that the health checking for it doesn't work properly.

To give you some context, Argo CD uses Lua scripts to check some
state-related fields and map them to three status values: "Healthy",
"Progressing" and "Degraded". The current implementation
<https://github.com/argoproj/argo-cd/pull/9300> uses some legacy fields
(e.g., status.reconciliationStatus.success) that were removed
<https://github.com/apache/flink-kubernetes-operator/pull/165/files#diff-77c3de65b7bd2db04eeeae370a85cec77f7d7eb22ef801ef11305ede88cb315a>
a long time ago. As a result, users always get the "Progressing" status.

To fix the issue, we plan to re-implement the health checking logic. We
have three questions:

1. Is it reasonable to simply use "obj.status.jobStatus.state" as the
indicator, i.e., map "RUNNING" to "Healthy", map "FAILING" and "FAILED" to
"Degraded", and map the remaining states to "Progressing"?
2. I know the Flink-K8s-operator project is still in active development.
Given that the health checking logic is coupled with the state fields, I'm
curious if they are stable now.
3. Can we apply the same logic to "FlinkSessionJob"?
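
For illustration, the mapping proposed in question 1 could be sketched as
follows. This is only a Python sketch of the logic; the actual Argo CD check
would be a Lua script. The function name `job_health` is hypothetical, and
the sketch assumes the state values are the upper-case Flink job status
names (compared case-insensitively here to be safe).

```python
from typing import Any, Mapping

# Assumption: these are the job states that should surface as "Degraded".
DEGRADED_STATES = {"FAILING", "FAILED"}


def job_health(obj: Mapping[str, Any]) -> str:
    """Map obj.status.jobStatus.state to one of the three health values."""
    status = obj.get("status") or {}
    job_status = status.get("jobStatus") or {}
    # A missing or empty state falls through to "Progressing".
    state = str(job_status.get("state") or "").upper()
    if state == "RUNNING":
        return "Healthy"
    if state in DEGRADED_STATES:
        return "Degraded"
    return "Progressing"
```

With this mapping, a resource that has no status yet is reported as
"Progressing" rather than "Degraded".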

Thanks,
Xingcan

Re: Argo CD health check for FlinkDeployment

Posted by Xingcan Cui <xi...@gmail.com>.

Hi Gyula,

Thanks for the explanation!

The distinction between Flink jobs and FlinkDeployments makes sense! I'll
try to make some changes to Argo CD, and I hope to get a review from you
or other Flink-K8s-op contributors afterwards.

Best,
Xingcan

On Wed, Nov 16, 2022 at 10:40 AM Gyula Fóra <gy...@gmail.com> wrote:

> Hi Xingcan!
>
> If you are looking for checking the health of the deployed Flink jobs,
> status.jobStatus.state is a good place to start.
> At any given time that should represent the Flink Job Status. RUNNING means
> it's processing data other states mean that it is doing something else
> (restarting, failing etc.)
>
> This is a logic you can also apply on the session jobs.
>
> However I would not really say this is the state of a FlinkDeployment. A
> FlinkDeployment represents more than just a Flink job. Whether the job
> itself is failing or not depends mostly on the job logic.
> The operator cannot fix broken user jobs therefore from the operator
> perspective the FlinkDeployment is healthy as long as we can determine the
> correct status of it and it's reconciled to the spec that the user
> requested.
> For more information about this you can check this state diagram:
>
> https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/architecture/#flink-resource-lifecycle
>
> A side note: While true that the operator is in active development, the CRD
> (spec, status) did not change significantly since the initial stable
> release (1.0.0) in the last couple of months.
> The jobStatus is also one thing that did not change at all.
>
> Cheers,
> Gyula
>

Re: Argo CD health check for FlinkDeployment

Posted by Gyula Fóra <gy...@gmail.com>.

Hi Xingcan!

If you want to check the health of the deployed Flink jobs,
status.jobStatus.state is a good place to start.
At any given time it should represent the Flink job status. RUNNING means
the job is processing data; other states mean that it is doing something
else (restarting, failing, etc.).

You can apply the same logic to session jobs.

However, I would not really say this is the state of a FlinkDeployment. A
FlinkDeployment represents more than just a Flink job. Whether the job
itself is failing or not depends mostly on the job logic.
The operator cannot fix broken user jobs. Therefore, from the operator's
perspective, the FlinkDeployment is healthy as long as we can determine its
correct status and it has been reconciled to the spec that the user
requested.
For more information about this you can check this state diagram:
https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/concepts/architecture/#flink-resource-lifecycle
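
To illustrate the operator-side view, here is a hedged Python sketch (again,
a real Argo CD check would be written in Lua). The function name
`deployment_health` is hypothetical. The field name `status.lifecycleState`
and the state names used below are assumptions based on the resource
lifecycle described at the link above; please verify them against the
operator version you run.

```python
from typing import Any, Mapping

# Assumption: lifecycle states in which the resource matches the user's spec.
RECONCILED_STATES = {"DEPLOYED", "STABLE"}


def deployment_health(obj: Mapping[str, Any]) -> str:
    """Judge health by reconciliation state, not by the job's own state."""
    status = obj.get("status") or {}
    # Assumed field name; absent on resources the operator has not seen yet.
    lifecycle = str(status.get("lifecycleState") or "").upper()
    if lifecycle in RECONCILED_STATES:
        return "Healthy"
    if lifecycle == "FAILED":
        return "Degraded"
    # Everything else (e.g. upgrading, rolling back) is still in flight.
    return "Progressing"
```

Under this view, a deployment whose job keeps restarting because of user
code can still be reported as "Healthy", since the operator has reconciled
it to the requested spec.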

A side note: while it is true that the operator is in active development,
the CRD (spec, status) has not changed significantly since the initial
stable release (1.0.0) a couple of months ago.
The jobStatus in particular has not changed at all.

Cheers,
Gyula
