Posted to user@flink.apache.org by Tianyi Deng <td...@blizzard.com> on 2022/01/04 18:22:56 UTC

Pod Disruption in Flink Kubernetes Cluster

Hello Flink community,

We have a Flink cluster deployed to AWS EKS along with many other applications. The cluster is managed by Spotify’s Flink operator. After deployment I noticed that the StatefulSet pods of the job manager and task managers intermittently receive SIGTERM and shut down. I assume this has something to do with voluntary pod disruption from the K8s descheduler, perhaps due to node draining as other applications’ pods scale up and down, or for other reasons. This seems inevitable, since K8s routinely moves pods around, but it causes the Flink job to restart every time, which feels quite unstable.

Has anyone else seen this kind of voluntary pod disruption in a Flink cluster on K8s? Are there any best practices or recommendations for operating Flink on K8s?

Thanks,
Tianyi

Re: Pod Disruption in Flink Kubernetes Cluster

Posted by Yang Wang <da...@gmail.com>.
The Flink applications might run more stably if you configure enough
resources (e.g. memory, CPU, ephemeral-storage) for the JobManager and
TaskManager pods.
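
The pod-level requests and limits themselves are typically set in the
operator's cluster spec, so the sketch below only shows the Flink
process-memory settings, which should line up with whatever the pods
are given (the sizes and the class name are placeholders, not
recommendations):

    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.JobManagerOptions;
    import org.apache.flink.configuration.MemorySize;
    import org.apache.flink.configuration.TaskManagerOptions;

    public class MemorySettingsSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Equivalent flink-conf.yaml keys:
            //   jobmanager.memory.process.size
            //   taskmanager.memory.process.size
            // Keep these at or below the pod memory limits so the
            // container is not killed for exceeding them.
            conf.set(JobManagerOptions.TOTAL_PROCESS_MEMORY,
                     MemorySize.parse("2g"));
            conf.set(TaskManagerOptions.TOTAL_PROCESS_MEMORY,
                     MemorySize.parse("4g"));
            System.out.println(conf);
        }
    }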

Best,
Yang

Re: Pod Disruption in Flink Kubernetes Cluster

Posted by David Morávek <dm...@apache.org>.
Hi Tianyi,

this really depends on your Kubernetes setup (e.g. whether autoscaling is
enabled, or you're using spot / preemptible instances). In general,
applications that run on Kubernetes need to be resilient to these kinds
of failures, and Flink is no exception.
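
On the Flink side, resilience mostly means the job restarts itself
automatically once replacement resources are available. A minimal
sketch of wiring that up (the attempt count, delay, and class name are
placeholder choices):

    import org.apache.flink.api.common.restartstrategy.RestartStrategies;
    import org.apache.flink.api.common.time.Time;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class RestartStrategySketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
            // Retry up to 10 times, waiting 30 seconds between attempts;
            // tune both numbers to how disruptive restarts are for you.
            env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(10, Time.seconds(30)));
        }
    }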

In case of a failure, Flink needs to restart the job from the latest
checkpoint to ensure consistency. In this kind of environment, you should
be OK-ish with replaying one checkpoint interval's worth of data (and you
can adjust the checkpointing interval).
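
Here is a minimal sketch of adjusting the interval; 60 seconds is just a
placeholder, pick whatever replay window you can tolerate. The same
setting is also available as execution.checkpointing.interval in
flink-conf.yaml:

    import org.apache.flink.streaming.api.CheckpointingMode;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

    public class CheckpointIntervalSketch {
        public static void main(String[] args) {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
            // Checkpoint every 60s: after a pod disruption the job replays
            // at most roughly one minute of data from the last completed
            // checkpoint.
            env.enableCheckpointing(60_000L, CheckpointingMode.EXACTLY_ONCE);
            // Leave some breathing room between consecutive checkpoints.
            env.getCheckpointConfig().setMinPauseBetweenCheckpoints(10_000L);
        }
    }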

Still, it would be worth looking into why these disruptions happen and
fixing the cause. Even though you should be able to recover from these
types of failures, that doesn't mean it's a good thing to do so more often
than necessary :) If you describe the pod / StatefulSet you should see the
k8s events that resulted in the container being terminated.

We're also currently working on several efforts to make the restart
experience smoother and the checkpointing interval shorter (e.g. FLIP-198
[1], FLINK-25277 [2], FLIP-158 [3], ...).

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-198%3A+Working+directory+for+Flink+processes
[2] https://issues.apache.org/jira/browse/FLINK-25277
[3]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-158%3A+Generalized+incremental+checkpoints

Best,
D.
