Posted to user@flink.apache.org by Talat Uyarer via user <us...@flink.apache.org> on 2023/04/27 07:14:45 UTC

Flink Kubernetes Operator Scale Issue

Hi All,

We are using the Flink Kubernetes Operator in production. We have 3k+ jobs
in standalone mode, but beyond about 2.5k jobs the operator gets slow. Now,
when we submit a job, it takes 10+ minutes for the job to start running. Is
anyone running at a similar scale, or with more jobs?

We currently run the operator as a single pod. Does the operator support
multiple pods if I increase the replicas?

Do you have any suggestions on where I should start looking to debug this?

Thanks

Re: Flink Kubernetes Operator Scale Issue

Posted by Gyula Fóra <gy...@gmail.com>.

Hi!

It’s currently not possible to run the operator in parallel by simply
adding more replicas. However, there are several things you can do to
scale both vertically and horizontally.

First of all, you can run multiple operators, each watching a different set
of namespaces, to partition the load.
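
As a minimal sketch of the namespace partitioning idea, the Python below
assigns each namespace to exactly one operator instance and builds the
comma-separated list each instance would watch. The function names are
hypothetical. The operator is understood to take this list via the
`kubernetes.operator.watched.namespaces` option (exposed as `watchNamespaces`
in the Helm chart); please verify the key against your operator version.

```python
import zlib
from collections import defaultdict


def partition_namespaces(namespaces, num_operators):
    """Assign every namespace to exactly one operator instance (hypothetical helper)."""
    if num_operators < 1:
        raise ValueError("num_operators must be at least 1")
    groups = defaultdict(list)
    for ns in sorted(namespaces):
        # crc32 is stable across processes, unlike Python's built-in hash().
        shard = zlib.crc32(ns.encode("utf-8")) % num_operators
        groups[shard].append(ns)
    # Return an entry for every operator, even if it received no namespaces.
    return {i: groups.get(i, []) for i in range(num_operators)}


def watched_namespaces_value(namespaces):
    """Render the comma-separated value for one operator's watched namespaces."""
    return ",".join(namespaces)


if __name__ == "__main__":
    example = [f"team-{i}" for i in range(10)]  # made-up namespace names
    for operator_id, group in partition_namespaces(example, 3).items():
        print(operator_id, watched_namespaces_value(group))
```

Hashing keeps an existing namespace on the same operator when new namespaces
are added, at the cost of a possibly uneven split. A round-robin split over
the sorted list would be more even but reshuffles assignments as the list
grows.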

The operator also supports watching only CRs that match a given label
selector, which would allow you to partition the load horizontally with
custom CR labels if necessary.
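
A rough sketch of label-based partitioning, assuming a made-up label key
`flink-operator-shard`: each CR gets a shard label derived from its namespace
and name, and each operator instance is given the matching selector. The
selector is understood to be set through the
`kubernetes.operator.label.selector` option; treat that key, and the helper
names below, as assumptions to check against your operator version.

```python
import zlib

SHARD_LABEL = "flink-operator-shard"  # hypothetical label key, not an operator default


def shard_for(namespace, name, num_shards):
    """Pick a stable shard index for a CR from its namespace and name."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    key = f"{namespace}/{name}".encode("utf-8")
    return zlib.crc32(key) % num_shards


def cr_labels(namespace, name, num_shards):
    """Labels to add to the CR's metadata so one operator picks it up."""
    return {SHARD_LABEL: str(shard_for(namespace, name, num_shards))}


def selector_for(shard):
    """Equality-based Kubernetes label selector for one operator instance."""
    return f"{SHARD_LABEL}={shard}"


if __name__ == "__main__":
    print(cr_labels("team-a", "wordcount", 4))
    print(selector_for(0))
```

Because the label is set when the CR is created, the submitting tool (not the
operator) decides which operator instance owns each job.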

You can also try increasing the reconciler parallelism of the operator to
use more threads and reconcile more CRs in parallel. If you increase this,
you might need to increase the heap size as well.
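
To see why reconciler parallelism matters at this scale, here is a
back-of-the-envelope model, not a measurement. It assumes a fixed pool of
reconcile threads and a made-up cost of 2 seconds per reconcile; only the
3,000 job count comes from this thread. The relevant setting is understood to
be `kubernetes.operator.reconcile.parallelism`; verify the key and its default
for your operator version.

```python
import math


def full_pass_seconds(num_jobs, parallelism, seconds_per_reconcile):
    """Rough time for one reconcile pass over every CR with a fixed worker pool."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    # Each "wave" reconciles up to `parallelism` CRs at the same time.
    waves = math.ceil(num_jobs / parallelism)
    return waves * seconds_per_reconcile


if __name__ == "__main__":
    # 2.0 s per reconcile is a placeholder assumption; measure your own value.
    for workers in (5, 50, 200):
        estimate = full_pass_seconds(3000, workers, seconds_per_reconcile=2.0)
        print(f"parallelism={workers:>3}: ~{estimate:.0f}s per full pass")
```

The model ignores API server latency, rate limiting, and memory pressure, so
it only indicates the direction of the effect: more threads shorten a full
pass, while each in-flight reconcile holds state on the heap.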

Let me know if this helps!

Gyula
