Posted to issues@flink.apache.org by "Gyula Fora (Jira)" <ji...@apache.org> on 2024/03/19 07:42:00 UTC

[jira] [Commented] (FLINK-34726) Flink Kubernetes Operator has some room for optimizing performance.

    [ https://issues.apache.org/jira/browse/FLINK-34726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828210#comment-17828210 ] 

Gyula Fora commented on FLINK-34726:
------------------------------------

Thanks for the detailed analysis [~Fei Feng] . You are completely right that we don't optimise the rest client usage, and that may add significant overhead. We have done a similar optimisation in the past for config access/generation by using the FlinkResourceContext class.

We could probably move the rest client generation logic there instead of hiding it completely under the FlinkService. This would, however, be a bigger change, as it will affect the methods of the FlinkService interface as well.
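To make the idea concrete, here is a minimal sketch of such a per-reconciliation context. All names here are illustrative, not the operator's actual API: the point is only that the context creates the expensive REST client at most once per reconcile loop and hands the same instance to every service call.

```java
import java.util.function.Supplier;

/**
 * Hypothetical sketch (illustrative names, not the real operator API):
 * a per-reconciliation context that builds the expensive client lazily,
 * at most once, so repeated service calls in one loop reuse it.
 */
final class ResourceContextSketch<C> {
    private final Supplier<C> clientFactory;
    private C client; // null until first use

    ResourceContextSketch(Supplier<C> clientFactory) {
        this.clientFactory = clientFactory;
    }

    /** Create the client on first access (e.g. a REST client), then reuse it. */
    C getOrCreateClient() {
        if (client == null) {
            client = clientFactory.get();
        }
        return client;
    }
}
```

In this pattern, the FlinkService methods would take the context (or the client obtained from it) as a parameter instead of constructing and closing a client internally, which is why the FlinkService interface would need to change.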

It sounds a bit strange that getSecondaryResource is so expensive, as that should be served from a cache. We should look into why it's expensive in the first place, because passing the FlinkDeployment objects around will make the code a bit more complicated, but I guess that could also be hidden under the FlinkSessionJobContext.
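In the same spirit, a hedged sketch of what hiding the lookup under a session job context could look like (again with made-up names; the lookup is assumed to return an Optional, as the secondary resource may be absent): resolve the FlinkDeployment once per reconcile loop and cache the result, instead of repeating the call.

```java
import java.util.Optional;
import java.util.function.Supplier;

/**
 * Illustrative sketch only: resolve the secondary resource (e.g. the owning
 * FlinkDeployment of a FlinkSessionJob) at most once per reconciliation and
 * cache the result for the rest of the loop.
 */
final class SessionJobContextSketch<R> {
    private final Supplier<Optional<R>> lookup;
    private Optional<R> cached; // null until the first lookup in this loop

    SessionJobContextSketch(Supplier<Optional<R>> lookup) {
        this.lookup = lookup;
    }

    /** First call performs the lookup; later calls return the cached result. */
    Optional<R> getSecondaryResourceCached() {
        if (cached == null) {
            cached = lookup.get();
        }
        return cached;
    }
}
```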

> Flink Kubernetes Operator has some room for optimizing performance.
> -------------------------------------------------------------------
>
>                 Key: FLINK-34726
>                 URL: https://issues.apache.org/jira/browse/FLINK-34726
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: kubernetes-operator-1.5.0, kubernetes-operator-1.6.0, kubernetes-operator-1.7.0
>            Reporter: Fei Feng
>            Priority: Major
>         Attachments: operator_no_submit_no_kill.flamegraph.html
>
>
> When there is a huge number of FlinkDeployments and FlinkSessionJobs in a Kubernetes cluster, there is a significant delay between an event being submitted to the reconcile thread pool and the event being processed.
> This is our test: we gave the operator ample resources (CPU: 10 cores, memory: 20 GB, reconcile thread pool size: 200) and first deployed 10000 jobs (one FlinkDeployment and one FlinkSessionJob per job), then ran job submit/delete tests. We found that:
> 1. It took about 2 minutes from creating new FlinkDeployment and FlinkSessionJob CRs in Kubernetes until the Flink job was submitted to the JobManager.
> 2. It took about 1 minute from deleting a FlinkDeployment and FlinkSessionJob CR until the Flink job and session cluster were cleaned up.
>  
> I used async-profiler to capture a flame graph while there was a huge number of FlinkDeployments and FlinkSessionJobs, and found two obvious areas for optimization:
> 1. For FlinkDeployment: in the observe step we call AbstractFlinkService.getClusterInfo/listJobs/getTaskManagerInfo, and every time we call these methods we create a RestClusterClient, send requests, and close it. I think we should reuse the RestClusterClient as much as possible to avoid frequently creating objects and to reduce GC pressure.
> 2. For FlinkSessionJob (this issue is more obvious): in the whole reconcile loop we call getSecondaryResource 5 times to get the FlinkDeployment resource info. Based on my current understanding of the Flink operator, I think we do not need to call it 5 times in a single reconcile loop; calling it once is enough. If so, we could save about 30% CPU usage (each getSecondaryResource call costs 6% CPU).
> [^operator_no_submit_no_kill.flamegraph.html]
> I hope we can discuss solutions to address this problem together. I'm very willing to optimize and resolve this issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)