Posted to user@flink.apache.org by Christopher Gustafson <ch...@kth.se> on 2022/05/30 06:29:19 UTC

Large backpressure and slow checkpoints in StateFun

Hi,


I am running some benchmarks with StateFun and have run into backpressure and slow checkpoints whose cause I can't figure out. I was hoping that someone might have an idea of what is causing them. My setup is the following:


I am running the Shopping Cart application from the StateFun playground. The job is submitted as an uber jar to an existing Flink Cluster with 3 TaskManagers and 1 JobManager. The functions are served using the Undertow example from the documentation and I am using Kafka ingresses and egresses. My workload is only at 1000 events/s. Everything is run in separate GCP VMs.


The issue is very long checkpoints, which I assume are caused by a backpressured ingress, which in turn is caused by the function dispatcher operator not being able to keep up with the workload. The only thing that has helped so far is increasing the parallelism of the job, but it feels like there is still some other bottleneck causing the issues. I have seen other benchmarks reach much higher throughput than 1000 events/s without more CPU or memory resources than I am using.


Any ideas of bottlenecks or ways to figure them out are greatly appreciated.


Best Regards,

Christopher Gustafson

Re: Large backpressure and slow checkpoints in StateFun

Posted by yuxia <lu...@alumni.sjtu.edu.cn>.
Maybe you can use jstack or a flame graph to analyze where the bottleneck is.
BTW, for generating flame graphs, Arthas [1] is a good tool.

[1] https://github.com/alibaba/arthas 
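
One way to act on the jstack suggestion is to take a few thread dumps of a TaskManager while the job is backpressured and count which method each RUNNABLE thread is currently executing; a method that keeps showing up at the top is a candidate bottleneck. The following is only a minimal sketch, assuming the usual HotSpot text layout of jstack output. The helper name top_frames is hypothetical and not part of Flink, StateFun, or Arthas.

```python
import re
from collections import Counter

# Assumed jstack layout (HotSpot text format):
#   "thread-name" #12 daemon prio=5 ... runnable [0x...]
#      java.lang.Thread.State: RUNNABLE
#   \tat some.package.Class.method(File.java:123)
THREAD_HEADER = re.compile(r'^"[^"]*"')
THREAD_STATE = re.compile(r'^\s+java\.lang\.Thread\.State: (?P<state>\w+)')
FRAME = re.compile(r'^\s+at (?P<frame>[^\s(]+)\(')


def top_frames(dump_text, state="RUNNABLE"):
    """Count the topmost stack frame of every thread in the given state.

    Hypothetical helper for illustration; it only parses text and does not
    talk to a JVM.
    """
    counts = Counter()
    current_state = None   # state of the thread currently being parsed
    seen_top = False       # True once the first frame of that thread was read
    for line in dump_text.splitlines():
        if THREAD_HEADER.match(line):
            # A new thread section starts: reset per-thread bookkeeping.
            current_state = None
            seen_top = False
            continue
        state_match = THREAD_STATE.match(line)
        if state_match:
            current_state = state_match.group("state")
            continue
        frame_match = FRAME.match(line)
        if frame_match and not seen_top and current_state == state:
            counts[frame_match.group("frame")] += 1
            seen_top = True
    return counts


# Example use, after saving a dump with `jstack <pid> > dump.txt`:
#   with open("dump.txt") as f:
#       print(top_frames(f.read()).most_common(5))
```

Merging the counters from several dumps taken a few seconds apart gives a rough sampling profile; a flame graph shows the same kind of information with full stack context.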

Best regards, 
Yuxia 


From: "Christopher Gustafson" <ch...@kth.se>
To: "User" <us...@flink.apache.org>
Sent: Monday, May 30, 2022, 2:29:19 PM
Subject: Large backpressure and slow checkpoints in StateFun


