Posted to user@flink.apache.org by "Christian Krudewig (Corporate Development) via user" <us...@flink.apache.org> on 2023/01/22 16:59:40 UTC

Flink Statefun: How to find the performance bottleneck?

Hi fellow Flink users,

 

I'd like to seek advice on how to find the performance bottleneck of a
Stateful Functions pipeline. The throughput is too low. Ideally we could
push it to 2000 messages/s, but I can't get it above 100/s. The pipeline
quickly comes under backpressure.

 

Some facts: 

*	The pipeline is running on a powerful Kubernetes cluster, with the
RocksDB state backend writing to a Hadoop volume.
*	There are six functions; only one of them makes use of state.
*	Ingress and egress are via Kafka.
*	The pipeline is set to "exactly once" semantics with checkpoints
every 10 seconds.

 

Here is a picture from the Flink UI, showing that the active ingress is
backpressured. The functions task has subtasks which take turns being
100% busy:



 

What I tried:

*	Scaled up all function deployments heavily, although each container
is under low load
*	Increased the memory for the task managers to 16 GB each
*	Increased the parallelism from 3 to 7 task managers
*	Tried switching on "buffer debloating"
(https://nightlies.apache.org/flink/flink-docs-master/docs/ops/state/checkpointing_under_backpressure/)
*	Set execution.checkpointing.aligned-checkpoint-timeout: 300sec,
because I saw 
*	Increased "maxNumBatchRequests" for all functions
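
As a rough sketch, the Flink-side settings mentioned in the two lists above would look something like this in flink-conf.yaml. The values (10 second checkpoints, exactly once, RocksDB, 16 GB, 300sec timeout) are the ones stated in this mail. Which memory option was raised is an assumption (taskmanager.memory.process.size is used here), and "maxNumBatchRequests" is left out because it is set in the StateFun module definition rather than in this file.

```yaml
# Sketch only: standard Flink option names, values as described in this mail.
state.backend: rocksdb
execution.checkpointing.interval: 10s
execution.checkpointing.mode: EXACTLY_ONCE
# Timeout tried above; the unit spelling is kept as written in the mail.
execution.checkpointing.aligned-checkpoint-timeout: 300sec
# "Buffer debloating" switch described on the linked docs page.
taskmanager.network.memory.buffer-debloat.enabled: true
# Assumption: the 16 GB was set via the total process size.
taskmanager.memory.process.size: 16g
```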

 

I hope this is everything; I tried so many things.

 

How can I figure out why the pipeline is slow, i.e. what the bottleneck is?

 

Thanks for any advice.

 

Best,

 

Christian

 

--

Dr. Christian Krudewig
Corporate Development - Data Analytics

Deutsche Post DHL