Posted to user@flink.apache.org by marco andreas <ma...@gmail.com> on 2023/01/25 19:55:52 UTC

OOM taskmanager

Hello,

We are deploying a Flink application cluster on Kubernetes with 2 pods, one
for the JM and the other for the TM.

The problem is that when we launch load tests, the task manager memory
usage increases. After the tests are finished and Flink stops processing
data, the memory usage never comes back down to where it was before. When
we launch tests again and again, the memory of the TM continues to grow
until it reaches the memory resource limit specified in the container
templates, and the TM gets killed because of OOM.


Has anyone faced the same issue, and what is the best way to investigate
this error in order to find the root cause of why the memory usage of the
TM never comes down after Flink finishes processing?

Flink version is 1.16.0.
Thanks,

Re: OOM taskmanager

Posted by weijie guo <gu...@gmail.com>.

Hi Marco,

I think you may need to take a heap dump of the TM to check whether there
is a memory leak.

Best regards,

Weijie

Re: OOM taskmanager

Posted by "Teoh, Hong" <li...@amazon.co.uk>.

Hi Marco,

When you say OOM, I assume you mean the TM pod being OOMKilled, is that correct? If so, this usually means that the TM is using more memory than is actually allocated to the pod. First I would check your memory configuration to figure out where this extra memory use is coming from. This is a non-trivial task, so I’ll list some common situations I’ve seen in the past to get you started.


  *   Misconfigured process memory. The Flink configuration option `taskmanager.memory.process.size` sets the memory of the entire TM process, which Flink then breaks down into smaller buckets. If this value is higher than the memory resource limit of the container, the pod will end up being OOMKilled.
  *   User code has a memory leak (e.g. it spins up too many threads). It would be useful to run your Flink job on a local cluster and monitor the memory use.
  *   The state backend (if you use RocksDB) is using too much memory.
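
As a rough illustration of the first check above, here is a minimal Python sketch that compares a `taskmanager.memory.process.size` value against a container memory limit. The helper names (`parse_size`, `check_tm_memory`) are made up for this example and are not part of Flink or Kubernetes, and the parsing is simplified: it only handles whole numbers and assumes Flink sizes are binary-based (`m` = MiB) while Kubernetes quantities use `Mi`/`Gi` for binary and `M`/`G` for decimal units.

```python
import re

# Assumption: Flink memory sizes such as "1728m" or "4g" use binary multipliers.
FLINK_UNITS = {"": 1, "b": 1, "k": 1024, "kb": 1024, "m": 1024**2, "mb": 1024**2,
               "g": 1024**3, "gb": 1024**3, "t": 1024**4, "tb": 1024**4}

# Kubernetes quantities: decimal suffixes (k, M, G, T) and binary ones (Ki, Mi, Gi, Ti).
K8S_UNITS = {"": 1, "k": 1000, "M": 1000**2, "G": 1000**3, "T": 1000**4,
             "Ki": 1024, "Mi": 1024**2, "Gi": 1024**3, "Ti": 1024**4}


def parse_size(text, units, case_sensitive):
    """Convert a size string such as '1728m' or '2Gi' to bytes (whole numbers only)."""
    match = re.fullmatch(r"\s*(\d+)\s*([A-Za-z]*)\s*", text)
    if not match:
        raise ValueError(f"cannot parse memory size: {text!r}")
    number, unit = match.groups()
    key = unit if case_sensitive else unit.lower()
    if key not in units:
        raise ValueError(f"unknown unit in memory size: {text!r}")
    return int(number) * units[key]


def check_tm_memory(process_size, container_limit):
    """Return (fits, headroom_bytes) for a Flink process size and a pod memory limit."""
    flink_bytes = parse_size(process_size, FLINK_UNITS, case_sensitive=False)
    limit_bytes = parse_size(container_limit, K8S_UNITS, case_sensitive=True)
    return flink_bytes <= limit_bytes, limit_bytes - flink_bytes


if __name__ == "__main__":
    # Example values only; substitute the ones from your own configuration.
    for process_size, limit in [("1728m", "2Gi"), ("4g", "4G")]:
        fits, headroom = check_tm_memory(process_size, limit)
        print(f"{process_size} vs limit {limit}: fits={fits}, headroom={headroom} bytes")
```

The second example pair shows how a unit mismatch alone can cause trouble: `4g` on the Flink side is larger than a `4G` container limit, so the process size would not fit even though the numbers look the same.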

You can also look at [1] and [2] for more information.

Regards,
Hong

[1] Talk on Flink memory utilisation https://www.youtube.com/watch?v=F5yKSznkls8
[2] Flink description of TM memory breakdown https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup_tm/

