Posted to dev@flink.apache.org by tom yang <en...@gmail.com> on 2023/04/03 17:46:53 UTC

TaskManager OutOfMemoryError: Metaspace out of memory with Python UDF

Hi Flink community,

I am running a session cluster with 1 GB of JVM metaspace. Each time I
submit and then cancel a Flink job containing a Python UDF, I notice
that metaspace usage gradually increases until the task manager is
eventually killed with an out-of-memory error.

To reproduce this error locally, I installed Flink 1.16.1 and PyFlink
1.16.1 with Python 3.9, using the word count example from the Table API
tutorial:
https://nightlies.apache.org/flink/flink-docs-master/docs/dev/python/table_api_tutorial/
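
For anyone who wants to try it without the tutorial script, even a
minimal Table API job with one scalar Python UDF should exercise the
same Python UDF code path; a rough sketch (placeholder names, not the
exact word count script):

from pyflink.table import DataTypes, EnvironmentSettings, TableEnvironment
from pyflink.table.expressions import col
from pyflink.table.udf import udf

# any Python UDF is enough to bring up the Python execution runtime
@udf(result_type=DataTypes.BIGINT())
def add_one(v):
    return v + 1

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t = t_env.from_elements([(1,), (2,), (3,)], ['v'])
t.select(add_one(col('v'))).execute().print()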

I submit the word count job to my local cluster via
./flink-1.16.1/bin/flink run -pyexec /opt/homebrew/bin/python3.9
--python wordcount.py

and then cancel the running job. Over time I can see from the Flink UI
that the metaspace usage is gradually increasing until the task manager
crashes with the following exception:

2023-03-29 10:17:19,270 ERROR
org.apache.flink.runtime.taskexecutor.TaskManagerRunner [] - Fatal
error occurred while executing the TaskManager. Shutting it down...
java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory
error has occurred. This can mean two things: either the job requires
a larger size of JVM metaspace to load classes or there is a class
loading leak. In the first case
'taskmanager.memory.jvm-metaspace.size' configuration option should be
increased. If the error persists (usually in cluster after several job
(re-)submissions) then there is probably a class loading leak in user
code or some of its dependencies which has to be investigated and
fixed. The task executor has to be shutdown...
at java.lang.ClassLoader.defineClass1(Native Method) ~[?:?]
at java.lang.ClassLoader.defineClass(ClassLoader.java:1017) ~[?:?]
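
Per the error message, I could raise
'taskmanager.memory.jvm-metaspace.size' in the cluster's
flink-conf.yaml, for example:

taskmanager.memory.jvm-metaspace.size: 2gb

but since the usage keeps growing with every submit/cancel cycle, that
would only seem to postpone the OOM rather than fix it.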

I noticed that a similar issue, caused by a leaky class loader, was
reported in https://issues.apache.org/jira/browse/FLINK-15338 and fixed
in version 1.10.


Has anyone else encountered similar issues?

Thanks,
Tom