Posted to dev@flink.apache.org by "Lukas Mahl (Jira)" <ji...@apache.org> on 2022/11/25 14:57:00 UTC
[jira] [Created] (FLINK-30218) [Kafka Connector] java.lang.OutOfMemoryError: Metaspace
Lukas Mahl created FLINK-30218:
----------------------------------
Summary: [Kafka Connector] java.lang.OutOfMemoryError: Metaspace
Key: FLINK-30218
URL: https://issues.apache.org/jira/browse/FLINK-30218
Project: Flink
Issue Type: Bug
Components: Connectors / Kafka
Affects Versions: 1.15.1
Environment: +*AWS EMR*+
* Standard AWS EMR Cluster (1 master YARN node, 1 core node) - 2 vCore, 8 GB memory each
* JDK 11 Coretto
+*Kafka Producer Config (as logged by the client)*+
{code:java}
acks = 1
batch.size = 16384
bootstrap.servers = [...]
buffer.memory = 33554432
client.dns.lookup = use_all_dns_ips
client.id = producer-1
compression.type = none
connections.max.idle.ms = 540000
delivery.timeout.ms = 120000
enable.idempotence = false
interceptor.classes = []
internal.auto.downgrade.txn.commit = false
key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
linger.ms = 0
max.block.ms = 60000
max.in.flight.requests.per.connection = 5
max.request.size = 1048576
metadata.max.age.ms = 300000
metadata.max.idle.ms = 300000
metric.reporters = []
metrics.num.samples = 2
metrics.recording.level = INFO
metrics.sample.window.ms = 30000
partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
receive.buffer.bytes = 32768
reconnect.backoff.max.ms = 1000
reconnect.backoff.ms = 50
request.timeout.ms = 30000
retries = 2147483647
retry.backoff.ms = 100
sasl.client.callback.handler.class = class software.amazon.msk.auth.iam.IAMClientCallbackHandler
sasl.jaas.config = [hidden]
sasl.kerberos.kinit.cmd = /usr/bin/kinit
sasl.kerberos.min.time.before.relogin = 60000
sasl.kerberos.service.name = null
sasl.kerberos.ticket.renew.jitter = 0.05
sasl.kerberos.ticket.renew.window.factor = 0.8
sasl.login.callback.handler.class = null
sasl.login.class = null
sasl.login.refresh.buffer.seconds = 300
sasl.login.refresh.min.period.seconds = 60
sasl.login.refresh.window.factor = 0.8
sasl.login.refresh.window.jitter = 0.05
sasl.mechanism = AWS_MSK_IAM
security.protocol = SASL_SSL
security.providers = null
send.buffer.bytes = 131072
socket.connection.setup.timeout.max.ms = 30000
socket.connection.setup.timeout.ms = 10000
ssl.cipher.suites = null
ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
ssl.endpoint.identification.algorithm = https
ssl.engine.factory.class = null
ssl.key.password = null
ssl.keymanager.algorithm = SunX509
ssl.keystore.certificate.chain = null
ssl.keystore.key = null
ssl.keystore.location = null
ssl.keystore.password = null
ssl.keystore.type = JKS
ssl.protocol = TLSv1.3
ssl.provider = null
ssl.secure.random.implementation = null
ssl.trustmanager.algorithm = PKIX
ssl.truststore.certificates = null
ssl.truststore.location = null
ssl.truststore.password = null
ssl.truststore.type = JKS
transaction.timeout.ms = 3600000
transactional.id = null
value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer{code}
+*Flink Config*+
{code:java}
taskmanager.memory.process.size=3g
taskmanager.memory.jvm-metaspace.size=512m
taskmanager.numberOfTaskSlots=2
jobmanager.memory.process.size=3g
jobmanager.memory.jvm-metaspace.size=256m
jobmanager.web.address=0.0.0.0
env.java.opts.all=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${FLINK_LOG_PREFIX}.hprof {code}
Reporter: Lukas Mahl
Attachments: dump.hprof.zip, image-2022-11-25-15-55-43-559.png
Hello!
I'm running a Flink application on AWS EMR which consumes from a Kafka topic using the official Flink Kafka consumer. I run the application as a Flink batch job every 30 minutes, and I can see that the JobManager's metaspace usage increases with every job submission and is not released after a job finishes. Eventually the metaspace fills up and I get an OutOfMemoryError: Metaspace. I've tried increasing the metaspace to 512m, but that only delays the problem, so it is almost certainly a classloading leak.
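For reference, the per-submission metaspace growth described above can be observed from inside the JVM with plain JDK APIs (a minimal sketch with no Flink dependencies; the class name is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

// Prints current metaspace usage. Sampling this before and after each
// job submission makes the monotonic growth visible without a heap dump.
public class MetaspaceProbe {
    public static void main(String[] args) {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                long usedMiB = pool.getUsage().getUsed() / (1024 * 1024);
                System.out.println("Metaspace used: " + usedMiB + " MiB");
            }
        }
    }
}
```

The same numbers are exposed by Flink's own `Status.JVM.Memory.Metaspace.Used` metric, so a metrics reporter works just as well if one is already configured.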
I debugged the issue by creating a minimal Flink application containing only a Kafka consumer, and the issue still occurred, so I suspect the leak is somewhere in the Kafka consumer. The only other third-party dependency I was using in that test was the AWS IAM Kafka library (software.amazon.msk:aws-msk-iam-auth:1.1.5).
I also tried debugging the issue by generating a heap dump as described in [https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks], but I wasn't able to spot the origin of the memory leak. I can see references to org.apache.kafka.common.utils.AppInfoParser$AppInfo though:
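One way to cross-check the AppInfo suspicion is to query the platform MBean server: AppInfoParser registers one app-info MBean per Kafka client, and the client's close() unregisters it, so matches that linger after a job has finished point at clients that were never closed (a plain-JDK sketch; the class name is illustrative):

```java
import java.lang.management.ManagementFactory;
import java.util.Set;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Kafka's AppInfoParser registers an MBean named
// kafka.<client-type>:type=app-info,id=<client.id> for every client and
// unregisters it on close(). MBeans still registered after a job has
// finished keep referencing classes from the job's classloader.
public class AppInfoMBeanCheck {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        ObjectName pattern = new ObjectName("kafka.*:type=app-info,*");
        Set<ObjectName> leftovers = server.queryNames(pattern, null);
        System.out.println("app-info MBeans still registered: " + leftovers.size());
        for (ObjectName name : leftovers) {
            System.out.println("  " + name);
        }
    }
}
```

In a JVM with no live Kafka clients the query should return an empty set; a count that grows with each submitted job would match the heap-dump picture above.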
!image-2022-11-25-15-55-43-559.png|width=666,height=278!
I have attached the heap dump - if you need any more information about my setup, feel free to ask; I can provide anything you need.
Many thanks in advance!
--
This message was sent by Atlassian Jira
(v8.20.10#820010)