Posted to issues@flink.apache.org by "Lukas Mahl (Jira)" <ji...@apache.org> on 2022/11/25 14:57:00 UTC
[jira] [Updated] (FLINK-30218) [Kafka Connector] java.lang.OutOfMemoryError: Metaspace
[ https://issues.apache.org/jira/browse/FLINK-30218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Lukas Mahl updated FLINK-30218:
-------------------------------
Description:
Hello!
I'm running a Flink application on AWS EMR that consumes from a Kafka topic using the official Flink Kafka consumer. The application runs as a Flink batch job every 30 minutes, and the JobManager's metaspace grows with every job submission and is never reclaimed once a job has finished executing. Eventually the metaspace overflows and I get a java.lang.OutOfMemoryError: Metaspace. I've tried increasing the metaspace to 512m, but that only delays the problem, so this looks like a classloader leak.
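To make the growth concrete: the Metaspace pool can be sampled between job submissions. A minimal sketch of such a check using the standard java.lang.management API (illustration only, not code from my job):
{code:java}
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;

public class MetaspaceUsage {
    public static void main(String[] args) {
        // Print the Metaspace pool's usage; sampling this between job
        // submissions shows used/committed bytes climbing even though
        // the previous jobs have finished.
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                System.out.printf("Metaspace used=%,d bytes committed=%,d bytes%n",
                        pool.getUsage().getUsed(), pool.getUsage().getCommitted());
            }
        }
    }
}
{code}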
To narrow it down I created a stripped-down Flink application containing only a Kafka consumer, and the issue still occurred, so I suspect the leak is somewhere in the Kafka consumer. The only other third-party dependency in use at that point was the AWS MSK IAM auth library (software.amazon.msk:aws-msk-iam-auth:1.1.5).
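For reference, the stripped-down application looked roughly like the sketch below (reconstructed for illustration; topic, bootstrap servers, and group id are placeholders, and the MSK IAM properties mirror the client config further down):
{code:java}
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class MetaspaceLeakRepro {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        KafkaSource<String> source = KafkaSource.<String>builder()
                .setBootstrapServers("b-1.example:9098")        // placeholder
                .setTopics("input-topic")                       // placeholder
                .setGroupId("metaspace-repro")                  // placeholder
                .setStartingOffsets(OffsetsInitializer.earliest())
                .setBounded(OffsetsInitializer.latest())        // bounded read for the batch-style run
                .setValueOnlyDeserializer(new SimpleStringSchema())
                // MSK IAM auth, matching the environment section below
                .setProperty("security.protocol", "SASL_SSL")
                .setProperty("sasl.mechanism", "AWS_MSK_IAM")
                .setProperty("sasl.jaas.config",
                        "software.amazon.msk.auth.iam.IAMLoginModule required;")
                .setProperty("sasl.client.callback.handler.class",
                        "software.amazon.msk.auth.iam.IAMClientCallbackHandler")
                .build();

        env.fromSource(source, WatermarkStrategy.noWatermarks(), "kafka-source").print();
        env.execute("metaspace-repro");
    }
}
{code}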
I also tried debugging the issue by generating a heap dump as described in [https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks], but I wasn't able to pinpoint the origin of the leak. I can see references to org.apache.kafka.common.utils.AppInfoParser$AppInfo, though:
!image-2022-11-25-15-55-43-559.png|width=860,height=359!
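As far as I understand, AppInfoParser registers one app-info MBean per Kafka client id, and an MBean that is never unregistered can keep the per-job user classloader reachable. A small diagnostic sketch (my own check, not from the wiki page) that lists Kafka MBeans surviving job teardown:
{code:java}
import java.lang.management.ManagementFactory;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class LeftoverKafkaMBeans {
    public static void main(String[] args) throws Exception {
        MBeanServer server = ManagementFactory.getPlatformMBeanServer();
        // AppInfoParser registers kafka.producer/kafka.consumer app-info
        // MBeans; any that outlive a finished job may be pinning its
        // user classloader (and hence its classes) in metaspace.
        for (ObjectName name : server.queryNames(new ObjectName("kafka.*:type=app-info,*"), null)) {
            System.out.println(name);
        }
    }
}
{code}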
I have attached the heap dump. If you need any more information about my setup, feel free to ask; I can provide whatever is needed.
Many thanks in advance!
> [Kafka Connector] java.lang.OutOfMemoryError: Metaspace
> -------------------------------------------------------
>
> Key: FLINK-30218
> URL: https://issues.apache.org/jira/browse/FLINK-30218
> Project: Flink
> Issue Type: Bug
> Components: Connectors / Kafka
> Affects Versions: 1.15.1
> Environment: +*AWS EMR*+
> * Standard AWS EMR cluster (1 master YARN node, 1 core node), 2 vCores and 8 GB memory each
> * JDK 11 (Amazon Corretto)
> +*Kafka Producer Config (as logged by the client)*+
> {code:java}
> acks = 1
> batch.size = 16384
> bootstrap.servers = [...]
> buffer.memory = 33554432
> client.dns.lookup = use_all_dns_ips
> client.id = producer-1
> compression.type = none
> connections.max.idle.ms = 540000
> delivery.timeout.ms = 120000
> enable.idempotence = false
> interceptor.classes = []
> internal.auto.downgrade.txn.commit = false
> key.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer
> linger.ms = 0
> max.block.ms = 60000
> max.in.flight.requests.per.connection = 5
> max.request.size = 1048576
> metadata.max.age.ms = 300000
> metadata.max.idle.ms = 300000
> metric.reporters = []
> metrics.num.samples = 2
> metrics.recording.level = INFO
> metrics.sample.window.ms = 30000
> partitioner.class = class org.apache.kafka.clients.producer.internals.DefaultPartitioner
> receive.buffer.bytes = 32768
> reconnect.backoff.max.ms = 1000
> reconnect.backoff.ms = 50
> request.timeout.ms = 30000
> retries = 2147483647
> retry.backoff.ms = 100
> sasl.client.callback.handler.class = class software.amazon.msk.auth.iam.IAMClientCallbackHandler
> sasl.jaas.config = [hidden]
> sasl.kerberos.kinit.cmd = /usr/bin/kinit
> sasl.kerberos.min.time.before.relogin = 60000
> sasl.kerberos.service.name = null
> sasl.kerberos.ticket.renew.jitter = 0.05
> sasl.kerberos.ticket.renew.window.factor = 0.8
> sasl.login.callback.handler.class = null
> sasl.login.class = null
> sasl.login.refresh.buffer.seconds = 300
> sasl.login.refresh.min.period.seconds = 60
> sasl.login.refresh.window.factor = 0.8
> sasl.login.refresh.window.jitter = 0.05
> sasl.mechanism = AWS_MSK_IAM
> security.protocol = SASL_SSL
> security.providers = null
> send.buffer.bytes = 131072
> socket.connection.setup.timeout.max.ms = 30000
> socket.connection.setup.timeout.ms = 10000
> ssl.cipher.suites = null
> ssl.enabled.protocols = [TLSv1.2, TLSv1.3]
> ssl.endpoint.identification.algorithm = https
> ssl.engine.factory.class = null
> ssl.key.password = null
> ssl.keymanager.algorithm = SunX509
> ssl.keystore.certificate.chain = null
> ssl.keystore.key = null
> ssl.keystore.location = null
> ssl.keystore.password = null
> ssl.keystore.type = JKS
> ssl.protocol = TLSv1.3
> ssl.provider = null
> ssl.secure.random.implementation = null
> ssl.trustmanager.algorithm = PKIX
> ssl.truststore.certificates = null
> ssl.truststore.location = null
> ssl.truststore.password = null
> ssl.truststore.type = JKS
> transaction.timeout.ms = 3600000
> transactional.id = null
> value.serializer = class org.apache.kafka.common.serialization.ByteArraySerializer{code}
> +*Flink Config*+
> {code:java}
> taskmanager.memory.process.size=3g
> taskmanager.memory.jvm-metaspace.size=512m
> taskmanager.numberOfTaskSlots=2
> jobmanager.memory.process.size=3g
> jobmanager.memory.jvm-metaspace.size=256m
> jobmanager.web.address=0.0.0.0
> env.java.opts.all=-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=${FLINK_LOG_PREFIX}.hprof {code}
>
> Reporter: Lukas Mahl
> Priority: Major
> Attachments: dump.hprof.zip, image-2022-11-25-15-55-43-559.png
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)