Posted to issues@flink.apache.org by "Thomas Wozniakowski (Jira)" <ji...@apache.org> on 2020/02/18 16:46:00 UTC

[jira] [Commented] (FLINK-16142) Memory Leak causes Metaspace OOM error on repeated job submission

    [ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17039227#comment-17039227 ] 

Thomas Wozniakowski commented on FLINK-16142:
---------------------------------------------

Hi Guys,

Thanks for the speedy response.

It's going to be a little tricky to get at the heap dumps, as our application is written specifically to run on Flink inside Docker. Do you know of any configuration options we could pass in from outside that would get the dockerised Flink to dump more useful information to the logs when the Metaspace OOM error occurs?
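Something like the following is what I had in mind (an untested sketch; the dump path and metaspace size below are placeholders rather than values we actually use), assuming the official Docker image still merges FLINK_PROPERTIES into flink-conf.yaml:

{code}
# flink-conf.yaml (or appended via the FLINK_PROPERTIES env var of the official image)
# Untested sketch: dump path and size are placeholders.
env.java.opts: "-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/opt/flink/log/taskmanager.hprof"

# 1.10 also exposes the metaspace limit directly, which at least makes the failure
# reproducible at a known size:
taskmanager.memory.jvm-metaspace.size: 256m
{code}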

I'll take a look at our connectors. We are currently using:

- Kinesis source, official build from Maven (now that it is published there). Prior to 1.10.0 we built it from source internally.
- Custom SQS sink. Basically just dumps events straight onto a queue using the AWS SDK (version 1, blocking). This code has not changed since it was originally written (for Flink 1.3). 

We do not explicitly shut down the background threads that the SQS SDK presumably spins up. I will have a look and see whether we can explicitly close them when the job shuts down.
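For the record, the change I have in mind looks roughly like the sketch below (class, field and queue handling are placeholders, not our actual sink): build the SDK v1 client in open() and call shutdown() on it in close(), so the SDK's resources are released when the job is cancelled instead of pinning the user-code classloader.

{code:java}
// Rough sketch only: class and field names are placeholders, not our real sink.
import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;

public class SqsSink extends RichSinkFunction<String> {

    private final String queueUrl;
    private transient AmazonSQS sqs;

    public SqsSink(String queueUrl) {
        this.queueUrl = queueUrl;
    }

    @Override
    public void open(Configuration parameters) {
        // Blocking SDK v1 client, created per task on the TaskManager.
        sqs = AmazonSQSClientBuilder.defaultClient();
    }

    @Override
    public void invoke(String value, Context context) {
        sqs.sendMessage(queueUrl, value);
    }

    @Override
    public void close() {
        // Release the SDK client's resources when the job is cancelled,
        // so nothing keeps a reference to the user-code classloader.
        if (sqs != null) {
            sqs.shutdown();
        }
    }
}
{code}

As far as I understand, shutdown() releases the client's underlying HTTP connection manager; whether the SDK's shared background threads (e.g. the idle connection reaper) also need an explicit shutdown is something I still need to check.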

If you have any other ideas of places I can look, give me a shout.

Tom

> Memory Leak causes Metaspace OOM error on repeated job submission
> -----------------------------------------------------------------
>
>                 Key: FLINK-16142
>                 URL: https://issues.apache.org/jira/browse/FLINK-16142
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.10.0
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>             Fix For: 1.10.1, 1.11.0
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our use-case exactly (RocksDB state backend running in a containerised cluster). Unfortunately, it seems like there is a memory leak somewhere in the job submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
> at java.lang.ClassLoader.defineClass1(Native Method)
> at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
> at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
> at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
> at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
> at org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
> at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
> at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
> at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
> at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
> at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
> at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
> at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
> at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the text above is OPERATOR_NAME, where I removed some internal specifics of our system.)
> This will reliably happen on a fresh cluster after submitting and cancelling our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom


