Posted to user@flink.apache.org by Hao Sun <ha...@zendesk.com> on 2017/11/15 17:35:56 UTC

org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

Hi team, I am looking into some memory/GC issues in my Flink setup. I am
running Flink 1.3.2 in Docker for my development environment and using
Kubernetes for production.
I see that instances of org.apache.flink.runtime.io.network.NetworkEnvironment
are increasing dramatically and are not GC-ed very well in my application.
My simple app collects Kafka events, transforms the information, and logs
the results.

Is this expected? I am new to Java memory analysis and not sure what is
actually wrong.

[image: image.png]
[image: image.png]
[image: image.png]
[image: image.png]

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

Posted by Piotr Nowojski <pi...@data-artisans.com>.
Hi,

If the TM is not responding, check the TM logs for long gaps. There are three main possible reasons for such gaps:

1. The machine is swapping. Configure your machine/processes so that the machine never swaps (best to disable swap altogether).
2. Long full-GC pauses. Analyse those either by printing GC logs or by attaching a profiler to the JVM.
3. Network issues, but these usually shouldn’t cause gaps in the logs.
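
A minimal sketch of the log check described above, assuming each TM log line starts with a log4j-style timestamp such as 2017-11-16 17:48:00,123. The class name LogGapFinder, the timestamp pattern, and the 30-second threshold are illustrative assumptions, not part of Flink; adjust them to your logging configuration.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.ArrayList;
import java.util.List;

final class LogGapFinder {
    // Assumed timestamp layout at the start of each log line.
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss,SSS");
    private static final int TS_LENGTH = 23;

    /** Describes every pause between consecutive timestamped lines longer than the threshold. */
    static List<String> findGaps(List<String> lines, Duration threshold) {
        List<String> gaps = new ArrayList<>();
        LocalDateTime previous = null;
        for (String line : lines) {
            if (line.length() < TS_LENGTH) {
                continue; // too short to carry a timestamp
            }
            LocalDateTime current;
            try {
                current = LocalDateTime.parse(line.substring(0, TS_LENGTH), TS);
            } catch (DateTimeParseException e) {
                continue; // e.g. a stack-trace line without a timestamp
            }
            if (previous != null) {
                Duration gap = Duration.between(previous, current);
                if (gap.compareTo(threshold) > 0) {
                    gaps.add(previous + " -> " + current + " (" + gap.getSeconds() + " s)");
                }
            }
            previous = current;
        }
        return gaps;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to a TaskManager log file
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        findGaps(lines, Duration.ofSeconds(30)).forEach(System.out::println);
    }
}
```

A gap reported here only says that the process logged nothing for a while; whether swapping or a long GC pause caused it still has to be confirmed with OS metrics or GC logs.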

Piotrek

> On 16 Nov 2017, at 17:48, Hao Sun <ha...@zendesk.com> wrote:
> 
> Sorry, by "killed" I mean that the JM lost the TM. The TM instance is still running inside Kubernetes, but it is not responding to any requests, probably due to high load. On the JM side, the JM lost heartbeat tracking of the TM, so it marked the TM as dead.
> 
> By the „volume“ of Kafka topics I mean the volume of messages for a topic, e.g. 10000 msg/sec. I have not checked the size of the messages yet.
> But overall, as you suggested, I think I need more tuning of my TM params so that it can maintain a reasonable load. I am not sure which params to look at, but I will do my research first.
> 
> Thanks as always for your help, Stefan.


Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

Posted by Hao Sun <ha...@zendesk.com>.
Sorry, by "killed" I mean that the JM lost the TM. The TM instance is still
running inside Kubernetes, but it is not responding to any requests,
probably due to high load. On the JM side, the JM lost heartbeat tracking of
the TM, so it marked the TM as dead.

By the „volume“ of Kafka topics I mean the volume of messages for a topic,
e.g. 10000 msg/sec. I have not checked the size of the messages yet.
But overall, as you suggested, I think I need more tuning of my TM params
so that it can maintain a reasonable load. I am not sure which params to
look at, but I will do my research first.

Thanks as always for your help, Stefan.

On Thu, Nov 16, 2017 at 8:27 AM Stefan Richter <s....@data-artisans.com>
wrote:

> Hi,
>
>> In addition to your comments, what are the items retained by
>> NetworkEnvironment? They seem to grow indefinitely; do they ever shrink?
>
> Mostly the network buffers, which should be ok. They are always recycled
> and should not be released until the network environment is GCed.
>
>> I think there is a GC issue because my task manager is killed somehow
>> after a job run. The duration correlates with the volume of Kafka topics:
>> with more volume, the TM dies more quickly. Do you have any tips to debug it?
>
> What killed your task manager? For example, do you see a
> java.lang.OutOfMemoryError, or is the process killed by the OS’s OOM killer?
> In the case of the OOM killer, you might need to grant more process memory
> or reduce the memory that you have configured for Flink, so that it stays
> below the configured threshold that would kill the process. What exactly do
> you mean by „volume“ of Kafka topics?
>
> To debug, I suggest that you first figure out why the process is killed;
> maybe your thresholds are simply too low and the consumption can exceed
> them with your Flink configuration. Then you should figure out what is
> actually growing more than you expect, e.g. is the problem triggered by
> heap space or native memory? Depending on the answer, heap dumps, for
> example, could help to spot the problematic objects.
>
> Best,
> Stefan
>

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

Posted by Stefan Richter <s....@data-artisans.com>.
Hi,
> In addition to your comments, what are the items retained by NetworkEnvironment? They seem to grow indefinitely; do they ever shrink?
> 

Mostly the network buffers, which should be ok. They are always recycled and should not be released until the network environment is GCed.

> I think there is a GC issue because my task manager is killed somehow after a job run. The duration correlates with the volume of Kafka topics: with more volume, the TM dies more quickly. Do you have any tips to debug it?
> 

What killed your task manager? For example, do you see a java.lang.OutOfMemoryError, or is the process killed by the OS’s OOM killer? In the case of the OOM killer, you might need to grant more process memory or reduce the memory that you have configured for Flink, so that it stays below the configured threshold that would kill the process. What exactly do you mean by „volume“ of Kafka topics?

To debug, I suggest that you first figure out why the process is killed; maybe your thresholds are simply too low and the consumption can exceed them with your Flink configuration. Then you should figure out what is actually growing more than you expect, e.g. is the problem triggered by heap space or native memory? Depending on the answer, heap dumps, for example, could help to spot the problematic objects.
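
One way to start answering the heap-versus-native question is to read the JVM's own memory MXBeans. The sketch below is only an illustration: MemorySnapshot is a hypothetical helper, it reports on the JVM it runs in (for a TaskManager you would read the same beans over JMX), and memory allocated by native libraries outside the JVM's buffer pools does not show up here.

```java
import java.lang.management.BufferPoolMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryMXBean;
import java.util.LinkedHashMap;
import java.util.Map;

final class MemorySnapshot {

    /** Collects the currently used bytes per memory area of this JVM. */
    static Map<String, Long> usedBytes() {
        Map<String, Long> used = new LinkedHashMap<>();
        MemoryMXBean memory = ManagementFactory.getMemoryMXBean();
        used.put("heap", memory.getHeapMemoryUsage().getUsed());
        used.put("non-heap", memory.getNonHeapMemoryUsage().getUsed());
        // Buffer pools cover NIO direct and mapped buffers, which live outside the heap.
        for (BufferPoolMXBean pool : ManagementFactory.getPlatformMXBeans(BufferPoolMXBean.class)) {
            used.put("buffer-pool:" + pool.getName(), pool.getMemoryUsed());
        }
        return used;
    }

    public static void main(String[] args) {
        usedBytes().forEach((area, bytes) ->
                System.out.printf("%-30s %,d bytes%n", area, bytes));
    }
}
```

If the heap figure stays flat after GCs while the container's resident memory keeps climbing, the growth is more likely outside the heap, and a heap dump alone will not show it.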

Best,
Stefan

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

Posted by Hao Sun <ha...@zendesk.com>.
Thanks a lot! This is very helpful.
In addition to your comments, what are the items retained by
NetworkEnvironment? They seem to grow indefinitely; do they ever shrink?

I think there is a GC issue because my task manager is killed somehow after
a job run. The duration correlates with the volume of Kafka topics: with
more volume, the TM dies more quickly. Do you have any tips to debug it?

On Thu, Nov 16, 2017, 01:35 Stefan Richter <s....@data-artisans.com>
wrote:

> Hi,
>
> I cannot spot anything in your screenshots that indicates a leak. Maybe
> you are misinterpreting the numbers? In your heap dump, there is only a
> single instance of org.apache.flink.runtime.io.network.NetworkEnvironment,
> and it retains about 400,000,000 bytes from being GCed because it holds
> references to the network buffers. This is perfectly normal: the buffer
> pool is part of this object, so for as long as it lives, the referenced
> buffers should not be GCed, and the current size of all your buffers is
> around 400 million bytes.
>
> Your heap space is also not growing without bounds; it always goes down
> after a GC is performed. Looks fine to me.
>
> Last, I think the G1_Young_Generation number is a counter of how many
> GC cycles have been performed, and the time is a cumulative sum. So
> naturally, those values always increase.
>
> Best,
> Stefan

Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

Posted by Stefan Richter <s....@data-artisans.com>.
Hi,

I cannot spot anything in your screenshots that indicates a leak. Maybe you are misinterpreting the numbers? In your heap dump, there is only a single instance of org.apache.flink.runtime.io.network.NetworkEnvironment, and it retains about 400,000,000 bytes from being GCed because it holds references to the network buffers. This is perfectly normal: the buffer pool is part of this object, so for as long as it lives, the referenced buffers should not be GCed, and the current size of all your buffers is around 400 million bytes.

Your heap space is also not growing without bounds; it always goes down after a GC is performed. Looks fine to me.

Last, I think the G1_Young_Generation number is a counter of how many GC cycles have been performed, and the time is a cumulative sum. So naturally, those values always increase.
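
To illustrate why these values always increase: the JVM exposes GC statistics through GarbageCollectorMXBean, whose collection count and collection time are cumulative since JVM start. A rough sketch, with GcDeltaSampler as a hypothetical name and a one-second interval chosen arbitrarily, that samples the totals and prints the increase per interval, which is the number worth watching:

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

final class GcDeltaSampler {

    /** Total number of collections across all collectors since JVM start. */
    static long totalCount() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionCount()); // -1 means "undefined"
        }
        return total;
    }

    /** Approximate accumulated collection time in milliseconds since JVM start. */
    static long totalTimeMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            total += Math.max(0, gc.getCollectionTime()); // -1 means "undefined"
        }
        return total;
    }

    public static void main(String[] args) throws InterruptedException {
        long lastCount = totalCount();
        long lastTime = totalTimeMillis();
        for (int i = 0; i < 5; i++) {
            Thread.sleep(1000);
            long count = totalCount();
            long time = totalTimeMillis();
            // The totals only ever grow; the per-interval difference shows GC pressure.
            System.out.printf("collections: +%d, GC time: +%d ms%n",
                    count - lastCount, time - lastTime);
            lastCount = count;
            lastTime = time;
        }
    }
}
```

A steadily rising total is expected for any running JVM; a rising delta per interval, or GC time taking up most of each interval, would be the sign of real trouble.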

Best,
Stefan



Re: org.apache.flink.runtime.io.network.NetworkEnvironment causing memory leak?

Posted by Hao Sun <ha...@zendesk.com>.
FYI, this is why I think there is a memory leak somewhere: G1_Young_Gen kept
growing and the time spent kept increasing.

[image: image.png]

