Posted to user@flink.apache.org by Tamir Sagi <Ta...@niceactimize.com> on 2021/02/28 13:24:31 UTC

Suspected classloader leak in Flink 1.11.1

Hey all,

We are encountering memory issues on a Flink client and task managers, which I would like to raise here.

We are running Flink as a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with the Flink client from a Spring Boot application (using RestClusterClient).

As jobs are submitted and run, one after another, we see the metaspace (max size 1GB) keep increasing, along with a linear, though more moderate, increase in heap memory. We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, we saw many instances of Flink's ChildFirstClassLoader (roughly as many as the number of jobs that were run), and therefore many instances of the same class, each loaded by a different class loader instance (as shown in the attached screenshot). The same applies to the Flink task manager memory.

We would expect to see a single class loader instance, so we suspect the increase is caused by class loaders not being cleaned up.

Does anyone have insights into this issue, or ideas on how to proceed with the investigation?


Flink Client application (VisualVM)








[Screenshot: VisualVM heap view listing many com.fasterxml.jackson.databind.PropertyMetadata instances, each with a shallow/retained size of ~120 bytes, held by numerous distinct org.apache.flink.util.ChildFirstClassLoader instances (e.g. #3, #8, #12, #17, #23, #31, #34, #36, #41, #49, #59, #60, #70, #79, #82, #84, #92).]

We have tried different GCs, but the results were the same.


Task Manager


Total size: 4GB

Metaspace: 1GB

Off-heap: 512MB


Screenshot from the task manager: 612MB are occupied and not being released.



We used the jcmd tool and attached 3 files:

  1.  Threads print
  2.  VM.metaspace output
  3.  VM.classloader

In addition, we have tried calling GC manually, but it did not change much.

Thank you






Re: Suspected classloader leak in Flink 1.11.1

Posted by Chesnay Schepler <ch...@apache.org>.
I'd suggest taking a heap dump and investigating what is referencing 
these classloaders; chances are that some thread isn't being cleaned up.
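
For what it's worth, a minimal sketch of capturing such a heap dump programmatically from inside the client JVM (the output path is a placeholder; "jmap -dump:live,format=b,file=client.hprof <pid>" from the command line is the usual alternative):

import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

public class HeapDumper {
    public static void main(String[] args) throws Exception {
        HotSpotDiagnosticMXBean diagnostic =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // live = true keeps only objects reachable from GC roots, which is
        // what matters when chasing references to the ChildFirstClassLoaders.
        diagnostic.dumpHeap("/tmp/flink-client.hprof", true);
    }
}

Opening the resulting .hprof in MAT or VisualVM and listing the GC roots of the ChildFirstClassLoader instances should show which thread or static field is holding them.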

On 2/28/2021 3:46 PM, Kezhu Wang wrote:
> Hi Tamir,
>
> You could check 
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code 
> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code> for 
> known class loading issues.
>
> Besides this, I think GC.class_histogram(even filtered) could help us 
> listing suspected objects.
>
>
> Best,
> Kezhu Wang
>
>
> On February 28, 2021 at 21:25:07, Tamir Sagi 
> (tamir.sagi@niceactimize.com <ma...@niceactimize.com>) wrote:
>
>>
>> Hey all,
>>
>> We are encountering memory issues on a Flink client and task 
>> managers, which I would like to raise here.
>>
>> we are running Flink on a session cluster (version 1.11.1) on 
>> Kubernetes, submitting batch jobs with Flink client on Spring boot 
>> application (using RestClusterClient).
>>
>> When jobs are being submitted and running, one after another, We see 
>> that the metaspace memory(with max size of  1GB)  keeps increasing, 
>> as well as linear increase in the heap memory (though it's a more 
>> moderate increase). We do see GC working on the heap and releasing 
>> some resources.
>>
>> By analyzing the memory of the client Java application with profiling 
>> tools, We saw that there are many instances of Flink's 
>> ChildFirstClassLoader (perhaps as the number of jobs which were 
>> running), and therefore many instances of the same class, each from a 
>> different instance of the Class Loader (as shown in the attached 
>> screenshot). Similarly, to the Flink task manager memory.
>>
>> We would expect to see one instance of Class Loader. Therefore, We 
>> suspect that the reason for the increase is Class Loaders not being 
>> cleaned.
>>
>> Does anyone have some insights about this issue, or ideas how to 
>> proceed the investigation?
>>
>>
>> *Flink Client application (VisualVm)*
>>
>>
>>
>> [Screenshot: VisualVM heap view listing many com.fasterxml.jackson.databind.PropertyMetadata instances, each ~120 bytes retained, held by numerous distinct org.apache.flink.util.ChildFirstClassLoader instances.]
>>
>> We have used different GCs but same results.
>>
>>
>> _*Task Manager*_
>>
>>
>> Total Size 4GB
>>
>> metaspace 1GB
>>
>> Off heap 512mb
>>
>>
>> Screenshot form Task manager, 612MB are occupied and not being released.
>>
>>
>> We used jcmd tool and attached 3 files
>>
>>  1. Threads print
>>  2. VM.metaspace output
>>  3. VM.classloader
>>
>> In addition, we have tried calling GC manually, but it did not change 
>> much.
>>
>> Thank you
>>
>>
>>
>>
>>


Re: Suspected classloader leak in Flink 1.11.1

Posted by Kezhu Wang <ke...@gmail.com>.
Hi all,

@Chesnay is right, there is no code execution coupling between the client and
the task manager.

But before a job is submitted to the Flink cluster, the client needs to go
through several steps to build a job graph for submission.

These steps can include:
* Construct user functions.
* Construct runtime stream operators if necessary.
* Other possibly unrelated steps.

The constructed functions/operators are only *opened* in the Flink
cluster, not in the client.

There are no cleanup operations for these functions/operators on the client
side. If you consume resources while constructing these
functions/operators, you will probably leak those resources on the
client side.

In your case, these resource-consuming operations could be:
* Registering the `com.amazonaws.metrics.MetricAdmin` MBean, directly or indirectly.
* Starting the `IdleConnectionReaper`, directly or indirectly.


For resource cleanup on the task manager side,
`RuntimeContext.registerUserCodeClassLoaderReleaseHookIfAbsent`
could also be useful for global resource cleanup such as MBean
un-registration.
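
A minimal sketch of registering such a hook from a rich function's open() (the function and hook names are made up for illustration, and the exact cleanup calls depend on which AWS resources you actually create; the method takes a hook name plus a Runnable, and the "IfAbsent" part means repeated open() calls should not add duplicates):

import com.amazonaws.http.IdleConnectionReaper;
import org.apache.flink.api.common.functions.RichFlatMapFunction;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.util.Collector;

// Illustrative function; the release hook runs when the user-code class loader
// is released, which is the place to stop SDK threads and unregister MBeans.
public class AwsCleanupFlatMap extends RichFlatMapFunction<String, String> {

    @Override
    public void open(Configuration parameters) {
        getRuntimeContext().registerUserCodeClassLoaderReleaseHookIfAbsent(
                "aws-sdk-cleanup",
                () -> {
                    // Stop the AWS SDK background thread that would otherwise
                    // keep referencing the user-code class loader.
                    IdleConnectionReaper.shutdown();
                    // Un-registering the com.amazonaws.metrics MBean would go here too.
                });
    }

    @Override
    public void flatMap(String value, Collector<String> out) {
        out.collect(value);
    }
}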

Besides this, I observed two additional symptoms which might be useful:

* The "kafka-producer-network-thread" (loaded through the AppClassLoader) is
still running.
* The `MetricAdmin` MBean and `IdleConnectionReaper` are also loaded by the
`PluginClassLoader`.

> shutdown method always returns false, the instance is null

Outside `PackagedProgramUtils.createJobGraph`, the class loader is your
application class loader, while the leaking resources are created inside
`createJobGraph` through a `ChildFirstClassLoader`.
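
For context, a rough sketch of that client-side submission path (the host, port, jar path, and entry-point class below are placeholders): the job graph is built from the packaged jar, under a ChildFirstClassLoader, and only then shipped to the session cluster with RestClusterClient:

import org.apache.flink.api.common.JobID;
import org.apache.flink.client.deployment.StandaloneClusterId;
import org.apache.flink.client.program.PackagedProgram;
import org.apache.flink.client.program.PackagedProgramUtils;
import org.apache.flink.client.program.rest.RestClusterClient;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.configuration.JobManagerOptions;
import org.apache.flink.configuration.RestOptions;
import org.apache.flink.runtime.jobgraph.JobGraph;

import java.io.File;

public class BatchJobSubmitter {
    public static void main(String[] args) throws Exception {
        Configuration config = new Configuration();
        config.setString(JobManagerOptions.ADDRESS, "flink-jobmanager"); // placeholder host
        config.setInteger(RestOptions.PORT, 8081);

        // The user jar's main() runs through a ChildFirstClassLoader while the
        // job graph is built; anything it allocates stays in this client JVM.
        PackagedProgram program = PackagedProgram.newBuilder()
                .setJarFile(new File("/path/to/batch-app.jar"))      // placeholder path
                .setEntryPointClassName("com.example.BatchApp")      // placeholder class
                .build();

        JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, config, 1, false);

        try (RestClusterClient<StandaloneClusterId> client =
                     new RestClusterClient<>(config, StandaloneClusterId.getInstance())) {
            JobID jobId = client.submitJob(jobGraph).get();
            System.out.println("Submitted job " + jobId);
        }
    }
}

Anything that jar's main() starts while the graph is built (SDK reaper threads, MBeans) outlives the ChildFirstClassLoader unless it is shut down explicitly.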


Best,
Kezhu Wang

On March 3, 2021 at 02:33:58, Chesnay Schepler (chesnay@apache.org) wrote:

The client and TaskManager are not coupled in any way. The client
serializes individual functions that are transmitted to the task managers,
where they are deserialized and run.
Hence, if your functions rely on any library that needs cleanup then you
must add this to the respective function, likely by extending the
RichFunction variants, to ensure this cleanup is executed on the task
manager.
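
As an illustration of that advice, a minimal sketch of a rich function that creates its library resources in open() and releases them in close(), so the cleanup runs on the task manager; the S3 client and bucket name are only examples:

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import org.apache.flink.api.common.functions.RichMapFunction;
import org.apache.flink.configuration.Configuration;

// Illustrative only: the point is that open()/close() run on the task manager,
// so resources created in open() are also released there.
public class EnrichWithS3 extends RichMapFunction<String, String> {

    private transient AmazonS3 s3;

    @Override
    public void open(Configuration parameters) {
        s3 = AmazonS3ClientBuilder.defaultClient();
    }

    @Override
    public String map(String key) {
        return s3.getObjectAsString("my-bucket", key); // placeholder bucket
    }

    @Override
    public void close() {
        if (s3 != null) {
            s3.shutdown(); // release the SDK's connections/threads with the function
        }
    }
}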

On 3/2/2021 4:52 PM, Tamir Sagi wrote:

Thank you Kezhu and Chesnay,

The code I provided is a minimal example showing what is executed as part
of the batch alongside the Flink client app. You were right that it's not
consistent with the heap dump (which was taken in a dev environment).

We run multiple integration tests (one job per test) against a Flink session
cluster running on Kubernetes, with 2 task managers and a single job manager.
The jobs are submitted via the Flink client app, which runs on top of a
Spring Boot application alongside Kafka.

I suspected that the IdleConnectionReaper is the root cause of some sort of
leak (in the Flink client app), so I tried to manually shut down the
IdleConnectionReaper once the job finished,
by calling *"com.amazonaws.http.IdleConnectionReaper.shutdown()"*, which is
suggested as a workaround.
Ref: https://forums.aws.amazon.com/thread.jspa?messageID=500552#500552

It did not help much; the memory has not been released (the shutdown method
always returns false, the instance is null). Ref:
https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/http/IdleConnectionReaper.java#L148-L157

In the batch code, I added a close method which closes the connections to the
AWS clients once the operation finishes. It did not help either, as the memory
keeps growing gradually.

We came across the following setting
https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/#object-reuse-enabled

Any more ideas based on the heap dump / flight recording (task manager)?
Is it correct that the Flink client & task manager are strongly coupled?

Thanks,
Tamir.

------------------------------
*From:* Kezhu Wang <ke...@gmail.com> <ke...@gmail.com>
*Sent:* Monday, March 1, 2021 7:54 PM
*To:* Tamir Sagi <Ta...@niceactimize.com> <Ta...@niceactimize.com>;
user@flink.apache.org <us...@flink.apache.org> <us...@flink.apache.org>;
Chesnay Schepler <ch...@apache.org> <ch...@apache.org>
*Subject:* Re: Suspected classloader leak in Flink 1.11.1


*EXTERNAL EMAIL*


Hi Chesnay,

Thanks for giving a hand and solving this.

I guess `FlatMapXSightMsgProcessor` is a minimal reproducible version, while
the heap dump could have been taken from a near-production environment.


Best,
Kezhu Wang

On March 2, 2021 at 01:00:52, Chesnay Schepler (chesnay@apache.org) wrote:

the java-sdk-connection-reaper thread and amazon's JMX integration are
causing the leak.


What strikes me as odd is that I see some dynamodb classes being referenced
in the child classloaders, but I don't see where they could come from based
on the application that you provided us with.


Could you clarify how exactly you depend on Amazon dependencies?
(connectors, filesystems, _*other stuff*_)

On 3/1/2021 5:24 PM, Tamir Sagi wrote:

Hey,

I'd expect that what happens in a single execution will repeat itself in N
executions.

I ran an entire cycle of jobs (28 jobs).
Once it finished:

   - Memory had grown to 1GB
   - I called GC ~100 times using the "jcmd 1 GC.run" command, which did not
   help much.

Prior to running the tests I started flight recording using "jcmd 1 JFR.start";
I stopped it after calling GC ~100 times.
The following figure shows the graphs from "recording.jfr" in VisualVM.



and Metaspace(top right)


docker stats command filtered to relevant Task manager container


Following files of task-manager are attached :

   - task-manager-VM.metaspace - taken via "jcmd 1 VM.metaspace"
   - task-manager-gc-class-histogram.txt via "jcmd 1 GC.class_histogram"


Task manager heap dump is ~100MB,
here is a summary:



*Flink client app metric(taken from Lens):*




We see a tight coupling between the task manager app and the Flink client app,
as the batch job runs on the client side (via reflection).
What happens with class loaders in that case?

We also noticed many logs in the task manager related to
PoolingHttpClientConnectionManager


and IdleConnectionReaper  InterruptedException



On the *client app* we noticed many instances of that thread (from the heap dump)



We uploaded 2 heap dumps and the task-manager flight recording file to Google
Drive:

   1. task-manager-heap-dump.hprof
   2. java_flink_client.hprof.
   3. task-manager-recording.jfr


Link
https://drive.google.com/drive/folders/1J9wjTmejBroyAIIp7680AsIEQ05Ewdfl?usp=sharing

Thanks,
Tamir.
------------------------------
*From:* Kezhu Wang <ke...@gmail.com> <ke...@gmail.com>
*Sent:* Monday, March 1, 2021 2:21 PM
*To:* user@flink.apache.org <us...@flink.apache.org> <us...@flink.apache.org>;
Tamir Sagi <Ta...@niceactimize.com> <Ta...@niceactimize.com>
*Subject:* Re: Suspected classloader leak in Flink 1.11.1


*EXTERNAL EMAIL*


Hi Tamir,

> The histogram has been taken from Task Manager using jcmd tool.

From that histogram, I guess there is no classloader leaking.

> A simple batch job with single operation . The memory bumps to ~600MB
(after single execution). once the job is finished the memory never freed.

It could be just new code paths and hence new classes. A single execution
does not tell us much. Multiple or dozens of runs with memory continuously
increasing among them, and not decreasing afterwards, would be a symptom of leaking.

You could use the following steps to verify whether there are issues in your
task managers:
* Run the job N times, the more the better.
* Wait until all jobs are finished or stopped.
* Manually trigger GC a dozen times.
* Take a class histogram and check whether there are any
“ChildFirstClassLoader”.
* If there are roughly N “ChildFirstClassLoader” in the histogram, then we can
be pretty sure there is class loader leaking.
* If there are no (or only a few) “ChildFirstClassLoader” but memory is still
higher than a threshold, say ~600MB or more, it could be another kind of leak.


In all leaking cases, a heap dump, as @Chesnay said, would be more helpful
since it can tell us which object/class/thread keeps memory from being freed.


Besides this, I saw an attachment “task-manager-thrad-print.txt” in the initial
mail; when and where did you capture it? The task manager? Was any job
still running?


Best,
Kezhu Wang

On March 1, 2021 at 18:38:55, Tamir Sagi (tamir.sagi@niceactimize.com)
wrote:

Hey Kezhu,

The histogram has been taken from Task Manager using jcmd tool.

By means of batch job, do you means that you compile job graph from DataSet
API in client side and then submit it through RestClient ? I am not
familiar with data set api, usually, there is no `ChildFirstClassLoader`
creation in client side for job graph building. Could you depict a pseudo
for this or did you create `ChildFirstClassLoader` yourself ?

Yes, we have a batch app. We read a file from S3 using the hadoop-s3 plugin,
map that data into a DataSet, and then just print it (a rough sketch follows below).
Then we have a Flink client application which saves the batch app jar.
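
A rough sketch of that shape of job (the S3 path is a placeholder, and FlatMapXSightMsgProcessor is assumed to map String to String):

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class BatchApp {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Read from S3 through the hadoop-s3 filesystem plugin, apply the
        // custom flat map, and print the result (print() triggers execution).
        DataSet<String> lines = env.readTextFile("s3://my-bucket/input/"); // placeholder
        DataSet<String> processed = lines.flatMap(new FlatMapXSightMsgProcessor());
        processed.print();
    }
}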

Attached the following files:

   1. batch-source-code.java - main function
   2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
   3. flink-job-submit.txt - The code to submit the job


I've noticed 2 behaviors:

   1. Idle - Once the task manager application boots up, memory consumption
   gradually grows from ~360MB to ~430MB (within a few minutes). I see logs
   where many classes are loaded into the JVM and never released. (Might be
   normal behavior.)
   2. Batch job execution - A simple batch job with a single operation. Memory
   bumps to ~600MB (after a single execution), and once the job finishes the
   memory is never freed. I executed GC several times (manually +
   programmatically); it did not help (although some classes were unloaded).
   The memory keeps growing as more batch jobs are executed.

Attached are task manager logs from yesterday after a single batch
execution (memory grew to 612MB and was never freed):

   1. taskmgr.txt - Task manager logs (2021-02-28T16:06:05,983 is the timestamp
   when the job was submitted)
   2. gc-class-historgram.txt
   3. thread-print.txt
   4. vm-class-loader-stats.txt
   5. vm-class-loaders.txt
   6. heap_info.txt


The same behavior has been observed in the Flink client application. Once the
batch job is executed, the memory increases gradually and does not get cleaned
up afterwards. (We observed many ChildFirstClassLoader instances.)


Thank you
Tamir.

------------------------------
*From:* Kezhu Wang <ke...@gmail.com>
*Sent:* Sunday, February 28, 2021 6:57 PM
*To:* Tamir Sagi <Ta...@niceactimize.com>
*Subject:* Re: Suspected classloader leak in Flink 1.11.1


*EXTERNAL EMAIL*


Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> we are running Flink on a session cluster (version 1.11.1) on Kubernetes,
submitting batch jobs with Flink client on Spring boot application (using
RestClusterClient).

> By analyzing the memory of the client Java application with profiling
tools, We saw that there are many instances of Flink's
ChildFirstClassLoader (perhaps as the number of jobs which were running),
and therefore many instances of the same class, each from a different
instance of the Class Loader (as shown in the attached screenshot).
Similarly, to the Flink task manager memory.

By batch job, do you mean that you compile the job graph from the DataSet
API on the client side and then submit it through RestClient? I am not
familiar with the DataSet API; usually there is no `ChildFirstClassLoader`
creation on the client side during job graph building. Could you sketch
the code for this, or did you create a `ChildFirstClassLoader` yourself?


> In addition, we have tried calling GC manually, but it did not change
much.

It might take several GC runs to collect a class loader instance.


Best,
Kezhu Wang


On February 28, 2021 at 23:27:38, Tamir Sagi (tamir.sagi@niceactimize.com)
wrote:

Hey Kezhu,
Thanks for fast responding,

I read that link a few days ago. Today I ran a simple batch job with a
single operation (using the hadoop s3 plugin), but the same behavior was
observed.

Attached is GC.class_histogram (not filtered).


Tamir.



------------------------------
*From:* Kezhu Wang <ke...@gmail.com>
*Sent:* Sunday, February 28, 2021 4:46 PM
*To:* user@flink.apache.org <us...@flink.apache.org>; Tamir Sagi <
Tamir.Sagi@niceactimize.com>
*Subject:* Re: Suspected classloader leak in Flink 1.11.1


*EXTERNAL EMAIL*


Hi Tamir,

You could check
https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code
for
known class loading issues.

Besides this, I think GC.class_histogram(even filtered) could help us
listing suspected objects.


Best,
Kezhu Wang


On February 28, 2021 at 21:25:07, Tamir Sagi (tamir.sagi@niceactimize.com)
wrote:


Hey all,

We are encountering memory issues on a Flink client and task managers,
which I would like to raise here.

we are running Flink on a session cluster (version 1.11.1) on Kubernetes,
submitting batch jobs with Flink client on Spring boot application (using
RestClusterClient).

When jobs are being submitted and running, one after another, We see that
the metaspace memory(with max size of  1GB)  keeps increasing, as well as
linear increase in the heap memory (though it's a more moderate increase).
We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling
tools, We saw that there are many instances of Flink's
ChildFirstClassLoader (perhaps as the number of jobs which were running),
and therefore many instances of the same class, each from a different
instance of the Class Loader (as shown in the attached screenshot).
Similarly, to the Flink task manager memory.

We would expect to see one instance of Class Loader. Therefore, We suspect
that the reason for the increase is Class Loaders not being cleaned.

Does anyone have some insights about this issue, or ideas how to proceed
the investigation?


*Flink Client application (VisualVm)*







[image: VisualVM heap view listing many com.fasterxml.jackson.databind.PropertyMetadata instances, each ~120 bytes retained, held by numerous distinct org.apache.flink.util.ChildFirstClassLoader instances.]

We have used different GCs but same results.


*Task Manager*


Total Size 4GB

metaspace 1GB

Off heap 512mb


Screenshot form Task manager, 612MB are occupied and not being released.

We used jcmd tool and attached 3 files


   1. Threads print
   2. VM.metaspace output
   3. VM.classloader

In addition, we have tried calling GC manually, but it did not change much.

Thank you





Re: Suspected classloader leak in Flink 1.11.1

Posted by Chesnay Schepler <ch...@apache.org>.
The client and TaskManager are not coupled in any way. The client 
serializes individual functions that are transmitted to the task 
managers, deserialized and run.
Hence, if your functions rely on any library that needs cleanup then you 
must add this to the respective function, likely by extending the 
RichFunction variants, to ensure this cleanup is executed on the task 
manager.

On 3/2/2021 4:52 PM, Tamir Sagi wrote:
> Thank you Kezhu and Chesnay,
>
> The code I provided you is a minimal code to show what is executed as 
> part of the batch along the Flink client app. you were right that it's 
> not consistent with the heap dump. (which has been taken in dev env)
>
> We run multiple Integration tests(Job per test) against Flink session 
> cluster(Running on Kubernetes). with 2 task manager, single job 
> manager. The jobs are submitted via Flink Client app which runs on top 
> of spring boot application along Kafka.
>
> I suspected that IdleConnectionReaper is the root cause to some sort 
> of leak(In the flink client app) however, I was trying to manually 
> shutdown the IdleConnectionReaper Once the job finished.
> via calling*"**com.amazonaws.http.IdleConnectionReaper.shutdown()"*. - 
> which is suggested as a workaround.
> Ref: https://forums.aws.amazon.com/thread.jspa?messageID=500552#500552 
> <https://forums.aws.amazon.com/thread.jspa?messageID=500552#500552>
>
> It did not affect much. the memory has not been released .(shutdown 
> method always returns false, the instance is null) ref: 
> https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/http/IdleConnectionReaper.java#L148-L157 
> <https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/http/IdleConnectionReaper.java#L148-L157>
>
> On the batch code , I added close method which close the connections 
> to aws clients once the operation finished. it did not help either, as 
> the memory keep growing gradually.
>
> We came across the following setting
> https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/#object-reuse-enabled 
> <https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/#object-reuse-enabled>
>
> any more ideas based on the heap dump/Flight recording(Task manager)?
> Is it correct that the Flink client & Task manager are strongly coupled?
>
> Thanks,
> Tamir.
>
> ------------------------------------------------------------------------
> *From:* Kezhu Wang <ke...@gmail.com>
> *Sent:* Monday, March 1, 2021 7:54 PM
> *To:* Tamir Sagi <Ta...@niceactimize.com>; user@flink.apache.org 
> <us...@flink.apache.org>; Chesnay Schepler <ch...@apache.org>
> *Subject:* Re: Suspected classloader leak in Flink 1.11.1
>
> *EXTERNAL EMAIL*
>
>
>
> Hi Chesnay,
>
> Thanks for give a hand and solve this.
>
> I guess `FlatMapXSightMsgProcessor` is a minimal reproducible version 
> while the heap dump could be taken from near production environment.
>
>
> Best,
> Kezhu Wang
>
> On March 2, 2021 at 01:00:52, Chesnay Schepler (chesnay@apache.org 
> <ma...@apache.org>) wrote:
>
>> the java-sdk-connection-reaper thread and amazon's JMX integration 
>> are causing the leak.
>>
>>
>> What strikes me as odd is that I see some dynamodb classes being 
>> referenced in the child classloaders, but I don't see where they 
>> could come from based on the application that you provided us with.
>>
>>
>> Could you clarify how exactly you depend on Amazon dependencies? 
>> (connectors, filesystems, _/other stuff/_)
>>
>>
>> On 3/1/2021 5:24 PM, Tamir Sagi wrote:
>>> Hey,
>>>
>>> I'd expect that what happens in a single execution will repeat 
>>> itself in N executions.
>>>
>>> I ran entire cycle of jobs(28 jobs).
>>> Once it finished:
>>>
>>>   * Memory has grown to 1GB
>>>   * I called GC ~100 times using "jcmd 1 GC.run" command.  Which did
>>>     not affect much.
>>>
>>> Prior running the tests I started Flight recording using "jcmd 1 
>>> JFR.start ", I stopped it after calling GC ~100 times.
>>> Following figure shows the graphs from "recording.jfr" in Virtual Vm.
>>>
>>>
>>>
>>> and Metaspace(top right)
>>>
>>>
>>> docker stats command filtered to relevant Task manager container
>>>
>>>
>>> Following files of task-manager are attached :
>>>
>>>   * task-manager-VM.metaspace - taken via "jcmd 1 VM.metaspace"
>>>   * task-manager-gc-class-histogram.txt via "jcmd 1 GC.class_histogram"
>>>
>>>
>>> Task manager heap dump is ~100MB,
>>> here is a summary:
>>>
>>>
>>>
>>> *Flink client app metric(taken from Lens):*
>>>
>>>
>>>
>>>
>>> We see a tight coupling between Task Manager app and Flink Client 
>>> app, as the batch job runs on the client side(via reflection)
>>> what happens with class loaders in that case?
>>>
>>> we also noticed many logs in Task manager related to 
>>> PoolingHttpClientConnectionManager
>>>
>>>
>>> and IdleConnectionReaper  InterruptedException
>>>
>>>
>>>
>>> On *Client app* we noticed many instances of that thread (From heap 
>>> dump)
>>>
>>>
>>>
>>> We uploaded 2 heap dumps and task-manager flight recording file into 
>>> Google drive
>>>
>>>  1. task-manager-heap-dump.hprof
>>>  2. java_flink_client.hprof.
>>>  3. task-manager-recording.jfr
>>>
>>> Link 
>>> https://drive.google.com/drive/folders/1J9wjTmejBroyAIIp7680AsIEQ05Ewdfl?usp=sharing 
>>> <https://drive.google.com/drive/folders/1J9wjTmejBroyAIIp7680AsIEQ05Ewdfl?usp=sharing>
>>>
>>> Thanks,
>>> Tamir.
>>> ------------------------------------------------------------------------
>>> *From:* Kezhu Wang <ke...@gmail.com> <ma...@gmail.com>
>>> *Sent:* Monday, March 1, 2021 2:21 PM
>>> *To:* user@flink.apache.org <ma...@flink.apache.org> 
>>> <us...@flink.apache.org> <ma...@flink.apache.org>; Tamir Sagi 
>>> <Ta...@niceactimize.com> <ma...@niceactimize.com>
>>> *Subject:* Re: Suspected classloader leak in Flink 1.11.1
>>>
>>> *EXTERNAL EMAIL*
>>>
>>>
>>>
>>> Hi Tamir,
>>>
>>> > The histogram has been taken from Task Manager using jcmd tool.
>>>
>>> >From that histogram, I guest there is no classloader leaking.
>>>
>>> > A simple batch job with single operation . The memory bumps to 
>>> ~600MB (after single execution). once the job is finished the memory 
>>> never freed.
>>>
>>> It could be just new code paths and hence new classes. A single 
>>> execution does not making much sense. Multiple or dozen runs and 
>>> continuous memory increasing among them and not decreasing after 
>>> could be symptom of leaking.
>>>
>>> You could use following steps to verify whether there are issues in 
>>> your task managers:
>>> * Run job N times, the more the better.
>>> * Wait all jobs finished or stopped.
>>> * Trigger manually gc dozen times.
>>> * Take class histogram and check whether there are any 
>>> “ChildFirstClassLoader”.
>>> * If there are roughly N “ChildFirstClassLoader” in histogram, then 
>>> we can pretty sure there might be class loader leaking.
>>> * If there is no “ChildFirstClassLoader” or few but memory still 
>>> higher than a threshold, say ~600MB or more, it could be other shape 
>>> of leaking.
>>>
>>>
>>> In all leaking case, an heap dump as @Chesnay said could be more 
>>> helpful since it can tell us which object/class/thread keep memory 
>>> from freeing.
>>>
>>>
>>> Besides this, I saw an attachment “task-manager-thrad-print.txt” in 
>>> initial mail, when and where did you capture ? Task Manager ? Is 
>>> there any job still running ?
>>>
>>>
>>> Best,
>>> Kezhu Wang
>>>
>>> On March 1, 2021 at 18:38:55, Tamir Sagi 
>>> (tamir.sagi@niceactimize.com <ma...@niceactimize.com>) 
>>> wrote:
>>>
>>>> Hey Kezhu,
>>>>
>>>> The histogram has been taken from Task Manager using jcmd tool.
>>>>
>>>>     By means of batch job, do you means that you compile job graph
>>>>     from DataSet API in client side and then submit it through
>>>>     RestClient ? I am not familiar with data set api, usually,
>>>>     there is no `ChildFirstClassLoader` creation in client side for
>>>>     job graph building. Could you depict a pseudo for this or did
>>>>     you create `ChildFirstClassLoader` yourself ?
>>>>
>>>> Yes, we have a batch app. we read a file from s3 using 
>>>> hadoop-s3-plugin, then map that data into DataSet then just print it.
>>>> Then we have a Flink Client application which saves the batch app jar.
>>>>
>>>> Attached the following files:
>>>>
>>>>  1. batch-source-code.java - main function
>>>>  2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
>>>>  3. flink-job-submit.txt - The code to submit the job
>>>>
>>>>
>>>> I've noticed 2 behaviors:
>>>>
>>>>  1. Idle - Once Task manager application boots up the memory
>>>>     consumption gradually grows, starting ~360MB to ~430MB(within
>>>>     few minutes) I see logs where many classes are loaded into JVM
>>>>     and never get released.(Might be a normal behavior)
>>>>  2. Batch Job Execution - A simple batch job with single operation
>>>>     . The memory bumps to ~600MB (after single execution). once the
>>>>     job is finished the memory never freed. I executed GC several
>>>>     times (Manually + Programmatically) it did not help(although
>>>>     some classes were unloaded). the memory keeps growing while
>>>>     more batch jobs are executed.
>>>>
>>>> Attached Task Manager Logs from yesterday after a single batch 
>>>> execution.(Memory grew to 612MB and never freed)
>>>>
>>>>  1. taskmgr.txt - Task manager logs (2021-02-28T16:06:05,983 is
>>>>     the timestamp when the job was submitted)
>>>>  2. gc-class-historgram.txt
>>>>  3. thread-print.txt
>>>>  4. vm-class-loader-stats.txt
>>>>  5. vm-class-loaders.txt
>>>>  6. heap_info.txt
>>>>
>>>>
>>>> Same behavior has been observed in Flink Client application. Once 
>>>> the batch job is executed the memory is increasedgradually and does 
>>>> not get cleaned afterwards.(We observed many ChildFirstClassLoader 
>>>> instances)
>>>>
>>>>
>>>> Thank you
>>>> Tamir.
>>>>
>>>> ------------------------------------------------------------------------
>>>> *From:* Kezhu Wang <kezhuw@gmail.com <ma...@gmail.com>>
>>>> *Sent:* Sunday, February 28, 2021 6:57 PM
>>>> *To:* Tamir Sagi <Tamir.Sagi@niceactimize.com 
>>>> <ma...@niceactimize.com>>
>>>> *Subject:* Re: Suspected classloader leak in Flink 1.11.1
>>>>
>>>> *EXTERNAL EMAIL*
>>>>
>>>>
>>>>
>>>> HI Tamir,
>>>>
>>>> The histogram has no instance of `ChildFirstClassLoader`.
>>>>
>>>> > we are running Flink on a session cluster (version 1.11.1) on 
>>>> Kubernetes, submitting batch jobs with Flink client on Spring boot 
>>>> application (using RestClusterClient).
>>>>
>>>> > By analyzing the memory of the client Java application with 
>>>> profiling tools, We saw that there are many instances of Flink's 
>>>> ChildFirstClassLoader (perhaps as the number of jobs which were 
>>>> running), and therefore many instances of the same class, each from 
>>>> a different instance of the Class Loader (as shown in the attached 
>>>> screenshot). Similarly, to the Flink task manager memory.
>>>>
>>>> By means of batch job, do you means that you compile job graph from 
>>>> DataSet API in client side and then submit it through RestClient ? 
>>>> I am not familiar with data set api, usually, there is no 
>>>> `ChildFirstClassLoader` creation in client side for job graph 
>>>> building. Could you depict a pseudo for this or did you create 
>>>> `ChildFirstClassLoader` yourself ?
>>>>
>>>>
>>>> > In addition, we have tried calling GC manually, but it did not 
>>>> change much.
>>>>
>>>> It might take serval runs to collect a class loader instance.
>>>>
>>>>
>>>> Best,
>>>> Kezhu Wang
>>>>
>>>>
>>>> On February 28, 2021 at 23:27:38, Tamir Sagi 
>>>> (tamir.sagi@niceactimize.com <ma...@niceactimize.com>) 
>>>> wrote:
>>>>
>>>>> Hey Kezhu,
>>>>> Thanks for fast responding,
>>>>>
>>>>> I've read that link few days ago.; Today I ran a simple batch job 
>>>>> with single operation (using hadoop s3 plugin) but the same 
>>>>> behavior was observed.
>>>>>
>>>>> attached GC.class_histogram (Not filtered)
>>>>>
>>>>>
>>>>> Tamir.
>>>>>
>>>>>
>>>>>
>>>>> ------------------------------------------------------------------------
>>>>> *From:* Kezhu Wang <kezhuw@gmail.com <ma...@gmail.com>>
>>>>> *Sent:* Sunday, February 28, 2021 4:46 PM
>>>>> *To:* user@flink.apache.org <ma...@flink.apache.org> 
>>>>> <user@flink.apache.org <ma...@flink.apache.org>>; Tamir Sagi 
>>>>> <Tamir.Sagi@niceactimize.com <ma...@niceactimize.com>>
>>>>> *Subject:* Re: Suspected classloader leak in Flink 1.11.1
>>>>>
>>>>> *EXTERNAL EMAIL*
>>>>>
>>>>>
>>>>>
>>>>> Hi Tamir,
>>>>>
>>>>> You could check 
>>>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code 
>>>>> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code> for 
>>>>> known class loading issues.
>>>>>
>>>>> Besides this, I think GC.class_histogram(even filtered) could help 
>>>>> us listing suspected objects.
>>>>>
>>>>>
>>>>> Best,
>>>>> Kezhu Wang
>>>>>
>>>>>
>>>>> On February 28, 2021 at 21:25:07, Tamir Sagi 
>>>>> (tamir.sagi@niceactimize.com <ma...@niceactimize.com>) 
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> Hey all,
>>>>>>
>>>>>> We are encountering memory issues on a Flink client and task 
>>>>>> managers, which I would like to raise here.
>>>>>>
>>>>>> we are running Flink on a session cluster (version 1.11.1) on 
>>>>>> Kubernetes, submitting batch jobs with Flink client on Spring 
>>>>>> boot application (using RestClusterClient).
>>>>>>
>>>>>> When jobs are being submitted and running, one after another, We 
>>>>>> see that the metaspace memory(with max size of 1GB)  keeps 
>>>>>> increasing, as well as linear increase in the heap memory (though 
>>>>>> it's a more moderate increase). We do see GC working on the heap 
>>>>>> and releasing some resources.
>>>>>>
>>>>>> By analyzing the memory of the client Java application with 
>>>>>> profiling tools, We saw that there are many instances of Flink's 
>>>>>> ChildFirstClassLoader (perhaps as the number of jobs which were 
>>>>>> running), and therefore many instances of the same class, each 
>>>>>> from a different instance of the Class Loader (as shown in the 
>>>>>> attached screenshot). Similarly, to the Flink task manager memory.
>>>>>>
>>>>>> We would expect to see one instance of Class Loader. Therefore, 
>>>>>> We suspect that the reason for the increase is Class Loaders not 
>>>>>> being cleaned.
>>>>>>
>>>>>> Does anyone have some insights about this issue, or ideas how to 
>>>>>> proceed the investigation?
>>>>>>
>>>>>>
>>>>>> *Flink Client application (VisualVm)*
>>>>>>
>>>>>>
>>>>>>
>>>>>> [Screenshot: VisualVM heap view listing many com.fasterxml.jackson.databind.PropertyMetadata instances, each ~120 bytes retained, held by numerous distinct org.apache.flink.util.ChildFirstClassLoader instances.]
>>>>>>
>>>>>> We have used different GCs but same results.
>>>>>>
>>>>>>
>>>>>> _*Task Manager*_
>>>>>>
>>>>>>
>>>>>> Total Size 4GB
>>>>>>
>>>>>> metaspace 1GB
>>>>>>
>>>>>> Off heap 512mb
>>>>>>
>>>>>>
>>>>>> Screenshot form Task manager, 612MB are occupied and not being 
>>>>>> released.
>>>>>>
>>>>>>
>>>>>> We used jcmd tool and attached 3 files
>>>>>>
>>>>>>  1. Threads print
>>>>>>  2. VM.metaspace output
>>>>>>  3. VM.classloader
>>>>>>
>>>>>> In addition, we have tried calling GC manually, but it did not 
>>>>>> change much.
>>>>>>
>>>>>> Thank you
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>


Re: Suspected classloader leak in Flink 1.11.1

Posted by Tamir Sagi <Ta...@niceactimize.com>.
Thank you Kezhu and Chesnay,

The code I provided is a minimal example showing what is executed as part of the batch alongside the Flink client app. You were right that it's not consistent with the heap dump (which was taken in a dev environment).

We run multiple integration tests (one job per test) against a Flink session cluster running on Kubernetes, with 2 task managers and a single job manager. The jobs are submitted via the Flink client app, which runs on top of a Spring Boot application alongside Kafka.

I suspected that the IdleConnectionReaper is the root cause of some sort of leak (in the Flink client app), so I tried to manually shut down the IdleConnectionReaper once the job finished,
by calling "com.amazonaws.http.IdleConnectionReaper.shutdown()", which is suggested as a workaround.
Ref: https://forums.aws.amazon.com/thread.jspa?messageID=500552#500552

It did not help much; the memory has not been released (the shutdown method always returns false, the instance is null). Ref: https://github.com/aws/aws-sdk-java/blob/master/aws-java-sdk-core/src/main/java/com/amazonaws/http/IdleConnectionReaper.java#L148-L157

In the batch code, I added a close method which closes the connections to the AWS clients once the operation finishes. It did not help either, as the memory keeps growing gradually.

We came across the following setting
https://ci.apache.org/projects/flink/flink-docs-stable/dev/batch/#object-reuse-enabled

Any more ideas based on the heap dump / flight recording (task manager)?
Is it correct that the Flink client & task manager are strongly coupled?

Thanks,
Tamir.

________________________________
From: Kezhu Wang <ke...@gmail.com>
Sent: Monday, March 1, 2021 7:54 PM
To: Tamir Sagi <Ta...@niceactimize.com>; user@flink.apache.org <us...@flink.apache.org>; Chesnay Schepler <ch...@apache.org>
Subject: Re: Suspected classloader leak in Flink 1.11.1


EXTERNAL EMAIL


Hi Chesnay,

Thanks for give a hand and solve this.

I guess `FlatMapXSightMsgProcessor` is a minimal reproducible version while the heap dump could be taken from near production environment.


Best,
Kezhu Wang


On March 2, 2021 at 01:00:52, Chesnay Schepler (chesnay@apache.org<ma...@apache.org>) wrote:

the java-sdk-connection-reaper thread and amazon's JMX integration are causing the leak.


What strikes me as odd is that I see some dynamodb classes being referenced in the child classloaders, but I don't see where they could come from based on the application that you provided us with.


Could you clarify how exactly you depend on Amazon dependencies? (connectors, filesystems, _other stuff_)

On 3/1/2021 5:24 PM, Tamir Sagi wrote:
Hey,

I'd expect that what happens in a single execution will repeat itself in N executions.

I ran entire cycle of jobs(28 jobs).
Once it finished:

  *   Memory has grown to 1GB
  *   I called GC ~100 times using "jcmd 1 GC.run" command.  Which did not affect much.

Prior running the tests I started Flight recording using "jcmd 1 JFR.start ", I stopped it after calling GC ~100 times.
Following figure shows the graphs from "recording.jfr" in Virtual Vm.


and Metaspace(top right)

docker stats command filtered to relevant Task manager container

Following files of task-manager are attached :

  *   task-manager-VM.metaspace - taken via "jcmd 1 VM.metaspace"
  *   task-manager-gc-class-histogram.txt via "jcmd 1 GC.class_histogram"

Task manager heap dump is ~100MB,
here is a summary:


Flink client app metric(taken from Lens):



We see a tight coupling between Task Manager app and Flink Client app, as the batch job runs on the client side(via reflection)
what happens with class loaders in that case?

we also noticed many logs in Task manager related to PoolingHttpClientConnectionManager

and IdleConnectionReaper  InterruptedException


On Client app we noticed many instances of that thread (From heap dump)


We uploaded 2 heap dumps and task-manager flight recording file into Google drive

  1.  task-manager-heap-dump.hprof
  2.  java_flink_client.hprof.
  3.  task-manager-recording.jfr


Link https://drive.google.com/drive/folders/1J9wjTmejBroyAIIp7680AsIEQ05Ewdfl?usp=sharing

Thanks,
Tamir.
________________________________
From: Kezhu Wang <ke...@gmail.com>
Sent: Monday, March 1, 2021 2:21 PM
To: user@flink.apache.org<ma...@flink.apache.org> <us...@flink.apache.org>; Tamir Sagi <Ta...@niceactimize.com>
Subject: Re: Suspected classloader leak in Flink 1.11.1


EXTERNAL EMAIL


Hi Tamir,

> The histogram has been taken from Task Manager using jcmd tool.

From that histogram, I guess there is no classloader leaking.

> A simple batch job with single operation . The memory bumps to ~600MB (after single execution). once the job is finished the memory never freed.

It could be just new code paths and hence new classes. A single execution does not tell much. Multiple or dozens of runs, with memory continuously increasing among them and not decreasing afterwards, could be a symptom of leaking.

You could use the following steps to verify whether there are issues in your task managers (a small JVM-side sketch for the GC/class-count part follows after the list):
* Run the job N times, the more the better.
* Wait for all jobs to finish or stop.
* Manually trigger GC a dozen times.
* Take a class histogram and check whether there are any “ChildFirstClassLoader” entries.
* If there are roughly N “ChildFirstClassLoader” instances in the histogram, then we can be pretty sure there is class loader leaking.
* If there are no (or few) “ChildFirstClassLoader” instances but memory is still higher than a threshold, say ~600MB or more, it could be another shape of leaking.
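
If you prefer to watch this from inside the JVM rather than via jcmd, the standard management beans expose the class counts; a minimal, generic sketch (not Flink-specific, just the plain JDK API):

    import java.lang.management.ClassLoadingMXBean;
    import java.lang.management.ManagementFactory;

    public class ClassCountProbe {
        public static void main(String[] args) throws InterruptedException {
            ClassLoadingMXBean classLoading = ManagementFactory.getClassLoadingMXBean();
            for (int i = 0; i < 12; i++) {
                System.gc(); // only a request; the JVM may ignore it
                Thread.sleep(1000L);
                System.out.printf("loaded=%d unloaded=%d totalLoaded=%d%n",
                        classLoading.getLoadedClassCount(),
                        classLoading.getUnloadedClassCount(),
                        classLoading.getTotalLoadedClassCount());
            }
        }
    }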


In all leaking cases, a heap dump, as @Chesnay said, could be more helpful since it can tell us which object/class/thread keeps memory from being freed.


Besides this, I saw an attachment “task-manager-thrad-print.txt” in the initial mail; when and where did you capture it? On the task manager? Was there any job still running?


Best,
Kezhu Wang


On March 1, 2021 at 18:38:55, Tamir Sagi (tamir.sagi@niceactimize.com<ma...@niceactimize.com>) wrote:

Hey Kezhu,

The histogram has been taken from Task Manager using jcmd tool.

By means of batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually, there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you depict a pseudo-code for this, or did you create `ChildFirstClassLoader` yourself?
Yes, we have a batch app. We read a file from S3 using the hadoop-s3 plugin, then map that data into a DataSet and just print it.
Then we have a Flink client application which holds the batch app jar.
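
To give a rough picture, the batch program is along these lines (a minimal sketch, not the attached batch-source-code.java; the S3 path is illustrative and the inline flatMap stands in for our FlatMapXSightMsgProcessor):

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.java.DataSet;
    import org.apache.flink.api.java.ExecutionEnvironment;

    public class BatchJob {
        public static void main(String[] args) throws Exception {
            ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

            // read the input file through the s3 filesystem plugin
            DataSet<String> lines = env.readTextFile("s3://my-bucket/input/data.txt");

            // a single operation on the data, then print it (print() triggers execution)
            lines.flatMap((FlatMapFunction<String, String>) (line, out) -> out.collect(line.trim()))
                 .returns(String.class)
                 .print();
        }
    }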

Attached the following files:

  1.  batch-source-code.java - main function
  2.  FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
  3.  flink-job-submit.txt - The code to submit the job (a rough sketch of this submission path follows below)
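
A minimal sketch of that submission path, assuming the Flink 1.11 client APIs (PackagedProgram, PackagedProgramUtils, RestClusterClient); the jar path, REST address and parallelism are illustrative, and this is not the attached flink-job-submit.txt:

    import java.io.File;

    import org.apache.flink.api.common.JobID;
    import org.apache.flink.client.deployment.StandaloneClusterId;
    import org.apache.flink.client.program.PackagedProgram;
    import org.apache.flink.client.program.PackagedProgramUtils;
    import org.apache.flink.client.program.rest.RestClusterClient;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.configuration.RestOptions;
    import org.apache.flink.runtime.jobgraph.JobGraph;

    public class SubmitBatchJob {
        public static void main(String[] args) throws Exception {
            Configuration config = new Configuration();
            config.setString(RestOptions.ADDRESS, "flink-jobmanager"); // session cluster REST endpoint
            config.setInteger(RestOptions.PORT, 8081);

            // build the JobGraph from the batch app jar on the client side
            PackagedProgram program = PackagedProgram.newBuilder()
                    .setJarFile(new File("/opt/jobs/batch-app.jar"))
                    .build();
            JobGraph jobGraph = PackagedProgramUtils.createJobGraph(program, config, 1, false);

            // submit it to the session cluster through the REST API
            RestClusterClient<StandaloneClusterId> client =
                    new RestClusterClient<>(config, StandaloneClusterId.getInstance());
            try {
                JobID jobId = client.submitJob(jobGraph).get();
                System.out.println("Submitted job " + jobId);
            } finally {
                client.close();
            }
        }
    }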

I've noticed 2 behaviors:

  1.  Idle - Once the task manager application boots up, the memory consumption gradually grows from ~360MB to ~430MB (within a few minutes). I see logs where many classes are loaded into the JVM and never get released. (Might be normal behavior.)
  2.  Batch job execution - A simple batch job with a single operation. The memory bumps to ~600MB (after a single execution); once the job is finished the memory is never freed. I executed GC several times (manually + programmatically) and it did not help (although some classes were unloaded). The memory keeps growing while more batch jobs are executed.

Attached are the task manager logs from yesterday after a single batch execution (memory grew to 612MB and was never freed):

  1.  taskmgr.txt - Task manager logs (2021-02-28T16:06:05,983 is the timestamp when the job was submitted)
  2.  gc-class-historgram.txt
  3.  thread-print.txt
  4.  vm-class-loader-stats.txt
  5.  vm-class-loaders.txt
  6.  heap_info.txt

The same behavior has been observed in the Flink client application. Once the batch job is executed, the memory increases gradually and does not get cleaned afterwards. (We observed many ChildFirstClassLoader instances.)


Thank you
Tamir.

________________________________
From: Kezhu Wang <ke...@gmail.com>>
Sent: Sunday, February 28, 2021 6:57 PM
To: Tamir Sagi <Ta...@niceactimize.com>>
Subject: Re: Suspected classloader leak in Flink 1.11.1


EXTERNAL EMAIL


Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> we are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with Flink client on Spring boot application (using RestClusterClient).

> By analyzing the memory of the client Java application with profiling tools, We saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the Class Loader (as shown in the attached screenshot). Similarly, to the Flink task manager memory.

By means of batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually, there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you depict a pseudo-code for this, or did you create `ChildFirstClassLoader` yourself?


> In addition, we have tried calling GC manually, but it did not change much.

It might take several runs to collect a class loader instance.


Best,
Kezhu Wang



On February 28, 2021 at 23:27:38, Tamir Sagi (tamir.sagi@niceactimize.com<ma...@niceactimize.com>) wrote:

Hey Kezhu,
Thanks for fast responding,

I've read that link a few days ago. Today I ran a simple batch job with a single operation (using the hadoop s3 plugin), but the same behavior was observed.

Attached is the GC.class_histogram output (not filtered).


Tamir.



________________________________
From: Kezhu Wang <ke...@gmail.com>>
Sent: Sunday, February 28, 2021 4:46 PM
To: user@flink.apache.org<ma...@flink.apache.org> <us...@flink.apache.org>>; Tamir Sagi <Ta...@niceactimize.com>>
Subject: Re: Suspected classloader leak in Flink 1.11.1


EXTERNAL EMAIL


Hi Tamir,

You could check https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code for known class loading issues.

Besides this, I think GC.class_histogram (even filtered) could help us list suspected objects.


Best,
Kezhu Wang



On February 28, 2021 at 21:25:07, Tamir Sagi (tamir.sagi@niceactimize.com<ma...@niceactimize.com>) wrote:

Hey all,

We are encountering memory issues on a Flink client and task managers, which I would like to raise here.

we are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with Flink client on Spring boot application (using RestClusterClient).

When jobs are being submitted and running, one after another, We see that the metaspace memory(with max size of  1GB)  keeps increasing, as well as linear increase in the heap memory (though it's a more moderate increase). We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, We saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the Class Loader (as shown in the attached screenshot). Similarly, to the Flink task manager memory.

We would expect to see one instance of Class Loader. Therefore, We suspect that the reason for the increase is Class Loaders not being cleaned.

Does anyone have some insights about this issue, or ideas how to proceed the investigation?


Flink Client application (VisualVm)








[Screenshot: heap dump table listing many org.apache.flink.util.ChildFirstClassLoader instances alongside com.fasterxml.jackson.databind.PropertyMetadata classes, each with a retained size of ~120 bytes]

We have used different GCs but same results.


Task Manager


Total Size 4GB

metaspace 1GB

Off heap 512mb


Screenshot from the task manager: 612MB are occupied and not being released.



We used the jcmd tool and attached 3 files:

  1.  Threads print
  2.  VM.metaspace output
  3.  VM.classloader

In addition, we have tried calling GC manually, but it did not change much.

Thank you





Re: Suspected classloader leak in Flink 1.11.1

Posted by Kezhu Wang <ke...@gmail.com>.
Hi Chesnay,

Thanks for giving a hand and solving this.

I guess `FlatMapXSightMsgProcessor` is a minimal reproducible version, while
the heap dump could be taken from a near-production environment.


Best,
Kezhu Wang

Re: Suspected classloader leak in Flink 1.11.1

Posted by Chesnay Schepler <ch...@apache.org>.
the java-sdk-connection-reaper thread and amazon's JMX integration are 
causing the leak.


What strikes me as odd is that I see some dynamodb classes being 
referenced in the child classloaders, but I don't see where they could 
come from based on the application that you provided us with.


Could you clarify how exactly you depend on Amazon dependencies? 
(connectors, filesystems, _/other stuff/_)


On 3/1/2021 5:24 PM, Tamir Sagi wrote:
> Hey,
>
> I'd expect that what happens in a single execution will repeat itself 
> in N executions.
>
> I ran entire cycle of jobs(28 jobs).
> Once it finished:
>
>   * Memory has grown to 1GB
>   * I called GC ~100 times using "jcmd 1 GC.run" command.  Which did
>     not affect much.
>
> Prior running the tests I started Flight recording using "jcmd 1 
> JFR.start ", I stopped it after calling GC ~100 times.
> Following figure shows the graphs from "recording.jfr" in Virtual Vm.
>
>
>
> and Metaspace(top right)
>
>
> docker stats command filtered to relevant Task manager container
>
>
> Following files of task-manager are attached :
>
>   * task-manager-VM.metaspace - taken via "jcmd 1 VM.metaspace"
>   * task-manager-gc-class-histogram.txt via "jcmd 1 GC.class_histogram"
>
>
> Task manager heap dump is ~100MB,
> here is a summary:
>
>
>
> *Flink client app metric(taken from Lens):*
>
>
>
>
> We see a tight coupling between Task Manager app and Flink Client app, 
> as the batch job runs on the client side(via reflection)
> what happens with class loaders in that case?
>
> we also noticed many logs in Task manager related to 
> PoolingHttpClientConnectionManager
>
>
> and IdleConnectionReaper  InterruptedException
>
>
>
> On *Client app* we noticed many instances of that thread (From heap dump)
>
>
>
> We uploaded 2 heap dumps and task-manager flight recording file into 
> Google drive
>
>  1. task-manager-heap-dump.hprof
>  2. java_flink_client.hprof.
>  3. task-manager-recording.jfr
>
> Link 
> https://drive.google.com/drive/folders/1J9wjTmejBroyAIIp7680AsIEQ05Ewdfl?usp=sharing 
> <https://drive.google.com/drive/folders/1J9wjTmejBroyAIIp7680AsIEQ05Ewdfl?usp=sharing>
>
> Thanks,
> Tamir.
> ------------------------------------------------------------------------
> *From:* Kezhu Wang <ke...@gmail.com>
> *Sent:* Monday, March 1, 2021 2:21 PM
> *To:* user@flink.apache.org <us...@flink.apache.org>; Tamir Sagi 
> <Ta...@niceactimize.com>
> *Subject:* Re: Suspected classloader leak in Flink 1.11.1
>
> *EXTERNAL EMAIL*
>
>
>
> Hi Tamir,
>
> > The histogram has been taken from Task Manager using jcmd tool.
>
> From that histogram, I guest there is no classloader leaking.
>
> > A simple batch job with single operation . The memory bumps to 
> ~600MB (after single execution). once the job is finished the memory 
> never freed.
>
> It could be just new code paths and hence new classes. A single 
> execution does not making much sense. Multiple or dozen runs and 
> continuous memory increasing among them and not decreasing after could 
> be symptom of leaking.
>
> You could use following steps to verify whether there are issues in 
> your task managers:
> * Run job N times, the more the better.
> * Wait all jobs finished or stopped.
> * Trigger manually gc dozen times.
> * Take class histogram and check whether there are any 
> “ChildFirstClassLoader”.
> * If there are roughly N “ChildFirstClassLoader” in histogram, then we 
> can pretty sure there might be class loader leaking.
> * If there is no “ChildFirstClassLoader” or few but memory still 
> higher than a threshold, say ~600MB or more, it could be other shape 
> of leaking.
>
>
> In all leaking case, an heap dump as @Chesnay said could be more 
> helpful since it can tell us which object/class/thread keep memory 
> from freeing.
>
>
> Besides this, I saw an attachment “task-manager-thrad-print.txt” in 
> initial mail, when and where did you capture ? Task Manager ? Is there 
> any job still running ?
>
>
> Best,
> Kezhu Wang
>
> On March 1, 2021 at 18:38:55, Tamir Sagi (tamir.sagi@niceactimize.com 
> <ma...@niceactimize.com>) wrote:
>
>> Hey Kezhu,
>>
>> The histogram has been taken from Task Manager using jcmd tool.
>>
>>     By means of batch job, do you means that you compile job graph
>>     from DataSet API in client side and then submit it through
>>     RestClient ? I am not familiar with data set api, usually, there
>>     is no `ChildFirstClassLoader` creation in client side for job
>>     graph building. Could you depict a pseudo for this or did you
>>     create `ChildFirstClassLoader` yourself ?
>>
>> Yes, we have a batch app. we read a file from s3 using 
>> hadoop-s3-plugin, then map that data into DataSet then just print it.
>> Then we have a Flink Client application which saves the batch app jar.
>>
>> Attached the following files:
>>
>>  1. batch-source-code.java - main function
>>  2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
>>  3. flink-job-submit.txt - The code to submit the job
>>
>>
>> I've noticed 2 behaviors:
>>
>>  1. Idle - Once Task manager application boots up the memory
>>     consumption gradually grows, starting ~360MB to ~430MB(within few
>>     minutes) I see logs where many classes are loaded into JVM and
>>     never get released.(Might be a normal behavior)
>>  2. Batch Job Execution - A simple batch job with single operation .
>>     The memory bumps to ~600MB (after single execution). once the job
>>     is finished the memory never freed. I executed GC several times
>>     (Manually + Programmatically) it did not help(although some
>>     classes were unloaded). the memory keeps growing while more batch
>>     jobs are executed.
>>
>> Attached Task Manager Logs from yesterday after a single batch 
>> execution.(Memory grew to 612MB and never freed)
>>
>>  1. taskmgr.txt - Task manager logs (2021-02-28T16:06:05,983 is
>>     the timestamp when the job was submitted)
>>  2. gc-class-historgram.txt
>>  3. thread-print.txt
>>  4. vm-class-loader-stats.txt
>>  5. vm-class-loaders.txt
>>  6. heap_info.txt
>>
>>
>> Same behavior has been observed in Flink Client application. Once the 
>> batch job is executed the memory is increasedgradually and does not 
>> get cleaned afterwards.(We observed many ChildFirstClassLoader instances)
>>
>>
>> Thank you
>> Tamir.
>>
>> ------------------------------------------------------------------------
>> *From:* Kezhu Wang <kezhuw@gmail.com <ma...@gmail.com>>
>> *Sent:* Sunday, February 28, 2021 6:57 PM
>> *To:* Tamir Sagi <Tamir.Sagi@niceactimize.com 
>> <ma...@niceactimize.com>>
>> *Subject:* Re: Suspected classloader leak in Flink 1.11.1
>>
>> *EXTERNAL EMAIL*
>>
>>
>>
>> HI Tamir,
>>
>> The histogram has no instance of `ChildFirstClassLoader`.
>>
>> > we are running Flink on a session cluster (version 1.11.1) on 
>> Kubernetes, submitting batch jobs with Flink client on Spring boot 
>> application (using RestClusterClient).
>>
>> > By analyzing the memory of the client Java application with 
>> profiling tools, We saw that there are many instances of Flink's 
>> ChildFirstClassLoader (perhaps as the number of jobs which were 
>> running), and therefore many instances of the same class, each from a 
>> different instance of the Class Loader (as shown in the attached 
>> screenshot). Similarly, to the Flink task manager memory.
>>
>> By means of batch job, do you means that you compile job graph from 
>> DataSet API in client side and then submit it through RestClient ? I 
>> am not familiar with data set api, usually, there is no 
>> `ChildFirstClassLoader` creation in client side for job graph 
>> building. Could you depict a pseudo for this or did you create 
>> `ChildFirstClassLoader` yourself ?
>>
>>
>> > In addition, we have tried calling GC manually, but it did not 
>> change much.
>>
>> It might take serval runs to collect a class loader instance.
>>
>>
>> Best,
>> Kezhu Wang
>>
>>
>> On February 28, 2021 at 23:27:38, Tamir Sagi 
>> (tamir.sagi@niceactimize.com <ma...@niceactimize.com>) wrote:
>>
>>> Hey Kezhu,
>>> Thanks for fast responding,
>>>
>>> I've read that link few days ago.; Today I ran a simple batch job 
>>> with single operation (using hadoop s3 plugin) but the same behavior 
>>> was observed.
>>>
>>> attached GC.class_histogram (Not filtered)
>>>
>>>
>>> Tamir.
>>>
>>>
>>>
>>> ------------------------------------------------------------------------
>>> *From:* Kezhu Wang <kezhuw@gmail.com <ma...@gmail.com>>
>>> *Sent:* Sunday, February 28, 2021 4:46 PM
>>> *To:* user@flink.apache.org <ma...@flink.apache.org> 
>>> <user@flink.apache.org <ma...@flink.apache.org>>; Tamir Sagi 
>>> <Tamir.Sagi@niceactimize.com <ma...@niceactimize.com>>
>>> *Subject:* Re: Suspected classloader leak in Flink 1.11.1
>>>
>>> *EXTERNAL EMAIL*
>>>
>>>
>>>
>>> Hi Tamir,
>>>
>>> You could check 
>>> https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code 
>>> <https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code> for 
>>> known class loading issues.
>>>
>>> Besides this, I think GC.class_histogram(even filtered) could help 
>>> us listing suspected objects.
>>>
>>>
>>> Best,
>>> Kezhu Wang
>>>
>>>
>>> On February 28, 2021 at 21:25:07, Tamir Sagi 
>>> (tamir.sagi@niceactimize.com <ma...@niceactimize.com>) 
>>> wrote:
>>>
>>>>
>>>> Hey all,
>>>>
>>>> We are encountering memory issues on a Flink client and task 
>>>> managers, which I would like to raise here.
>>>>
>>>> we are running Flink on a session cluster (version 1.11.1) on 
>>>> Kubernetes, submitting batch jobs with Flink client on Spring boot 
>>>> application (using RestClusterClient).
>>>>
>>>> When jobs are being submitted and running, one after another, We 
>>>> see that the metaspace memory(with max size of  1GB)  keeps 
>>>> increasing, as well as linear increase in the heap memory (though 
>>>> it's a more moderate increase). We do see GC working on the heap 
>>>> and releasing some resources.
>>>>
>>>> By analyzing the memory of the client Java application with 
>>>> profiling tools, We saw that there are many instances of Flink's 
>>>> ChildFirstClassLoader (perhaps as the number of jobs which were 
>>>> running), and therefore many instances of the same class, each from 
>>>> a different instance of the Class Loader (as shown in the attached 
>>>> screenshot). Similarly, to the Flink task manager memory.
>>>>
>>>> We would expect to see one instance of Class Loader. Therefore, We 
>>>> suspect that the reason for the increase is Class Loaders not being 
>>>> cleaned.
>>>>
>>>> Does anyone have some insights about this issue, or ideas how to 
>>>> proceed the investigation?
>>>>
>>>>
>>>> *Flink Client application (VisualVm)*
>>>>
>>>>
>>>>
>>>> Shallow Size com.fasterxmI.jackson.databind.PropertyMetadata 
>>>> com.fasterxmIjackson.databind.PropertyMetadata 
>>>> com.fasterxmI.jackson.databind.PropertyMetadata 
>>>> com.fasterxmIjackson.databind.PropertyMetadata 
>>>> com.fasterxmI.jackson.databind.PropertyMetadata 
>>>> com.fasterxmIjackson.databind.PropertyMetadata 
>>>> com.fasterxmI.jackson.databind.PropertyMetadata 
>>>> com.fasterxmIjackson.databind.PropertyMetadata 
>>>> com.fasterxmI.jackson.databind.PropertyMetadata 
>>>> com.fasterxmIjackson.databind.PropertyMetadata 
>>>> com.fasterxmI.jackson.databind.PropertyMetadata 
>>>> com.fasterxmIjackson.databind.PropertyMetadata 
>>>> com.fasterxmI.jackson.databind.PropertyMetadata 
>>>> com.fasterxmIjackson.databind.PropertyMetadata 
>>>> com.fasterxmI.jackson.databind.PropertyMetadata 
>>>> com.fasterxmIjackson.databind.PropertyMetadata 
>>>> com.fasterxmI.jackson.databind.PropertyMetadata 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (41) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (79) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (82) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (23) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (36) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (34) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (84) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (92) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (59) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (70) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (3) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (60) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (8) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (17) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (31) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (12) 
>>>> org.apache.fIink.utiI.ChiIdFirstCIassLoader (49) Objects 0% 0% 0% 
>>>> 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% Retained Size 120 120 120 
>>>> 120 120 120 120 120 120 120 120 120 120 120 120 120 120 0% 0% 0% 0% 
>>>> 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% z 120 z 120 z 120 z 120 z 120 z 
>>>> 120 z 120 z 120 z 120 z 120 z 120 z 120 z 120 z 120 z 120 z 120 z 
>>>> 120 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0% 0%
>>>>
>>>> We have used different GCs but same results.
>>>>
>>>>
>>>> _*Task Manager*_
>>>>
>>>>
>>>> Total Size 4GB
>>>>
>>>> metaspace 1GB
>>>>
>>>> Off heap 512mb
>>>>
>>>>
>>>> Screenshot form Task manager, 612MB are occupied and not being 
>>>> released.
>>>>
>>>>
>>>> We used jcmd tool and attached 3 files
>>>>
>>>>  1. Threads print
>>>>  2. VM.metaspace output
>>>>  3. VM.classloader
>>>>
>>>> In addition, we have tried calling GC manually, but it did not 
>>>> change much.
>>>>
>>>> Thank you
>>>>
>>>>
>>>>
>>>>
>>>> Confidentiality: This communication and any attachments are 
>>>> intended for the above-named persons only and may be confidential 
>>>> and/or legally privileged. Any opinions expressed in this 
>>>> communication are not necessarily those of NICE Actimize. If this 
>>>> communication has come to you in error you must take no action 
>>>> based on it, nor must you copy or show it to anyone; please 
>>>> delete/destroy and inform the sender by e-mail immediately.
>>>> Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
>>>> Viruses: Although we have taken steps toward ensuring that this 
>>>> e-mail and attachments are free from any virus, we advise that in 
>>>> keeping with good computing practice the recipient should ensure 
>>>> they are actually virus free.
>>>>
>>>
>>> Confidentiality: This communication and any attachments are intended 
>>> for the above-named persons only and may be confidential and/or 
>>> legally privileged. Any opinions expressed in this communication are 
>>> not necessarily those of NICE Actimize. If this communication has 
>>> come to you in error you must take no action based on it, nor must 
>>> you copy or show it to anyone; please delete/destroy and inform the 
>>> sender by e-mail immediately.
>>> Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
>>> Viruses: Although we have taken steps toward ensuring that this 
>>> e-mail and attachments are free from any virus, we advise that in 
>>> keeping with good computing practice the recipient should ensure 
>>> they are actually virus free.
>>>
>>
>> Confidentiality: This communication and any attachments are intended 
>> for the above-named persons only and may be confidential and/or 
>> legally privileged. Any opinions expressed in this communication are 
>> not necessarily those of NICE Actimize. If this communication has 
>> come to you in error you must take no action based on it, nor must 
>> you copy or show it to anyone; please delete/destroy and inform the 
>> sender by e-mail immediately.
>> Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
>> Viruses: Although we have taken steps toward ensuring that this 
>> e-mail and attachments are free from any virus, we advise that in 
>> keeping with good computing practice the recipient should ensure they 
>> are actually virus free.
>>
>
> Confidentiality: This communication and any attachments are intended 
> for the above-named persons only and may be confidential and/or 
> legally privileged. Any opinions expressed in this communication are 
> not necessarily those of NICE Actimize. If this communication has come 
> to you in error you must take no action based on it, nor must you copy 
> or show it to anyone; please delete/destroy and inform the sender by 
> e-mail immediately.
> Monitoring: NICE Actimize may monitor incoming and outgoing e-mails.
> Viruses: Although we have taken steps toward ensuring that this e-mail 
> and attachments are free from any virus, we advise that in keeping 
> with good computing practice the recipient should ensure they are 
> actually virus free.
>


Re: Suspected classloader leak in Flink 1.11.1

Posted by Tamir Sagi <Ta...@niceactimize.com>.
Hey,

I'd expect that what happens in a single execution will repeat itself in N executions.

I ran an entire cycle of jobs (28 jobs).
Once it finished:

  *   Memory has grown to 1GB
  *   I called GC ~100 times using the "jcmd 1 GC.run" command, which did not affect much.

Prior to running the tests I started a flight recording using "jcmd 1 JFR.start"; I stopped it after calling GC ~100 times.
The following figures show the graphs from "recording.jfr" in VisualVM, and the Metaspace view (top right). [screenshots attached]

The docker stats output, filtered to the relevant task manager container, is attached as a screenshot.

The following task-manager files are attached:

  *   task-manager-VM.metaspace - taken via "jcmd 1 VM.metaspace"
  *   task-manager-gc-class-histogram.txt via "jcmd 1 GC.class_histogram"

The task manager heap dump is ~100MB; a summary screenshot is attached.


Flink client app metrics (taken from Lens) - screenshot attached.


We see a tight coupling between the task manager app and the Flink client app, as the batch job runs on the client side (via reflection).
What happens with class loaders in that case?

We also noticed many logs in the task manager related to PoolingHttpClientConnectionManager, and an IdleConnectionReaper InterruptedException (screenshots attached).


On the client app we noticed many instances of that thread (from the heap dump) - screenshot attached.

We uploaded 2 heap dumps and task-manager flight recording file into Google drive

  1.  task-manager-heap-dump.hprof
  2.  java_flink_client.hprof.
  3.  task-manager-recording.jfr


Link https://drive.google.com/drive/folders/1J9wjTmejBroyAIIp7680AsIEQ05Ewdfl?usp=sharing

Thanks,
Tamir.
________________________________
From: Kezhu Wang <ke...@gmail.com>
Sent: Monday, March 1, 2021 2:21 PM
To: user@flink.apache.org <us...@flink.apache.org>; Tamir Sagi <Ta...@niceactimize.com>
Subject: Re: Suspected classloader leak in Flink 1.11.1


EXTERNAL EMAIL


Hi Tamir,

> The histogram has been taken from Task Manager using jcmd tool.

From that histogram, I guess there is no classloader leaking.

> A simple batch job with single operation . The memory bumps to ~600MB (after single execution). once the job is finished the memory never freed.

It could be just new code paths and hence new classes. A single execution does not tell much. Multiple or dozens of runs, with memory continuously increasing among them and not decreasing afterwards, could be a symptom of leaking.

You could use the following steps to verify whether there are issues in your task managers:
* Run the job N times, the more the better.
* Wait for all jobs to finish or stop.
* Manually trigger GC a dozen times.
* Take a class histogram and check whether there are any “ChildFirstClassLoader” entries.
* If there are roughly N “ChildFirstClassLoader” instances in the histogram, then we can be pretty sure there is class loader leaking.
* If there are no (or few) “ChildFirstClassLoader” instances but memory is still higher than a threshold, say ~600MB or more, it could be another shape of leaking.


In all leaking cases, a heap dump, as @Chesnay said, could be more helpful since it can tell us which object/class/thread keeps memory from being freed.


Besides this, I saw an attachment “task-manager-thrad-print.txt” in the initial mail; when and where did you capture it? On the task manager? Was there any job still running?


Best,
Kezhu Wang


On March 1, 2021 at 18:38:55, Tamir Sagi (tamir.sagi@niceactimize.com<ma...@niceactimize.com>) wrote:

Hey Kezhu,

The histogram has been taken from Task Manager using jcmd tool.

By means of batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through RestClient? I am not familiar with the DataSet API; usually, there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you depict a pseudo-code for this, or did you create `ChildFirstClassLoader` yourself?
Yes, we have a batch app. We read a file from S3 using the hadoop-s3 plugin, then map that data into a DataSet and just print it.
Then we have a Flink client application which holds the batch app jar.

Attached the following files:

  1.  batch-source-code.java - main function
  2.  FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
  3.  flink-job-submit.txt - The code to submit the job

I've noticed 2 behaviors:

  1.  Idle - Once the task manager application boots up, the memory consumption gradually grows from ~360MB to ~430MB (within a few minutes). I see logs where many classes are loaded into the JVM and never get released. (Might be normal behavior.)
  2.  Batch job execution - A simple batch job with a single operation. The memory bumps to ~600MB (after a single execution); once the job is finished the memory is never freed. I executed GC several times (manually + programmatically) and it did not help (although some classes were unloaded). The memory keeps growing while more batch jobs are executed.

Attached Task Manager Logs from yesterday after a single batch execution.(Memory grew to 612MB and never freed)

  1.  taskmgr.txt - Task manager logs (2021-02-28T16:06:05,983 is the timestamp when the job was submitted)
  2.  gc-class-historgram.txt
  3.  thread-print.txt
  4.  vm-class-loader-stats.txt
  5.  vm-class-loaders.txt
  6.  heap_info.txt

The same behavior has been observed in the Flink Client application: once the batch job is executed, the memory increases gradually and does not get cleaned up afterwards (we observed many ChildFirstClassLoader instances).


Thank you
Tamir.

________________________________
From: Kezhu Wang <ke...@gmail.com>
Sent: Sunday, February 28, 2021 6:57 PM
To: Tamir Sagi <Ta...@niceactimize.com>
Subject: Re: Suspected classloader leak in Flink 1.11.1


EXTERNAL EMAIL


Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> we are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with Flink client on Spring boot application (using RestClusterClient).

> By analyzing the memory of the client Java application with profiling tools, We saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the Class Loader (as shown in the attached screenshot). Similarly, to the Flink task manager memory.

By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through the RestClient? I am not familiar with the DataSet API; usually, there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch a pseudocode for this, or did you create a `ChildFirstClassLoader` yourself?


> In addition, we have tried calling GC manually, but it did not change much.

It might take several runs to collect a class loader instance.
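
As a generic illustration (not from the thread attachments), a weak reference makes "can this loader be collected once I drop my strong reference?" easy to check in a standalone test; the URLClassLoader below just stands in for a per-job user-code classloader:

    import java.lang.ref.WeakReference;
    import java.net.URL;
    import java.net.URLClassLoader;

    public class LoaderCollectionCheck {
        public static void main(String[] args) throws Exception {
            ClassLoader loader = new URLClassLoader(new URL[0]);   // stand-in loader
            WeakReference<ClassLoader> ref = new WeakReference<>(loader);
            loader = null;   // drop the only strong reference

            for (int i = 0; i < 10 && ref.get() != null; i++) {
                System.gc();
                Thread.sleep(100L);
            }
            System.out.println(ref.get() == null ? "collected" : "still strongly reachable");
        }
    }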


Best,
Kezhu Wang



On February 28, 2021 at 23:27:38, Tamir Sagi (tamir.sagi@niceactimize.com<ma...@niceactimize.com>) wrote:

Hey Kezhu,
Thanks for fast responding,

I've read that link a few days ago. Today I ran a simple batch job with a single operation (using the hadoop-s3 plugin), but the same behavior was observed.

Attached is the GC.class_histogram output (not filtered).


Tamir.



________________________________
From: Kezhu Wang <ke...@gmail.com>
Sent: Sunday, February 28, 2021 4:46 PM
To: user@flink.apache.org <us...@flink.apache.org>; Tamir Sagi <Ta...@niceactimize.com>
Subject: Re: Suspected classloader leak in Flink 1.11.1


EXTERNAL EMAIL


Hi Tamir,

You could check https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code for known class loading issues.
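
A typical culprit behind classes not being unloaded is user code (or a library it pulls in) leaving threads, thread locals, or static caches behind that keep a reference to the per-job ChildFirstClassLoader. Purely as an illustration of that pattern (not taken from the attached job), a function that starts its own threads must also stop them, otherwise the loader can never be unloaded:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    import org.apache.flink.api.common.functions.RichFlatMapFunction;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.util.Collector;

    public class LeakProneFunction extends RichFlatMapFunction<String, String> {

        private transient ExecutorService pool;

        @Override
        public void open(Configuration parameters) {
            pool = Executors.newFixedThreadPool(2);   // these threads reference the user-code classloader
        }

        @Override
        public void flatMap(String value, Collector<String> out) {
            out.collect(value);
        }

        @Override
        public void close() {
            pool.shutdownNow();   // without this, the threads outlive the job and the loader leaks
        }
    }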

Besides this, I think a GC.class_histogram (even filtered) could help us list suspected objects.


Best,
Kezhu Wang



On February 28, 2021 at 21:25:07, Tamir Sagi (tamir.sagi@niceactimize.com<ma...@niceactimize.com>) wrote:

Hey all,

We are encountering memory issues on a Flink client and task managers, which I would like to raise here.

we are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with Flink client on Spring boot application (using RestClusterClient).

When jobs are being submitted and running, one after another, We see that the metaspace memory(with max size of  1GB)  keeps increasing, as well as linear increase in the heap memory (though it's a more moderate increase). We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, We saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the Class Loader (as shown in the attached screenshot). Similarly, to the Flink task manager memory.

We would expect to see one instance of Class Loader. Therefore, We suspect that the reason for the increase is Class Loaders not being cleaned.

Does anyone have some insights about this issue, or ideas how to proceed the investigation?


Flink Client application (VisualVm)








[Screenshot: VisualVM class list showing many com.fasterxml.jackson.databind.PropertyMetadata classes, each loaded by a different org.apache.flink.util.ChildFirstClassLoader instance, each retaining ~120 bytes]

We have used different GCs but same results.


Task Manager


Total Size 4GB

metaspace 1GB

Off heap 512mb


Screenshot from the Task Manager: 612MB are occupied and not being released.



We used jcmd tool and attached 3 files

  1.  Threads print
  2.  VM.metaspace output
  3.  VM.classloader

In addition, we have tried calling GC manually, but it did not change much.

Thank you





Re: Suspected classloader leak in Flink 1.11.1

Posted by Kezhu Wang <ke...@gmail.com>.
Hi Tamir,

> The histogram has been taken from Task Manager using jcmd tool.

From that histogram, I guess there is no classloader leak.

> A simple batch job with a single operation. The memory bumps to ~600MB
(after a single execution). Once the job is finished the memory is never freed.

It could be just new code paths and hence new classes. A single execution
does not tell much. Multiple or dozens of runs with memory continuously
increasing among them and not decreasing afterwards could be a symptom of leaking.

You could use the following steps to verify whether there are issues in your
task managers:
* Run the job N times, the more the better.
* Wait until all jobs are finished or stopped.
* Manually trigger GC a dozen times.
* Take a class histogram and check whether there are any
“ChildFirstClassLoader” instances.
* If there are roughly N “ChildFirstClassLoader” instances in the histogram,
then we can be pretty sure there is a class loader leak.
* If there are no (or few) “ChildFirstClassLoader” instances but memory is
still higher than a threshold, say ~600MB or more, it could be another kind of leak.


In all leaking cases, a heap dump, as @Chesnay said, could be more helpful
since it can tell us which object/class/thread keeps memory from being freed.


Besides this, I saw an attachment “task-manager-thrad-print.txt” in the initial
mail. When and where did you capture it? On the Task Manager? Was there any job
still running?


Best,
Kezhu Wang

On March 1, 2021 at 18:38:55, Tamir Sagi (tamir.sagi@niceactimize.com)
wrote:

Hey Kezhu,

The histogram has been taken from Task Manager using jcmd tool.

> By batch job, do you mean that you compile the job graph from the DataSet
API on the client side and then submit it through the RestClient? I am not
familiar with the DataSet API; usually, there is no `ChildFirstClassLoader`
creation on the client side for job graph building. Could you sketch a
pseudocode for this, or did you create a `ChildFirstClassLoader` yourself?

Yes, we have a batch app. We read a file from S3 using the hadoop-s3 plugin,
map that data into a DataSet, and then just print it.
Then we have a Flink Client application which saves the batch app jar.

Attached the following files:

   1. batch-source-code.java - main function
   2. FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
   3. flink-job-submit.txt - The code to submit the job


I've noticed 2 behaviors:

   1. Idle - Once the Task Manager application boots up, the memory consumption
   gradually grows from ~360MB to ~430MB (within a few minutes). I see logs
   where many classes are loaded into the JVM and never get released (might be
   normal behavior).
   2. Batch Job Execution - A simple batch job with a single operation. The
   memory bumps to ~600MB (after a single execution), and once the job is
   finished the memory is never freed. I executed GC several times (manually +
   programmatically); it did not help (although some classes were unloaded). The
   memory keeps growing while more batch jobs are executed.

Attached are Task Manager logs from yesterday, after a single batch
execution (memory grew to 612MB and was never freed):

   1. taskmgr.txt - Task manager logs (2021-02-28T16:06:05,983 is the timestamp
   when the job was submitted)
   2. gc-class-historgram.txt
   3. thread-print.txt
   4. vm-class-loader-stats.txt
   5. vm-class-loaders.txt
   6. heap_info.txt


The same behavior has been observed in the Flink Client application: once the
batch job is executed, the memory increases gradually and does not get cleaned
up afterwards (we observed many ChildFirstClassLoader instances).


Thank you
Tamir.

------------------------------
*From:* Kezhu Wang <ke...@gmail.com>
*Sent:* Sunday, February 28, 2021 6:57 PM
*To:* Tamir Sagi <Ta...@niceactimize.com>
*Subject:* Re: Suspected classloader leak in Flink 1.11.1


*EXTERNAL EMAIL*


Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> we are running Flink on a session cluster (version 1.11.1) on Kubernetes,
submitting batch jobs with Flink client on Spring boot application (using
RestClusterClient).

> By analyzing the memory of the client Java application with profiling
tools, We saw that there are many instances of Flink's
ChildFirstClassLoader (perhaps as the number of jobs which were running),
and therefore many instances of the same class, each from a different
instance of the Class Loader (as shown in the attached screenshot).
Similarly, to the Flink task manager memory.

By batch job, do you mean that you compile the job graph from the DataSet
API on the client side and then submit it through the RestClient? I am not
familiar with the DataSet API; usually, there is no `ChildFirstClassLoader`
creation on the client side for job graph building. Could you sketch a
pseudocode for this, or did you create a `ChildFirstClassLoader` yourself?


> In addition, we have tried calling GC manually, but it did not change
much.

It might take several runs to collect a class loader instance.


Best,
Kezhu Wang


On February 28, 2021 at 23:27:38, Tamir Sagi (tamir.sagi@niceactimize.com)
wrote:

Hey Kezhu,
Thanks for fast responding,

I've read that link a few days ago. Today I ran a simple batch job with a
single operation (using the hadoop-s3 plugin), but the same behavior was
observed.

Attached is the GC.class_histogram output (not filtered).


Tamir.



------------------------------
*From:* Kezhu Wang <ke...@gmail.com>
*Sent:* Sunday, February 28, 2021 4:46 PM
*To:* user@flink.apache.org <us...@flink.apache.org>; Tamir Sagi <
Tamir.Sagi@niceactimize.com>
*Subject:* Re: Suspected classloader leak in Flink 1.11.1


*EXTERNAL EMAIL*


Hi Tamir,

You could check
https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code
for
known class loading issues.

Besides this, I think a GC.class_histogram (even filtered) could help us
list suspected objects.


Best,
Kezhu Wang


On February 28, 2021 at 21:25:07, Tamir Sagi (tamir.sagi@niceactimize.com)
wrote:


Hey all,

We are encountering memory issues on a Flink client and task managers,
which I would like to raise here.

we are running Flink on a session cluster (version 1.11.1) on Kubernetes,
submitting batch jobs with Flink client on Spring boot application (using
RestClusterClient).

When jobs are being submitted and running, one after another, We see that
the metaspace memory(with max size of  1GB)  keeps increasing, as well as
linear increase in the heap memory (though it's a more moderate increase).
We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling
tools, We saw that there are many instances of Flink's
ChildFirstClassLoader (perhaps as the number of jobs which were running),
and therefore many instances of the same class, each from a different
instance of the Class Loader (as shown in the attached screenshot).
Similarly, to the Flink task manager memory.

We would expect to see one instance of Class Loader. Therefore, We suspect
that the reason for the increase is Class Loaders not being cleaned.

Does anyone have some insights about this issue, or ideas how to proceed
the investigation?


*Flink Client application (VisualVm)*







[Screenshot: VisualVM class list showing many com.fasterxml.jackson.databind.PropertyMetadata classes, each loaded by a different org.apache.flink.util.ChildFirstClassLoader instance, each retaining ~120 bytes]

We have used different GCs but same results.


*Task Manager*


Total Size 4GB

metaspace 1GB

Off heap 512mb


Screenshot from the Task Manager: 612MB are occupied and not being released.

We used jcmd tool and attached 3 files


   1. Threads print
   2. VM.metaspace output
   3. VM.classloader

In addition, we have tried calling GC manually, but it did not change much.

Thank you





Re: Suspected classloader leak in Flink 1.11.1

Posted by Tamir Sagi <Ta...@niceactimize.com>.
Hey Kezhu,

The histogram has been taken from Task Manager using jcmd tool.

> By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through the RestClient? I am not familiar with the DataSet API; usually, there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch a pseudocode for this, or did you create a `ChildFirstClassLoader` yourself?
Yes, we have a batch app. We read a file from S3 using the hadoop-s3 plugin, map that data into a DataSet, and then just print it.
Then we have a Flink Client application which saves the batch app jar.

Attached the following files:

  1.  batch-source-code.java - main function
  2.  FlatMapXSightMsgProcessor.java - custom RichFlatMapFunction
  3.  flink-job-submit.txt - The code to submit the job

I've noticed 2 behaviors:

  1.  Idle - Once the Task Manager application boots up, the memory consumption gradually grows from ~360MB to ~430MB (within a few minutes). I see logs where many classes are loaded into the JVM and never get released (might be normal behavior).
  2.  Batch Job Execution - A simple batch job with a single operation. The memory bumps to ~600MB (after a single execution), and once the job is finished the memory is never freed. I executed GC several times (manually + programmatically); it did not help (although some classes were unloaded). The memory keeps growing while more batch jobs are executed.

Attached are Task Manager logs from yesterday, after a single batch execution (memory grew to 612MB and was never freed):

  1.  taskmgr.txt - Task manager logs (2021-02-28T16:06:05,983 is the timestamp when the job was submitted)
  2.  gc-class-historgram.txt
  3.  thread-print.txt
  4.  vm-class-loader-stats.txt
  5.  vm-class-loaders.txt
  6.  heap_info.txt

The same behavior has been observed in the Flink Client application: once the batch job is executed, the memory increases gradually and does not get cleaned up afterwards (we observed many ChildFirstClassLoader instances).


Thank you
Tamir.

________________________________
From: Kezhu Wang <ke...@gmail.com>
Sent: Sunday, February 28, 2021 6:57 PM
To: Tamir Sagi <Ta...@niceactimize.com>
Subject: Re: Suspected classloader leak in Flink 1.11.1


EXTERNAL EMAIL


Hi Tamir,

The histogram has no instance of `ChildFirstClassLoader`.

> we are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with Flink client on Spring boot application (using RestClusterClient).

> By analyzing the memory of the client Java application with profiling tools, We saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the Class Loader (as shown in the attached screenshot). Similarly, to the Flink task manager memory.

By batch job, do you mean that you compile the job graph from the DataSet API on the client side and then submit it through the RestClient? I am not familiar with the DataSet API; usually, there is no `ChildFirstClassLoader` creation on the client side for job graph building. Could you sketch a pseudocode for this, or did you create a `ChildFirstClassLoader` yourself?


> In addition, we have tried calling GC manually, but it did not change much.

It might take several runs to collect a class loader instance.


Best,
Kezhu Wang



On February 28, 2021 at 23:27:38, Tamir Sagi (tamir.sagi@niceactimize.com<ma...@niceactimize.com>) wrote:

Hey Kezhu,
Thanks for fast responding,

I've read that link a few days ago. Today I ran a simple batch job with a single operation (using the hadoop-s3 plugin), but the same behavior was observed.

Attached is the GC.class_histogram output (not filtered).


Tamir.



________________________________
From: Kezhu Wang <ke...@gmail.com>
Sent: Sunday, February 28, 2021 4:46 PM
To: user@flink.apache.org <us...@flink.apache.org>; Tamir Sagi <Ta...@niceactimize.com>
Subject: Re: Suspected classloader leak in Flink 1.11.1


EXTERNAL EMAIL


Hi Tamir,

You could check https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code for known class loading issues.

Besides this, I think a GC.class_histogram (even filtered) could help us list suspected objects.


Best,
Kezhu Wang



On February 28, 2021 at 21:25:07, Tamir Sagi (tamir.sagi@niceactimize.com<ma...@niceactimize.com>) wrote:

Hey all,

We are encountering memory issues on a Flink client and task managers, which I would like to raise here.

we are running Flink on a session cluster (version 1.11.1) on Kubernetes, submitting batch jobs with Flink client on Spring boot application (using RestClusterClient).

When jobs are being submitted and running, one after another, We see that the metaspace memory(with max size of  1GB)  keeps increasing, as well as linear increase in the heap memory (though it's a more moderate increase). We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling tools, We saw that there are many instances of Flink's ChildFirstClassLoader (perhaps as the number of jobs which were running), and therefore many instances of the same class, each from a different instance of the Class Loader (as shown in the attached screenshot). Similarly, to the Flink task manager memory.

We would expect to see one instance of Class Loader. Therefore, We suspect that the reason for the increase is Class Loaders not being cleaned.

Does anyone have some insights about this issue, or ideas how to proceed the investigation?


Flink Client application (VisualVm)








[Screenshot: VisualVM class list showing many com.fasterxml.jackson.databind.PropertyMetadata classes, each loaded by a different org.apache.flink.util.ChildFirstClassLoader instance, each retaining ~120 bytes]

We have used different GCs but same results.


Task Manager


Total Size 4GB

metaspace 1GB

Off heap 512mb


Screenshot from the Task Manager: 612MB are occupied and not being released.



We used jcmd tool and attached 3 files

  1.  Threads print
  2.  VM.metaspace output
  3.  VM.classloader

In addition, we have tried calling GC manually, but it did not change much.

Thank you





Re: Suspected classloader leak in Flink 1.11.1

Posted by Kezhu Wang <ke...@gmail.com>.
Hi Tamir,

You could check
https://ci.apache.org/projects/flink/flink-docs-stable/ops/debugging/debugging_classloading.html#unloading-of-dynamically-loaded-classes-in-user-code
for
known class loading issues.

Besides this, I think a GC.class_histogram (even filtered) could help us
list suspected objects.


Best,
Kezhu Wang


On February 28, 2021 at 21:25:07, Tamir Sagi (tamir.sagi@niceactimize.com)
wrote:


Hey all,

We are encountering memory issues on a Flink client and task managers,
which I would like to raise here.

we are running Flink on a session cluster (version 1.11.1) on Kubernetes,
submitting batch jobs with Flink client on Spring boot application (using
RestClusterClient).

When jobs are being submitted and running, one after another, We see that
the metaspace memory(with max size of  1GB)  keeps increasing, as well as
linear increase in the heap memory (though it's a more moderate increase).
We do see GC working on the heap and releasing some resources.

By analyzing the memory of the client Java application with profiling
tools, We saw that there are many instances of Flink's
ChildFirstClassLoader (perhaps as the number of jobs which were running),
and therefore many instances of the same class, each from a different
instance of the Class Loader (as shown in the attached screenshot).
Similarly, to the Flink task manager memory.

We would expect to see one instance of Class Loader. Therefore, We suspect
that the reason for the increase is Class Loaders not being cleaned.

Does anyone have some insights about this issue, or ideas how to proceed
the investigation?


*Flink Client application (VisualVm)*







[Screenshot: VisualVM class list showing many com.fasterxml.jackson.databind.PropertyMetadata classes, each loaded by a different org.apache.flink.util.ChildFirstClassLoader instance, each retaining ~120 bytes]

We have used different GCs but same results.


*Task Manager*


Total Size 4GB

metaspace 1GB

Off heap 512mb


Screenshot from the Task Manager: 612MB are occupied and not being released.

We used jcmd tool and attached 3 files


   1. Threads print
   2. VM.metaspace output
   3. VM.classloader

In addition, we have tried calling GC manually, but it did not change much.

Thank you



