Posted to user@flink.apache.org by John Smith <ja...@gmail.com> on 2022/04/14 15:59:19 UTC

Re: How to debug Metaspace exception?

Hi, so I have a dump file. What do I look for?

On Thu, Mar 31, 2022 at 3:28 PM John Smith <ja...@gmail.com> wrote:

> Ok, so if there's a leak and I manually stop the job and restart it from
> the UI multiple times, I won't see the issue because the classes
> are unloaded correctly?
>
>
> On Thu, Mar 31, 2022 at 9:20 AM huweihua <hu...@gmail.com> wrote:
>
>>
>> The difference is that manually canceling the job stops the JobMaster,
>> but automatic failover keeps the JobMaster running. From the
>> TaskManager's point of view, though, it doesn't make much difference.
>>
>>
>> On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>
>> Also, if I manually cancel and restart the same job over and over, is it
>> the same as if Flink was restarting a job due to failure?
>>
>> I.e., when I click "Cancel Job" on the UI, is the job completely unloaded,
>> versus when the job scheduler restarts a job for whatever reason?
>>
>> This way I'll stop and restart the job a few times, or maybe I can trick
>> my job into failing and have the scheduler restart it. Ok, let me think about
>> this...
>>
>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com> wrote:
>>
>>> So if I run the same jobs in my dev env will I still be able to see the
>>> similar dump?
>>>
>>> I think the issue should be reproducible by running the same job in dev;
>>> maybe you can give it a try.
>>>
>>>  If not I would have to wait at a low volume time to do it on
>>> production. Also if I recall the dump is as big as the JVM memory right so
>>> if I have 10GB configured for the JVM the dump will be 10GB file?
>>>
>>> Yes, jmap will pause the JVM, and the length of the pause depends on the
>>> size of the dump. You can use "jmap -dump:live" to dump only the reachable
>>> objects; this will only take a brief pause.
>>>
>>>
>>>
>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>
>>> I have 3 task managers (see config below). There is a total of 10 jobs
>>> with 25 slots being used.
>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it to
>>> JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>
>>> For jmap: I know that it will pause the task manager. So if I run the
>>> same jobs in my dev env, will I still be able to see a similar dump? I
>>> assume so. If not, I would have to wait for a low-volume time to do it on
>>> production. Also, if I recall, the dump is as big as the JVM memory, right?
>>> So if I have 10GB configured for the JVM, the dump will be a 10GB file?
>>>
>>>
>>> # Operating system has 16GB total.
>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>
>>> cluster.evenly-spread-out-slots: true
>>>
>>> taskmanager.memory.flink.size: 10240m
>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>> taskmanager.numberOfTaskSlots: 16
>>> parallelism.default: 1
>>>
>>> high-availability: zookeeper
>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>> high-availability.zookeeper.quorum: ...
>>> high-availability.zookeeper.path.root: /flink_1_14
>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>
>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>
>>> state.backend: rocksdb
>>> state.backend.incremental: true
>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>
>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com> wrote:
>>>
>>>> Hi, John
>>>>
>>>> Could you tell us your application scenario? Is it a Flink session
>>>> cluster with a lot of jobs?
>>>>
>>>> Maybe you can try to dump the memory with jmap and use tools such as
>>>> MAT to analyze whether there are abnormal classes and classloaders.
>>>>
>>>>
>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>> >
>>>> > Hi, running 1.14.4
>>>> >
>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>> > two things: either the job requires a larger size of JVM metaspace to load
>>>> > classes or there is a class loading leak.
>>>> >
>>>> > I have 2GB of metaspace configured:
>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>> >
>>>> > But the task nodes still fail.
>>>> >
>>>> > When looking at the UI metrics, the metaspace starts low. Now I see
>>>> > 85% usage. It seems to be a class loading leak at this point. How can we
>>>> > debug this issue?
>>>>
>>>>
>>>
>>

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

Ok, I don't think I'm running user code on the job manager. Basically, I'm
running a standalone cluster.

3 zookeepers
3 job managers
3 task managers.

I submit my jobs via the UI.

But just in case, I'll copy the config over to the job managers.
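
For reference, a minimal sketch of how a value such as
"org.apache.ignite.;com.microsoft.sqlserver.jdbc." (the pattern discussed in
the quoted messages) would be interpreted. It assumes the option is a
semicolon-separated list of class-name prefixes. The function names and the
com.example.MyEtlJob class are hypothetical; this is a Python model for
illustration, not Flink code.

```python
def parse_patterns(value):
    """Split a semicolon-separated pattern string into non-empty prefixes."""
    return [p.strip() for p in value.split(";") if p.strip()]


def is_parent_first(class_name, patterns):
    """Return True if the class name starts with any configured prefix."""
    return any(class_name.startswith(p) for p in patterns)


# The value proposed in the thread for classloader.parent-first-patterns.additional
patterns = parse_patterns("org.apache.ignite.;com.microsoft.sqlserver.jdbc.")

print(is_parent_first("org.apache.ignite.IgniteJdbcThinDriver", patterns))        # True
print(is_parent_first("com.microsoft.sqlserver.jdbc.SQLServerDriver", patterns))  # True
print(is_parent_first("com.example.MyEtlJob", patterns))                          # False
```

Under that assumption the trailing dot matters: it keeps the prefix from
accidentally matching an unrelated package that merely starts with the same
characters.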



On Mon, May 2, 2022 at 11:00 AM Chesnay Schepler <ch...@apache.org> wrote:

> There are cases where user-code is run on the JobManager.
> I'm not sure whether that applies to the JDBC sources, though.
>
> On 02/05/2022 15:45, John Smith wrote:
>
> Why do the JDBC jars need to be on the job manager node though?
>
> On Mon, May 2, 2022 at 9:36 AM Chesnay Schepler <ch...@apache.org>
> wrote:
>
>> yes.
>> But if you can ensure that the driver isn't bundled by any user-jar you
>> can also skip the pattern configuration step.
>>
>> The pattern looks correct formatting-wise; you could try whether
>> com.microsoft.sqlserver.jdbc. is enough to solve the issue.
>>
>> On 02/05/2022 14:41, John Smith wrote:
>>
>> Oh, so I should copy the jars to the lib folder and
>> set classloader.parent-first-patterns.additional:
>> "org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the task
>> managers and job managers?
>>
>> Also is my pattern correct?
>> "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
>>
>> Just to be sure I'm running a standalone cluster using zookeeper. So I
>> have 3 zookeepers, 3 job managers and 3 task managers.
>>
>>
>> On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler <ch...@apache.org>
>> wrote:
>>
>>> And you should make sure that it is set for both processes!
>>>
>>> On 02/05/2022 08:43, Chesnay Schepler wrote:
>>>
>>> The setting itself isn't taskmanager specific; it applies to both the
>>> job- and taskmanager process.
>>>
>>> On 02/05/2022 05:29, John Smith wrote:
>>>
>>> Also just to be sure this is a Task Manager setting right?
>>>
>>> On Thu, Apr 28, 2022 at 11:13 AM John Smith <ja...@gmail.com>
>>> wrote:
>>>
>>>> I assume you will take action on your side to track and fix the doc? :)
>>>>
>>>> On Thu, Apr 28, 2022 at 11:12 AM John Smith <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok so to summarize...
>>>>>
>>>>> - Build my job jar with the JDBC driver as a compile-only
>>>>> dependency and copy the JDBC driver to the Flink lib folder.
>>>>>
>>>>> Or
>>>>>
>>>>> - Build my job jar and include the JDBC driver in the shadow jar, plus
>>>>> copy the JDBC driver to the Flink lib folder, plus make an entry in the
>>>>> config for classloader.parent-first-patterns.additional
>>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>
>>>>>
>>>>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ch...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I think what I meant was "either add it to /lib, or [if it is already
>>>>>> in /lib but also bundled in the jar] add it to the parent-first patterns."
>>>>>>
>>>>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>>>>
>>>>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>>>>
>>>>>> On 28/04/2022 15:49, John Smith wrote:
>>>>>>
>>>>>> You sure?
>>>>>>
>>>>>>    - *JDBC*: JDBC drivers leak references outside the user code
>>>>>>    classloader. To ensure that these classes are only loaded once you should
>>>>>>    either add the driver jars to Flink’s lib/ folder, or add the
>>>>>>    driver classes to the list of parent-first loaded class via
>>>>>>    classloader.parent-first-patterns.additional
>>>>>>    <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>>>>>
>>>>>>    It says either or
>>>>>>
>>>>>>
>>>>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ch...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> You're misinterpreting the docs.
>>>>>>>
>>>>>>> The parent/child-first classloading controls where Flink looks for a
>>>>>>> class *first*, specifically whether we first load from /lib or the
>>>>>>> user-jar.
>>>>>>> It does not allow you to load something from the user-jar in the
>>>>>>> parent classloader. That's just not how it works.
>>>>>>>
>>>>>>> It must be in /lib.
>>>>>>>
>>>>>>> On 27/04/2022 04:59, John Smith wrote:
>>>>>>>
>>>>>>> Hi Chesnay as per the docs...
>>>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>>>>
>>>>>>> You can either put the jars in the task manager lib folder or use
>>>>>>> classloader.parent-first-patterns.additional
>>>>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>>>
>>>>>>> I prefer the latter; that way the dependency stays with the
>>>>>>> user-jar and not on the task manager.
>>>>>>>
>>>>>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, so I should put the Apache Ignite and my Microsoft drivers in
>>>>>>>> the lib folders of my task managers?
>>>>>>>>
>>>>>>>> And then in my job jar only include them as compile-time
>>>>>>>> dependencies?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <
>>>>>>>> chesnay@apache.org> wrote:
>>>>>>>>
>>>>>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>>>>>
>>>>>>>>> You have correctly identified your alternatives.
>>>>>>>>>
>>>>>>>>> You must put the JDBC driver into /lib instead. Setting only the
>>>>>>>>> parent-first pattern shouldn't affect anything.
>>>>>>>>> That is only relevant if something is both in /lib and in the
>>>>>>>>> user-jar, telling Flink to prioritize what is in /lib.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>>>>>
>>>>>>>>> So I put classloader.parent-first-patterns.additional:
>>>>>>>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>>>>>>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>>>>>
>>>>>>>>> Or it's too early to tell.
>>>>>>>>>
>>>>>>>>> Though now, the task managers are shutting down due to some
>>>>>>>>> other failures.
>>>>>>>>>
>>>>>>>>> So maybe because tasks were failing and reloading often, the task
>>>>>>>>> manager was running out of Metaspace. But now maybe it's just
>>>>>>>>> cleanly shutting down.
>>>>>>>>>
>>>>>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <
>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Or can I put in the config to treat org.apache.ignite. classes as
>>>>>>>>>> parent-first?
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <
>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>>
>>>>>>>>>>> - On the Histogram, I got over 30 entries for:
>>>>>>>>>>> ChildFirstClassLoader
>>>>>>>>>>> - Then on one of them I chose "Merge Shortest Path..." and
>>>>>>>>>>> picked "Exclude all phantom/weak/soft references"
>>>>>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>>>>>>>>>> Driver
>>>>>>>>>>>
>>>>>>>>>>> So I'm guessing this applies to anything JDBC-based. Should I copy the
>>>>>>>>>>> drivers into the task manager lib folder and have my jobs declare them
>>>>>>>>>>> as compile-only dependencies?
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>>>>>>>>> yaroslav@goldsky.io> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Also
>>>>>>>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>>>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>>>>>>>>>
>>>>>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <
>>>>>>>>>>>> chesnay@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> We have a very rough "guide" in the wiki (it's just the
>>>>>>>>>>>>> specific steps I took to debug another leak):
>>>>>>>>>>>>>
>>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>>>>
>>>>>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>
>>>>>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the
>>>>>>>>>>>>> dump file. Check whether there are too many loaded classes.
>>>>>>>>>>>>>
>>>>>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, can anyone help with this? I've never looked at a dump file
>>>>>>>>>>>>> before.

Re: How to debug Metaspace exception?

Posted by Chesnay Schepler <ch...@apache.org>.

There are cases where user-code is run on the JobManager.
I'm not sure whether that applies to the JDBC sources, though.
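
The lookup order described earlier in the thread (the parent-first patterns
only decide whether /lib or the user-jar is consulted first, and never make
the parent classloader see the user-jar) can be modeled with a small toy
sketch. Everything here is an illustrative assumption: the resolve function is
hypothetical, the parent loader is modeled as seeing only lib/, the child
(user-code) loader as seeing only the job jar, and none of this is Flink's
actual implementation.

```python
def resolve(class_name, lib_classes, user_jar_classes, parent_first_prefixes):
    """Return which loader ('parent' or 'child') would define the class.

    Toy model: the prefixes only change which loader is asked first;
    they never give the parent loader access to the job jar.
    """
    parent_first = any(class_name.startswith(p) for p in parent_first_prefixes)
    order = ["parent", "child"] if parent_first else ["child", "parent"]
    visible = {"parent": lib_classes, "child": user_jar_classes}
    for loader in order:
        if class_name in visible[loader]:
            return loader
    raise LookupError(class_name)


DRIVER = "com.microsoft.sqlserver.jdbc.SQLServerDriver"
PREFIXES = ["com.microsoft.sqlserver.jdbc."]

# Driver only in the job jar: the pattern cannot help, the child loads it.
print(resolve(DRIVER, set(), {DRIVER}, PREFIXES))     # child
# Driver in lib/ and bundled in the jar: the pattern makes the parent win.
print(resolve(DRIVER, {DRIVER}, {DRIVER}, PREFIXES))  # parent
# Driver in lib/ and bundled, no pattern: child-first picks the bundled copy.
print(resolve(DRIVER, {DRIVER}, {DRIVER}, []))        # child
# Driver only in lib/: the parent loads it with or without a pattern.
print(resolve(DRIVER, {DRIVER}, set(), []))           # parent
```

In this model the only configurations where the driver ends up in the parent
loader are the ones where the jar is in lib/, which matches the "it must be in
/lib" conclusion above; the pattern only matters when the driver is also
bundled in the user-jar.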

On 02/05/2022 15:45, John Smith wrote:
> Why do the JDBC jars need to be on the job manager node though?
>
> On Mon, May 2, 2022 at 9:36 AM Chesnay Schepler <ch...@apache.org> 
> wrote:
>
>     yes.
>     But if you can ensure that the driver isn't bundled by any
>     user-jar you can also skip the pattern configuration step.
>
>     The pattern looks correct formatting-wise; you could try whether
>     com.microsoft.sqlserver.jdbc. is enough to solve the issue.
>
>     On 02/05/2022 14:41, John Smith wrote:
>>     Oh, so I should copy the jars to the lib folder and
>>     set classloader.parent-first-patterns.additional:
>>     "org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the
>>     task managers and job managers?
>>
>>     Also is my pattern correct?
>>     "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
>>
>>     Just to be sure I'm running a standalone cluster using zookeeper.
>>     So I have 3 zookeepers, 3 job managers and 3 task managers.
>>
>>
>>     On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler
>>     <ch...@apache.org> wrote:
>>
>>         And you do should make sure that it is set for both processes!
>>
>>         On 02/05/2022 08:43, Chesnay Schepler wrote:
>>>         The setting itself isn't taskmanager specific; it applies to
>>>         both the job- and taskmanager process.
>>>
>>>         On 02/05/2022 05:29, John Smith wrote:
>>>>         Also just to be sure this is a Task Manager setting right?
>>>>
>>>>         On Thu, Apr 28, 2022 at 11:13 AM John Smith
>>>>         <ja...@gmail.com> wrote:
>>>>
>>>>             I assume you will take action on your side to track and
>>>>             fix the doc? :)
>>>>
>>>>             On Thu, Apr 28, 2022 at 11:12 AM John Smith
>>>>             <ja...@gmail.com> wrote:
>>>>
>>>>                 Ok so to summarize...
>>>>
>>>>                 - Build my job jar and have the JDBC driver as a
>>>>                 compile only dependency and copy the JDBC driver to
>>>>                 flink lib folder.
>>>>
>>>>                 Or
>>>>
>>>>                 - Build my job jar and include JDBC driver in the
>>>>                 shadow, plus copy the JDBC driver in the flink lib
>>>>                 folder, plus  make an entry in config for
>>>>                 |classloader.parent-first-patterns-additional|
>>>>                 <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>
>>>>
>>>>                 On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
>>>>                 <ch...@apache.org> wrote:
>>>>
>>>>                     I think what I meant was "either add it to
>>>>                     /lib, or [if it is already in /lib but also
>>>>                     bundled in the jar] add it to the parent-first
>>>>                     patterns."
>>>>
>>>>                     On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>>>                     Pretty sure, even though I seemingly
>>>>>                     documented it incorrectly :)
>>>>>
>>>>>                     On 28/04/2022 15:49, John Smith wrote:
>>>>>>                     You sure?
>>>>>>
>>>>>>                      *
>>>>>>
>>>>>>                         /JDBC/: JDBC drivers leak references
>>>>>>                         outside the user code classloader. To
>>>>>>                         ensure that these classes are only loaded
>>>>>>                         once you should either add the driver
>>>>>>                         jars to Flink’s |lib/| folder, or add the
>>>>>>                         driver classes to the list of
>>>>>>                         parent-first loaded class via
>>>>>>                         |classloader.parent-first-patterns-additional|
>>>>>>                         <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>>>>>
>>>>>>                         It says either or
>>>>>>
>>>>>>
>>>>>>                     On Wed, Apr 27, 2022 at 3:44 AM Chesnay
>>>>>>                     Schepler <ch...@apache.org> wrote:
>>>>>>
>>>>>>                         You're misinterpreting the docs.
>>>>>>
>>>>>>                         The parent/child-first classloading
>>>>>>                         controls where Flink looks for a class
>>>>>>                         /first/, specifically whether we first
>>>>>>                         load from /lib or the user-jar.
>>>>>>                         It does not allow you to load something
>>>>>>                         from the user-jar in the parent
>>>>>>                         classloader. That's just not how it works.
>>>>>>
>>>>>>                         It must be in /lib.
>>>>>>
>>>>>>                         On 27/04/2022 04:59, John Smith wrote:
>>>>>>>                         Hi Chesnay as per the docs...
>>>>>>>                         https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>>>>
>>>>>>>                         You can either put the jars in task
>>>>>>>                         manager lib folder or use
>>>>>>>                         |classloader.parent-first-patterns-additional|
>>>>>>>                         <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>>>
>>>>>>>                         I prefer the latter like this: the
>>>>>>>                         dependency stays with the user-jar and
>>>>>>>                         not on the task manager.
>>>>>>>
>>>>>>>                         On Tue, Apr 26, 2022 at 9:52 PM John
>>>>>>>                         Smith <ja...@gmail.com> wrote:
>>>>>>>
>>>>>>>                             Ok so I should put the Apache ignite
>>>>>>>                             and my Microsoft drivers in the lib
>>>>>>>                             folders of my task managers?
>>>>>>>
>>>>>>>                             And then in my job jar only include
>>>>>>>                             them as compile time dependencies?
>>>>>>>
>>>>>>>
>>>>>>>                             On Tue, Apr 26, 2022 at 10:42 AM
>>>>>>>                             Chesnay Schepler
>>>>>>>                             <ch...@apache.org> wrote:
>>>>>>>
>>>>>>>                                 JDBC drivers are well-known for
>>>>>>>                                 leaking classloaders unfortunately.
>>>>>>>
>>>>>>>                                 You have correctly identified
>>>>>>>                                 your alternatives.
>>>>>>>
>>>>>>>                                 You must put the jdbc driver
>>>>>>>                                 into /lib instead. Setting only
>>>>>>>                                 the parent-first pattern
>>>>>>>                                 shouldn't affect anything.
>>>>>>>                                 That is only relevant if
>>>>>>>                                 something is in both in /lib and
>>>>>>>                                 the user-jar, telling Flink to
>>>>>>>                                 prioritize what is in lib.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>                                 On 26/04/2022 15:35, John Smith
>>>>>>>                                 wrote:
>>>>>>>>                                 So I
>>>>>>>>                                 put classloader.parent-first-patterns.additional:
>>>>>>>>                                 "org.apache.ignite." in the
>>>>>>>>                                 task config and so far I don't
>>>>>>>>                                 think I'm getting
>>>>>>>>                                 "java.lang.OutOfMemoryError:
>>>>>>>>                                 Metaspace" any more.
>>>>>>>>
>>>>>>>>                                 Or it's too early to tell.
>>>>>>>>
>>>>>>>>                                 Though now, the task managers
>>>>>>>>                                 are shutting down due to some
>>>>>>>>                                 other failures.
>>>>>>>>
>>>>>>>>                                 So maybe because tasks were
>>>>>>>>                                 failing and reloading often the
>>>>>>>>                                 task manager was running out of
>>>>>>>>                                 Metspace. But now maybe it's
>>>>>>>>                                 just cleanly shutting down.
>>>>>>>>
>>>>>>>>                                 On Wed, Apr 20, 2022 at 11:35
>>>>>>>>                                 AM John Smith
>>>>>>>>                                 <ja...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>                                     Or I can put in the config
>>>>>>>>                                     to treat org.apache.ignite.
>>>>>>>>                                     classes as first class?
>>>>>>>>
>>>>>>>>                                     On Tue, Apr 19, 2022 at
>>>>>>>>                                     10:18 PM John Smith
>>>>>>>>                                     <ja...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>                                         Ok, so I loaded the
>>>>>>>>                                         dump into Eclipse Mat
>>>>>>>>                                         and followed:
>>>>>>>>                                         https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>
>>>>>>>>                                         - On the Histogram, I
>>>>>>>>                                         got over 30 entries
>>>>>>>>                                         for: ChildFirstClassLoader
>>>>>>>>                                         - Then I clicked on one
>>>>>>>>                                         of them "Merge Shortest
>>>>>>>>                                         Path..." and picked
>>>>>>>>                                         "Exclude all
>>>>>>>>                                         phantom/weak/soft
>>>>>>>>                                         references"
>>>>>>>>                                         - Which then gave me:
>>>>>>>>                                         SqlDriverManager >
>>>>>>>>                                         Apache Ignite JdbcThin
>>>>>>>>                                         Driver
>>>>>>>>
>>>>>>>>                                         So i'm
>>>>>>>>                                         guessing anything JDBC
>>>>>>>>                                         based. I should copy
>>>>>>>>                                         into the task manager
>>>>>>>>                                         libs folder and my jobs
>>>>>>>>                                         make the dependencies
>>>>>>>>                                         as compile only?
>>>>>>>>
>>>>>>>>                                         On Tue, Apr 19, 2022 at
>>>>>>>>                                         12:18 PM Yaroslav
>>>>>>>>                                         Tkachenko
>>>>>>>>                                         <ya...@goldsky.io>
>>>>>>>>                                         wrote:
>>>>>>>>
>>>>>>>>                                             Also
>>>>>>>>                                             https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>>                                             might be helpful
>>>>>>>>                                             (has a section on
>>>>>>>>                                             profiling, as well
>>>>>>>>                                             as classloading).
>>>>>>>>
>>>>>>>>                                             On Tue, Apr 19,
>>>>>>>>                                             2022 at 4:35 AM
>>>>>>>>                                             Chesnay Schepler
>>>>>>>>                                             <ch...@apache.org>
>>>>>>>>                                             wrote:
>>>>>>>>
>>>>>>>>                                                 We have a very
>>>>>>>>                                                 rough "guide"
>>>>>>>>                                                 in the wiki
>>>>>>>>                                                 (it's just the
>>>>>>>>                                                 specific steps
>>>>>>>>                                                 I took to debug
>>>>>>>>                                                 another leak):
>>>>>>>>                                                 https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>
>>>>>>>>                                                 On 19/04/2022
>>>>>>>>                                                 12:01, huweihua
>>>>>>>>                                                 wrote:
>>>>>>>>>                                                 Hi, John
>>>>>>>>>
>>>>>>>>>                                                 Sorry for the
>>>>>>>>>                                                 late reply.
>>>>>>>>>                                                 You can use
>>>>>>>>>                                                 MAT[1] to
>>>>>>>>>                                                 analyze the
>>>>>>>>>                                                 dump file.
>>>>>>>>>                                                 Check whether
>>>>>>>>>                                                 too many classes
>>>>>>>>>                                                 are loaded.
>>>>>>>>>
>>>>>>>>>                                                 [1]
>>>>>>>>>                                                 https://www.eclipse.org/mat/
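
A heap dump is not the only way to see whether classes are piling up. As a rough sketch (not something proposed in the thread), the standard java.lang.management beans report the loaded-class counters and Metaspace usage of the JVM they run in. The class name MetaspaceProbe is hypothetical, and the pool name "Metaspace" is an assumption that holds for HotSpot JVMs; to observe a TaskManager, the code would have to run inside that process or the same beans would have to be read over JMX.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

// Hypothetical helper: reports class-loading counters for the current JVM.
public class MetaspaceProbe {

    /** Number of classes currently loaded in this JVM. */
    public static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Bytes used by the pool named "Metaspace" (HotSpot), or -1 if absent. */
    public static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    public static void main(String[] args) {
        ClassLoadingMXBean cl = ManagementFactory.getClassLoadingMXBean();
        // A loaded count that keeps growing across job restarts while the
        // unloaded count stays flat is consistent with a classloader leak.
        System.out.println("loaded now:     " + cl.getLoadedClassCount());
        System.out.println("loaded total:   " + cl.getTotalLoadedClassCount());
        System.out.println("unloaded total: " + cl.getUnloadedClassCount());
        System.out.println("metaspace used: " + metaspaceUsedBytes() + " bytes");
    }
}
```

Sampling these numbers before and after each restart gives a trend to compare against what MAT shows in the dump.
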
>>>>>>>>>
>>>>>>>>>>                                                 On Apr 18, 2022,
>>>>>>>>>>                                                 at 9:55 PM, John
>>>>>>>>>>                                                 Smith
>>>>>>>>>>                                                 <ja...@gmail.com>
>>>>>>>>>>                                                 wrote:
>>>>>>>>>>
>>>>>>>>>>                                                 Hi, can
>>>>>>>>>>                                                 anyone help
>>>>>>>>>>                                                 with this? I
>>>>>>>>>>                                                 have never looked
>>>>>>>>>>                                                 at a dump
>>>>>>>>>>                                                 file before.
>>>>>>>>>>
>>>>>>>>>>                                                 On Thu, Apr
>>>>>>>>>>                                                 14, 2022 at
>>>>>>>>>>                                                 11:59 AM John
>>>>>>>>>>                                                 Smith
>>>>>>>>>>                                                 <ja...@gmail.com>
>>>>>>>>>>                                                 wrote:
>>>>>>>>>>
>>>>>>>>>>                                                     Hi, so I
>>>>>>>>>>                                                     have a
>>>>>>>>>>                                                     dump
>>>>>>>>>>                                                     file.
>>>>>>>>>>                                                     What do I
>>>>>>>>>>                                                     look for?
>>>>>>>>>>
>>>>>>>>>>                                                     On Thu,
>>>>>>>>>>                                                     Mar 31,
>>>>>>>>>>                                                     2022 at
>>>>>>>>>>                                                     3:28 PM
>>>>>>>>>>                                                     John
>>>>>>>>>>                                                     Smith
>>>>>>>>>>                                                     <ja...@gmail.com>
>>>>>>>>>>                                                     wrote:
>>>>>>>>>>
>>>>>>>>>>                                                         Ok, so if there's a leak and I manually
>>>>>>>>>>                                                         stop the job and restart it from the UI
>>>>>>>>>>                                                         multiple times, I won't see the issue
>>>>>>>>>>                                                         because the classes are unloaded
>>>>>>>>>>                                                         correctly?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                                                         On Thu, Mar 31, 2022 at 9:20 AM huweihua
>>>>>>>>>>                                                         <hu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>                                                             The difference is that manually canceling
>>>>>>>>>>                                                             the job stops the JobMaster, but automatic
>>>>>>>>>>                                                             failover keeps the JobMaster running.
>>>>>>>>>>                                                             From the TaskManager's point of view,
>>>>>>>>>>                                                             though, it doesn't make much difference.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>                                                             On Mar 31, 2022, at 4:01 AM, John Smith
>>>>>>>>>>>                                                             <ja...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>                                                             Also, if I manually cancel and restart the
>>>>>>>>>>>                                                             same job over and over, is it the same as
>>>>>>>>>>>                                                             if Flink was restarting a job due to
>>>>>>>>>>>                                                             failure?
>>>>>>>>>>>
>>>>>>>>>>>                                                             I.e.: when I click "Cancel Job" on the UI,
>>>>>>>>>>>                                                             is the job completely unloaded, vs. when
>>>>>>>>>>>                                                             the job scheduler restarts a job for
>>>>>>>>>>>                                                             whatever reason?
>>>>>>>>>>>
>>>>>>>>>>>                                                             Like this I'll stop and restart the job a
>>>>>>>>>>>                                                             few times, or maybe I can trick my job
>>>>>>>>>>>                                                             into failing and have the scheduler
>>>>>>>>>>>                                                             restart it. Ok, let me think about this...
>>>>>>>>>>>
>>>>>>>>>>>                                                             On Wed, Mar 30, 2022 at 10:24 AM 胡伟华
>>>>>>>>>>>                                                             <hu...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>                                                                 So if I run the same jobs in my dev
>>>>>>>>>>>>                                                                 env, will I still be able to see a
>>>>>>>>>>>>                                                                 similar dump?
>>>>>>>>>>>>
>>>>>>>>>>>                                                                 I think running the same job in dev
>>>>>>>>>>>                                                                 should reproduce it; maybe you can
>>>>>>>>>>>                                                                 give it a try.
>>>>>>>>>>>
>>>>>>>>>>>>                                                                 If not, I would have to wait for a
>>>>>>>>>>>>                                                                 low-volume time to do it on
>>>>>>>>>>>>                                                                 production. Also, if I recall, the
>>>>>>>>>>>>                                                                 dump is as big as the JVM memory,
>>>>>>>>>>>>                                                                 right? So if I have 10GB configured
>>>>>>>>>>>>                                                                 for the JVM, the dump will be a 10GB
>>>>>>>>>>>>                                                                 file?
>>>>>>>>>>>                                                                 Yes, jmap will pause the JVM; the
>>>>>>>>>>>                                                                 length of the pause depends on the
>>>>>>>>>>>                                                                 size of the dump. You can use
>>>>>>>>>>>                                                                 "jmap -dump:live" to dump only the
>>>>>>>>>>>                                                                 reachable objects; this will cause
>>>>>>>>>>>                                                                 only a brief pause.
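
On the command line, the live dump mentioned above is usually taken with something like "jmap -dump:live,format=b,file=heap.hprof <pid>". For readers who prefer to trigger it from code, the following is a minimal sketch using the JDK's HotSpotDiagnosticMXBean, which dumps the heap of the JVM it runs in. The class name LiveHeapDump and the default file name are hypothetical, and this assumes a HotSpot-based JDK; it is an illustration, not how Flink itself takes dumps.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: writes a heap dump of the current JVM.
public class LiveHeapDump {

    /**
     * Dumps the heap to target and returns the resulting file size in bytes.
     * With live=true only reachable objects are written, which mirrors the
     * "live" option of jmap.
     */
    public static long dump(Path target, boolean live) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap fails if the file already exists, so remove any old dump.
        Files.deleteIfExists(target);
        bean.dumpHeap(target.toAbsolutePath().toString(), live);
        return Files.size(target);
    }

    public static void main(String[] args) throws IOException {
        // Recent JDKs require the file name to end in ".hprof".
        Path target = Paths.get(args.length > 0 ? args[0] : "live-dump-sketch.hprof");
        long bytes = dump(target, true);
        System.out.println("wrote " + bytes + " bytes to " + target);
    }
}
```

Printing the file size is one way to check, in a dev environment, how large a live dump actually turns out to be before trying it on production.
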
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>>                                                                 On Mar 30, 2022, at 9:47 PM, John
>>>>>>>>>>>>                                                                 Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                 I have 3 task managers (see config
>>>>>>>>>>>>                                                                 below). There is a total of 10 jobs
>>>>>>>>>>>>                                                                 with 25 slots being used. The jobs
>>>>>>>>>>>>                                                                 are 100% ETL, i.e. they load JSON,
>>>>>>>>>>>>                                                                 transform it and push it to JDBC;
>>>>>>>>>>>>                                                                 only 1 job of the 10 is pushing to
>>>>>>>>>>>>                                                                 an Apache Ignite cluster.
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                 For jmap: I know that it will pause
>>>>>>>>>>>>                                                                 the task manager. So if I run the
>>>>>>>>>>>>                                                                 same jobs in my dev env, will I
>>>>>>>>>>>>                                                                 still be able to see a similar
>>>>>>>>>>>>                                                                 dump? I assume so.
>>>>>>>>>>>>                                                                 If
>>>>>>>>>>>>                                                                 not
>>>>>>>>>>>>                                                                 I
>>>>>>>>>>>>                                                                 would
>>>>>>>>>>>>                                                                 have
>>>>>>>>>>>>                                                                 to
>>>>>>>>>>>>                                                                 wait
>>>>>>>>>>>>                                                                 at
>>>>>>>>>>>>                                                                 a
>>>>>>>>>>>>                                                                 low
>>>>>>>>>>>>                                                                 volume
>>>>>>>>>>>>                                                                 time
>>>>>>>>>>>>                                                                 to
>>>>>>>>>>>>                                                                 do
>>>>>>>>>>>>                                                                 it
>>>>>>>>>>>>                                                                 on
>>>>>>>>>>>>                                                                 production.
>>>>>>>>>>>>                                                                 Aldo
>>>>>>>>>>>>                                                                 if
>>>>>>>>>>>>                                                                 I
>>>>>>>>>>>>                                                                 recall
>>>>>>>>>>>>                                                                 the
>>>>>>>>>>>>                                                                 dump
>>>>>>>>>>>>                                                                 is
>>>>>>>>>>>>                                                                 as
>>>>>>>>>>>>                                                                 big
>>>>>>>>>>>>                                                                 as
>>>>>>>>>>>>                                                                 the
>>>>>>>>>>>>                                                                 JVM
>>>>>>>>>>>>                                                                 memory
>>>>>>>>>>>>                                                                 right
>>>>>>>>>>>>                                                                 so
>>>>>>>>>>>>                                                                 if
>>>>>>>>>>>>                                                                 I
>>>>>>>>>>>>                                                                 have
>>>>>>>>>>>>                                                                 10GB
>>>>>>>>>>>>                                                                 configed
>>>>>>>>>>>>                                                                 for
>>>>>>>>>>>>                                                                 the
>>>>>>>>>>>>                                                                 JVM
>>>>>>>>>>>>                                                                 the
>>>>>>>>>>>>                                                                 dump
>>>>>>>>>>>>                                                                 will
>>>>>>>>>>>>                                                                 be
>>>>>>>>>>>>                                                                 10GB
>>>>>>>>>>>>                                                                 file?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                 #
>>>>>>>>>>>>                                                                 Operating
>>>>>>>>>>>>                                                                 system
>>>>>>>>>>>>                                                                 has
>>>>>>>>>>>>                                                                 16GB
>>>>>>>>>>>>                                                                 total.
>>>>>>>>>>>>                                                                 env.ssh.opts:
>>>>>>>>>>>>                                                                 -l
>>>>>>>>>>>>                                                                 flink
>>>>>>>>>>>>                                                                 -oStrictHostKeyChecking=no
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                 cluster.evenly-spread-out-slots:
>>>>>>>>>>>>                                                                 true
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                 taskmanager.memory.flink.size:
>>>>>>>>>>>>                                                                 10240m
>>>>>>>>>>>>                                                                 taskmanager.memory.jvm-metaspace.size:
>>>>>>>>>>>>                                                                 2048m
>>>>>>>>>>>>                                                                 taskmanager.numberOfTaskSlots:
>>>>>>>>>>>>                                                                 16
>>>>>>>>>>>>                                                                 parallelism.default:
>>>>>>>>>>>>                                                                 1
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                 high-availability:
>>>>>>>>>>>>                                                                 zookeeper
>>>>>>>>>>>>                                                                 high-availability.storageDir:
>>>>>>>>>>>>                                                                 file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>                                                                 high-availability.zookeeper.quorum:
>>>>>>>>>>>>                                                                 ...
>>>>>>>>>>>>                                                                 high-availability.zookeeper.path.root:
>>>>>>>>>>>>                                                                 /flink_1_14
>>>>>>>>>>>>                                                                 high-availability.cluster-id:
>>>>>>>>>>>>                                                                 /flink_1_14_cluster_0001
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                 web.upload.dir:
>>>>>>>>>>>>                                                                 /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                 state.backend:
>>>>>>>>>>>>                                                                 rocksdb
>>>>>>>>>>>>                                                                 state.backend.incremental:
>>>>>>>>>>>>                                                                 true
>>>>>>>>>>>>                                                                 state.checkpoints.dir:
>>>>>>>>>>>>                                                                 file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>                                                                 state.savepoints.dir:
>>>>>>>>>>>>                                                                 file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                 On
>>>>>>>>>>>>                                                                 Wed,
>>>>>>>>>>>>                                                                 Mar
>>>>>>>>>>>>                                                                 30,
>>>>>>>>>>>>                                                                 2022
>>>>>>>>>>>>                                                                 at
>>>>>>>>>>>>                                                                 2:16
>>>>>>>>>>>>                                                                 AM
>>>>>>>>>>>>                                                                 胡伟华
>>>>>>>>>>>>                                                                 <hu...@gmail.com>
>>>>>>>>>>>>                                                                 wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                     Hi,
>>>>>>>>>>>>                                                                     John
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                     Could
>>>>>>>>>>>>                                                                     you
>>>>>>>>>>>>                                                                     tell
>>>>>>>>>>>>                                                                     us
>>>>>>>>>>>>                                                                     you
>>>>>>>>>>>>                                                                     application
>>>>>>>>>>>>                                                                     scenario?
>>>>>>>>>>>>                                                                     Is
>>>>>>>>>>>>                                                                     it
>>>>>>>>>>>>                                                                     a
>>>>>>>>>>>>                                                                     flink
>>>>>>>>>>>>                                                                     session
>>>>>>>>>>>>                                                                     cluster
>>>>>>>>>>>>                                                                     with
>>>>>>>>>>>>                                                                     a
>>>>>>>>>>>>                                                                     lot
>>>>>>>>>>>>                                                                     of
>>>>>>>>>>>>                                                                     jobs?
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                     Maybe
>>>>>>>>>>>>                                                                     you
>>>>>>>>>>>>                                                                     can
>>>>>>>>>>>>                                                                     try
>>>>>>>>>>>>                                                                     to
>>>>>>>>>>>>                                                                     dump
>>>>>>>>>>>>                                                                     the
>>>>>>>>>>>>                                                                     memory
>>>>>>>>>>>>                                                                     with
>>>>>>>>>>>>                                                                     jmap
>>>>>>>>>>>>                                                                     and
>>>>>>>>>>>>                                                                     use
>>>>>>>>>>>>                                                                     tools
>>>>>>>>>>>>                                                                     such
>>>>>>>>>>>>                                                                     as
>>>>>>>>>>>>                                                                     MAT
>>>>>>>>>>>>                                                                     to
>>>>>>>>>>>>                                                                     analyze
>>>>>>>>>>>>                                                                     whether
>>>>>>>>>>>>                                                                     there
>>>>>>>>>>>>                                                                     are
>>>>>>>>>>>>                                                                     abnormal
>>>>>>>>>>>>                                                                     classes
>>>>>>>>>>>>                                                                     and
>>>>>>>>>>>>                                                                     classloaders
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>                                                                     2022年3月30日
>>>>>>>>>>>>                                                                     上午6:09，John
>>>>>>>>>>>>                                                                     Smith
>>>>>>>>>>>>                                                                     <ja...@gmail.com>
>>>>>>>>>>>>                                                                     写道：
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>                                                                     Hi
>>>>>>>>>>>>                                                                     running
>>>>>>>>>>>>                                                                     1.14.4
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>                                                                     My
>>>>>>>>>>>>                                                                     tasks
>>>>>>>>>>>>                                                                     manager
>>>>>>>>>>>>                                                                     still
>>>>>>>>>>>>                                                                     fails
>>>>>>>>>>>>                                                                     with
>>>>>>>>>>>>                                                                     java.lang.OutOfMemoryError:
>>>>>>>>>>>>                                                                     Metaspace.
>>>>>>>>>>>>                                                                     The
>>>>>>>>>>>>                                                                     metaspace
>>>>>>>>>>>>                                                                     out-of-memory
>>>>>>>>>>>>                                                                     error
>>>>>>>>>>>>                                                                     has
>>>>>>>>>>>>                                                                     occurred.
>>>>>>>>>>>>                                                                     This
>>>>>>>>>>>>                                                                     can
>>>>>>>>>>>>                                                                     mean
>>>>>>>>>>>>                                                                     two
>>>>>>>>>>>>                                                                     things:
>>>>>>>>>>>>                                                                     either
>>>>>>>>>>>>                                                                     the
>>>>>>>>>>>>                                                                     job
>>>>>>>>>>>>                                                                     requires
>>>>>>>>>>>>                                                                     a
>>>>>>>>>>>>                                                                     larger
>>>>>>>>>>>>                                                                     size
>>>>>>>>>>>>                                                                     of
>>>>>>>>>>>>                                                                     JVM
>>>>>>>>>>>>                                                                     metaspace
>>>>>>>>>>>>                                                                     to
>>>>>>>>>>>>                                                                     load
>>>>>>>>>>>>                                                                     classes
>>>>>>>>>>>>                                                                     or
>>>>>>>>>>>>                                                                     there
>>>>>>>>>>>>                                                                     is
>>>>>>>>>>>>                                                                     a
>>>>>>>>>>>>                                                                     class
>>>>>>>>>>>>                                                                     loading
>>>>>>>>>>>>                                                                     leak.
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>                                                                     I
>>>>>>>>>>>>                                                                     have
>>>>>>>>>>>>                                                                     2GB
>>>>>>>>>>>>                                                                     of
>>>>>>>>>>>>                                                                     metaspace
>>>>>>>>>>>>                                                                     configed
>>>>>>>>>>>>                                                                     taskmanager.memory.jvm-metaspace.size:
>>>>>>>>>>>>                                                                     2048m
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>                                                                     But
>>>>>>>>>>>>                                                                     the
>>>>>>>>>>>>                                                                     task
>>>>>>>>>>>>                                                                     nodes
>>>>>>>>>>>>                                                                     still
>>>>>>>>>>>>                                                                     fail.
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>
>>>>>>>>>>>>                                                                     >
>>>>>>>>>>>>                                                                     When
>>>>>>>>>>>>                                                                     looking
>>>>>>>>>>>>                                                                     at
>>>>>>>>>>>>                                                                     the
>>>>>>>>>>>>                                                                     UI
>>>>>>>>>>>>                                                                     metrics,
>>>>>>>>>>>>                                                                     the
>>>>>>>>>>>>                                                                     metaspace
>>>>>>>>>>>>                                                                     starts
>>>>>>>>>>>>                                                                     low.
>>>>>>>>>>>>                                                                     Now
>>>>>>>>>>>>                                                                     I
>>>>>>>>>>>>                                                                     see
>>>>>>>>>>>>                                                                     85%
>>>>>>>>>>>>                                                                     usage.
>>>>>>>>>>>>                                                                     It
>>>>>>>>>>>>                                                                     seems
>>>>>>>>>>>>                                                                     to
>>>>>>>>>>>>                                                                     be
>>>>>>>>>>>>                                                                     a
>>>>>>>>>>>>                                                                     class
>>>>>>>>>>>>                                                                     loading
>>>>>>>>>>>>                                                                     leak
>>>>>>>>>>>>                                                                     at
>>>>>>>>>>>>                                                                     this
>>>>>>>>>>>>                                                                     point,
>>>>>>>>>>>>                                                                     how
>>>>>>>>>>>>                                                                     can
>>>>>>>>>>>>                                                                     we
>>>>>>>>>>>>                                                                     debug
>>>>>>>>>>>>                                                                     this
>>>>>>>>>>>>                                                                     issue?
>>>>>>>>>>>>

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

Why do the JDBC jars need to be on the job manager node though?

On Mon, May 2, 2022 at 9:36 AM Chesnay Schepler <ch...@apache.org> wrote:

> yes.
> But if you can ensure that the driver isn't bundled by any user-jar you
> can also skip the pattern configuration step.
>
> The pattern looks correct formatting-wise; you could try whether
> com.microsoft.sqlserver.jdbc. is enough to solve the issue.
>
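A minimal sketch of how the pattern quoted above could be read, assuming the value is a semicolon-separated list of class-name prefixes and that a class counts as parent-first when its fully qualified name starts with one of them. `ParentFirstPatterns`, `parse` and `matches` are hypothetical names used for illustration only; they are not Flink API.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: prefix matching for a parent-first pattern list. */
final class ParentFirstPatterns {

    /** Splits a semicolon-separated pattern string into trimmed, non-empty prefixes. */
    static List<String> parse(String configValue) {
        List<String> prefixes = new ArrayList<>();
        for (String part : configValue.split(";")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                prefixes.add(trimmed);
            }
        }
        return prefixes;
    }

    /** Returns true if the class name starts with any of the prefixes. */
    static boolean matches(List<String> prefixes, String className) {
        for (String prefix : prefixes) {
            if (className.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> prefixes = parse("org.apache.ignite.;com.microsoft.sqlserver.jdbc.");
        // Driver class names of the two libraries mentioned in the thread.
        System.out.println(matches(prefixes, "com.microsoft.sqlserver.jdbc.SQLServerDriver"));
        System.out.println(matches(prefixes, "org.apache.ignite.IgniteJdbcThinDriver"));
        // A class outside both prefixes, for contrast.
        System.out.println(matches(prefixes, "org.postgresql.Driver"));
    }
}
```

Under this reading, the trailing dot in each entry keeps a prefix from also matching a sibling package whose name merely starts with the same characters.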
> On 02/05/2022 14:41, John Smith wrote:
>
> Oh, so I should copy the jars to the lib folder and
> set classloader.parent-first-patterns.additional:
> "org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the task
> managers and job managers?
>
> Also is my pattern correct?
> "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
>
> Just to be sure I'm running a standalone cluster using zookeeper. So I
> have 3 zookeepers, 3 job managers and 3 task managers.
>
>
> On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler <ch...@apache.org>
> wrote:
>
>> And you should make sure that it is set for both processes!
>>
>> On 02/05/2022 08:43, Chesnay Schepler wrote:
>>
>> The setting itself isn't taskmanager specific; it applies to both the
>> job- and taskmanager process.
>>
>> On 02/05/2022 05:29, John Smith wrote:
>>
>> Also just to be sure this is a Task Manager setting right?
>>
>> On Thu, Apr 28, 2022 at 11:13 AM John Smith <ja...@gmail.com>
>> wrote:
>>
>>> I assume you will take action on your side to track and fix the doc? :)
>>>
>>> On Thu, Apr 28, 2022 at 11:12 AM John Smith <ja...@gmail.com>
>>> wrote:
>>>
>>>> Ok so to summarize...
>>>>
>>>> - Build my job jar and have the JDBC driver as a compile only
>>>> dependency and copy the JDBC driver to flink lib folder.
>>>>
>>>> Or
>>>>
>>>> - Build my job jar and include the JDBC driver in the shadow jar, plus copy
>>>> the JDBC driver to the flink lib folder, plus make an entry in the config for
>>>> classloader.parent-first-patterns.additional
>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>
>>>>
>>>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ch...@apache.org>
>>>> wrote:
>>>>
>>>>> I think what I meant was "either add it to /lib, or [if it is already
>>>>> in /lib but also bundled in the jar] add it to the parent-first patterns."
>>>>>
>>>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>>>
>>>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>>>
>>>>> On 28/04/2022 15:49, John Smith wrote:
>>>>>
>>>>> You sure?
>>>>>
>>>>>    -
>>>>>
>>>>>    *JDBC*: JDBC drivers leak references outside the user code
>>>>>    classloader. To ensure that these classes are only loaded once you should
>>>>>    either add the driver jars to Flink’s lib/ folder, or add the
>>>>>    driver classes to the list of parent-first loaded class via
>>>>>    classloader.parent-first-patterns.additional
>>>>>    <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>    .
>>>>>
>>>>>    It says either or
>>>>>
>>>>>
>>>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ch...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> You're misinterpreting the docs.
>>>>>>
>>>>>> The parent/child-first classloading controls where Flink looks for a
>>>>>> class *first*, specifically whether we first load from /lib or the
>>>>>> user-jar.
>>>>>> It does not allow you to load something from the user-jar in the
>>>>>> parent classloader. That's just not how it works.
>>>>>>
>>>>>> It must be in /lib.
>>>>>>
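The point made above can be illustrated with a small child-first classloader. This is a sketch under assumptions, not Flink's actual implementation: `ChildFirstSketch` and its constructor arguments are hypothetical, and the parent-first list is treated as plain name prefixes. What it shows is that the pattern only changes the lookup order. A class that exists only in the user jar ends up defined by the user-code loader in both branches.

```java
import java.net.URL;
import java.net.URLClassLoader;

/** Illustrative child-first loader; the parent-first prefixes only change lookup order. */
final class ChildFirstSketch extends URLClassLoader {

    private final String[] parentFirstPrefixes;

    ChildFirstSketch(URL[] userJarUrls, ClassLoader parent, String... parentFirstPrefixes) {
        super(userJarUrls, parent);
        this.parentFirstPrefixes = parentFirstPrefixes.clone();
    }

    @Override
    protected Class<?> loadClass(String name, boolean resolve) throws ClassNotFoundException {
        synchronized (getClassLoadingLock(name)) {
            Class<?> loaded = findLoadedClass(name);
            if (loaded == null) {
                if (isParentFirst(name)) {
                    // Standard delegation: parent first, then this loader's own URLs.
                    loaded = super.loadClass(name, false);
                } else {
                    try {
                        // Child-first: look in the user jar before asking the parent.
                        loaded = findClass(name);
                    } catch (ClassNotFoundException notInUserJar) {
                        loaded = super.loadClass(name, false);
                    }
                }
            }
            if (resolve) {
                resolveClass(loaded);
            }
            return loaded;
        }
    }

    private boolean isParentFirst(String name) {
        for (String prefix : parentFirstPrefixes) {
            if (name.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        ClassLoader parent = ChildFirstSketch.class.getClassLoader();
        // No user-jar URLs here, so every class is ultimately resolved by the parent chain.
        try (ChildFirstSketch loader =
                new ChildFirstSketch(new URL[0], parent, "java.", "org.apache.ignite.")) {
            Class<?> c = loader.loadClass("java.util.ArrayList");
            ClassLoader definedBy = c.getClassLoader();
            System.out.println(definedBy == null ? "bootstrap" : definedBy.toString());
        }
    }
}
```

In this sketch, a driver class is only defined by the parent if the parent can actually find it, which is consistent with the statement that the jar must be in /lib.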
>>>>>> On 27/04/2022 04:59, John Smith wrote:
>>>>>>
>>>>>> Hi Chesnay as per the docs...
>>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>>>
>>>>>> You can either put the jars in task manager lib folder or use
>>>>>> classloader.parent-first-patterns.additional
>>>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>>
>>>>>> I prefer the latter; that way the dependency stays with the user-jar
>>>>>> and not on the task manager.
>>>>>>
>>>>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Ok so I should put the Apache ignite and my Microsoft drivers in the
>>>>>>> lib folders of my task managers?
>>>>>>>
>>>>>>> And then in my job jar only include them as compile time
>>>>>>> dependencies?
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <
>>>>>>> chesnay@apache.org> wrote:
>>>>>>>
>>>>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>>>>
>>>>>>>> You have correctly identified your alternatives.
>>>>>>>>
>>>>>>>> You must put the jdbc driver into /lib instead. Setting only the
>>>>>>>> parent-first pattern shouldn't affect anything.
>>>>>>>> That is only relevant if something is both in /lib and the
>>>>>>>> user-jar, telling Flink to prioritize what is in lib.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>>>>
>>>>>>>> So I put classloader.parent-first-patterns.additional:
>>>>>>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>>>>>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>>>>
>>>>>>>> Or it's too early to tell.
>>>>>>>>
>>>>>>>> Though now, the task managers are shutting down due to some
>>>>>>>> other failures.
>>>>>>>>
>>>>>>>> So maybe because tasks were failing and reloading often, the task
>>>>>>>> manager was running out of Metaspace. But now maybe it's just
>>>>>>>> cleanly shutting down.
>>>>>>>>
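One way to check the guess above is to watch class-loading counters alongside Metaspace usage across job restarts. The sketch below uses the standard `java.lang.management` beans. It reads the JVM it runs in, so it would have to run inside the task manager process (or be adapted to a remote JMX connection) to say anything about a task manager. `MetaspaceProbe` is a hypothetical name, and the pool name "Metaspace" is an assumption that holds on HotSpot JVMs.

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;
import java.util.Optional;

/** Illustrative probe for Metaspace usage and class-loading counters of the current JVM. */
final class MetaspaceProbe {

    /** Finds the memory pool named "Metaspace", if this JVM exposes one. */
    static Optional<MemoryUsage> metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            if ("Metaspace".equals(pool.getName())) {
                // getUsage() may return null if the pool is no longer valid.
                return Optional.ofNullable(pool.getUsage());
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        System.out.println("loaded now:     " + classes.getLoadedClassCount());
        System.out.println("loaded total:   " + classes.getTotalLoadedClassCount());
        System.out.println("unloaded total: " + classes.getUnloadedClassCount());
        // getMax() is -1 when no maximum is defined for the pool.
        metaspaceUsage().ifPresent(u ->
                System.out.println("metaspace used/max bytes: " + u.getUsed() + "/" + u.getMax()));
    }
}
```

If the loaded-class count keeps climbing after each restart while the unloaded count stays flat, that would point toward classes not being unloaded rather than toward an undersized Metaspace.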
>>>>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Or I can put in the config to treat org.apache.ignite. classes as
>>>>>>>>> parent-first?
>>>>>>>>>
>>>>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <
>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>
>>>>>>>>>> - On the Histogram, I got over 30 entries for:
>>>>>>>>>> ChildFirstClassLoader
>>>>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and
>>>>>>>>>> picked "Exclude all phantom/weak/soft references"
>>>>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>>>>>>>>> Driver
>>>>>>>>>>
>>>>>>>>>> So I'm guessing anything JDBC-based I should copy into the task
>>>>>>>>>> managers' lib folder, and in my jobs make the dependencies compile-only?
>>>>>>>>>>
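The reference path found above (presumably ending at `java.sql.DriverManager`) fits how JDBC registration works: `DriverManager` is a JDK class with a static registry, so a driver registered from a user jar stays referenced from outside the user-code classloader. The sketch below lists the registered drivers together with the classloader that defined each one. `RegisteredDrivers` is a hypothetical helper, and `DriverManager.getDrivers()` only returns drivers the calling class is allowed to see, so the output depends on where the code runs.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative helper: shows which classloader defined each registered JDBC driver. */
final class RegisteredDrivers {

    /** Describes each JDBC driver visible to the caller and its defining classloader. */
    static List<String> describe() {
        List<String> lines = new ArrayList<>();
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            ClassLoader loader = driver.getClass().getClassLoader();
            lines.add(driver.getClass().getName()
                    + " <- " + (loader == null ? "bootstrap" : loader.toString()));
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints nothing if no JDBC driver is on the classpath.
        describe().forEach(System.out::println);
    }
}
```

A driver whose defining classloader is a per-job user-code loader would be a candidate for the kind of leak described in this thread.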
>>>>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>>>>>>>> yaroslav@goldsky.io> wrote:
>>>>>>>>>>
>>>>>>>>>>> Also
>>>>>>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>>>>>>>>
>>>>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <
>>>>>>>>>>> chesnay@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> We have a very rough "guide" in the wiki (it's just the
>>>>>>>>>>>> specific steps I took to debug another leak):
>>>>>>>>>>>>
>>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>>>
>>>>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>
>>>>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the
>>>>>>>>>>>> dump file. Check whether there are too many loaded classes.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>>>>
>>>>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Hi, can anyone help with this? I never looked at a dump file
>>>>>>>>>>>> before.
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <
>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <
>>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Ok so if there's a leak, if I manually stop the job and
>>>>>>>>>>>>>> restart it from the UI multiple times, I won't see the issue
>>>>>>>>>>>>>> because the classes are unloaded correctly?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <
>>>>>>>>>>>>>> huweihua.ckl@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running. From the
>>>>>>>>>>>>>>> TaskManager's point of view, though, it doesn't make much difference.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Also if I manually cancel and restart the same job over and
>>>>>>>>>>>>>>> over is it the same as if flink was restarting a job due to failure?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I.e., when I click "Cancel Job" on the UI, is the job
>>>>>>>>>>>>>>> completely unloaded, vs. when the job scheduler restarts a job for
>>>>>>>>>>>>>>> whatever reason?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Like this I'll stop and restart the job a few times or maybe
>>>>>>>>>>>>>>> I can trick my job to fail and have the scheduler restart it. Ok let me
>>>>>>>>>>>>>>> think about this...
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be
>>>>>>>>>>>>>>>> able to see the similar dump?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I think running the same job in dev should be reproducible,
>>>>>>>>>>>>>>>> maybe you can have a try.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>  If not I would have to wait at a low volume time to do it
>>>>>>>>>>>>>>>> on production. Also if I recall the dump is as big as the JVM memory right
>>>>>>>>>>>>>>>> so if I have 10GB configed for the JVM the dump will be 10GB file?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on
>>>>>>>>>>>>>>>> the size of the dump. You can use "jmap -dump:live" to dump only the
>>>>>>>>>>>>>>>> reachable objects, which takes only a brief pause
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I have 3 task managers (see config below). There is total
>>>>>>>>>>>>>>>> of 10 jobs with 25 slots being used.
>>>>>>>>>>>>>>>> The jobs are 100% ETL I.e; They load Json, transform it and
>>>>>>>>>>>>>>>> push it to JDBC, only 1 job of the 10 is pushing to Apache Ignite cluster.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> FOR JMAP. I know that it will pause the task manager. So if
>>>>>>>>>>>>>>>> I run the same jobs in my dev env will I still be able to see the similar
>>>>>>>>>>>>>>>> dump? I assume so. If not I would have to wait at a low volume time to do
>>>>>>>>>>>>>>>> it on production. Also if I recall the dump is as big as the JVM memory
>>>>>>>>>>>>>>>> right so if I have 10GB configed for the JVM the dump will be 10GB file?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>>>>> high-availability.storageDir:
>>>>>>>>>>>>>>>> file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>>>>> state.checkpoints.dir:
>>>>>>>>>>>>>>>> file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>>>>> state.savepoints.dir:
>>>>>>>>>>>>>>>> file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a flink
>>>>>>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use
>>>>>>>>>>>>>>>>> tools such as MAT to analyze whether there are abnormal classes and
>>>>>>>>>>>>>>>>> classloaders
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > Hi running 1.14.4
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > My task managers still fail with
>>>>>>>>>>>>>>>>> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
>>>>>>>>>>>>>>>>> has occurred. This can mean two things: either the job requires a larger
>>>>>>>>>>>>>>>>> size of JVM metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > I have 2GB of metaspace configed
>>>>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts
>>>>>>>>>>>>>>>>> low. Now I see 85% usage. It seems to be a class loading leak at this
>>>>>>>>>>>>>>>>> point, how can we debug this issue?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>>
>>>>>
>>
>>
>

Re: How to debug Metaspace exception?

Posted by Chesnay Schepler <ch...@apache.org>.

Yes.
But if you can ensure that the driver isn't bundled in any user-jar, you
can also skip the pattern configuration step.

The pattern looks correct formatting-wise; you could try whether 
com.microsoft.sqlserver.jdbc. is enough to solve the issue.
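
For reference, a minimal sketch of the configuration discussed in this thread. It assumes the Ignite and SQL Server JDBC driver jars have already been copied into Flink's lib/ folder on every JobManager and TaskManager. The entry should only matter if a user jar still bundles the same driver classes; the value is the one quoted in the message below, and the comments are illustrative:

```yaml
# flink-conf.yaml (same entry on the JobManager and TaskManager hosts)
# Semicolon-separated class-name prefixes that Flink resolves through the
# parent classloader (i.e. from lib/) before looking in the user jar.
classloader.parent-first-patterns.additional: "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
```

As with other flink-conf.yaml entries, the processes would need a restart for the change to take effect.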

On 02/05/2022 14:41, John Smith wrote:
> Oh, so I should copy the jars to the lib folder and 
> set classloader.parent-first-patterns.additional: 
> "org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the task 
> managers and job managers?
>
> Also is my pattern correct? 
> "org.apache.ignite.;com.microsoft.sqlserver.jdbc."
>
> Just to be sure I'm running a standalone cluster using zookeeper. So I 
> have 3 zookeepers, 3 job managers and 3 task managers.
>
>
> On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler <ch...@apache.org> 
> wrote:
>
>     And you should make sure that it is set for both processes!
>
>     On 02/05/2022 08:43, Chesnay Schepler wrote:
>>     The setting itself isn't taskmanager specific; it applies to both
>>     the job- and taskmanager process.
>>
>>     On 02/05/2022 05:29, John Smith wrote:
>>>     Also just to be sure this is a Task Manager setting right?
>>>
>>>     On Thu, Apr 28, 2022 at 11:13 AM John Smith
>>>     <ja...@gmail.com> wrote:
>>>
>>>         I assume you will take action on your side to track and fix
>>>         the doc? :)
>>>
>>>         On Thu, Apr 28, 2022 at 11:12 AM John Smith
>>>         <ja...@gmail.com> wrote:
>>>
>>>             Ok so to summarize...
>>>
>>>             - Build my job jar and have the JDBC driver as a compile
>>>             only dependency and copy the JDBC driver to flink lib
>>>             folder.
>>>
>>>             Or
>>>
>>>             - Build my job jar and include the JDBC driver in the
>>>             shadow, plus copy the JDBC driver into the flink lib
>>>             folder, plus make an entry in the config for
>>>             |classloader.parent-first-patterns.additional|
>>>             <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>
>>>
>>>             On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
>>>             <ch...@apache.org> wrote:
>>>
>>>                 I think what I meant was "either add it to /lib, or
>>>                 [if it is already in /lib but also bundled in the
>>>                 jar] add it to the parent-first patterns."
>>>
>>>                 On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>>                 Pretty sure, even though I seemingly documented it
>>>>                 incorrectly :)
>>>>
>>>>                 On 28/04/2022 15:49, John Smith wrote:
>>>>>                 You sure?
>>>>>
>>>>>                  *
>>>>>
>>>>>                     /JDBC/: JDBC drivers leak references outside
>>>>>                     the user code classloader. To ensure that
>>>>>                     these classes are only loaded once you should
>>>>>                     either add the driver jars to Flink’s
>>>>>                     |lib/| folder, or add the driver classes to
>>>>>                     the list of parent-first loaded class via
>>>>>                     |classloader.parent-first-patterns.additional|
>>>>>                     <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>>>>
>>>>>                     It says either or
>>>>>
>>>>>
>>>>>                 On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler
>>>>>                 <ch...@apache.org> wrote:
>>>>>
>>>>>                     You're misinterpreting the docs.
>>>>>
>>>>>                     The parent/child-first classloading controls
>>>>>                     where Flink looks for a class /first/,
>>>>>                     specifically whether we first load from /lib
>>>>>                     or the user-jar.
>>>>>                     It does not allow you to load something from
>>>>>                     the user-jar in the parent classloader. That's
>>>>>                     just not how it works.
>>>>>
>>>>>                     It must be in /lib.
>>>>>
>>>>>                     On 27/04/2022 04:59, John Smith wrote:
>>>>>>                     Hi Chesnay as per the docs...
>>>>>>                     https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>>>
>>>>>>                     You can either put the jars in task manager
>>>>>>                     lib folder or use
>>>>>>                     |classloader.parent-first-patterns.additional|
>>>>>>                     <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>>
>>>>>>                     I prefer the latter like this: the
>>>>>>                     dependency stays with the user-jar and not on
>>>>>>                     the task manager.
>>>>>>
>>>>>>                     On Tue, Apr 26, 2022 at 9:52 PM John Smith
>>>>>>                     <ja...@gmail.com> wrote:
>>>>>>
>>>>>>                         Ok so I should put the Apache ignite and
>>>>>>                         my Microsoft drivers in the lib folders
>>>>>>                         of my task managers?
>>>>>>
>>>>>>                         And then in my job jar only include them
>>>>>>                         as compile time dependencies?
>>>>>>
>>>>>>
>>>>>>                         On Tue, Apr 26, 2022 at 10:42 AM Chesnay
>>>>>>                         Schepler <ch...@apache.org> wrote:
>>>>>>
>>>>>>                             JDBC drivers are well-known for
>>>>>>                             leaking classloaders unfortunately.
>>>>>>
>>>>>>                             You have correctly identified your
>>>>>>                             alternatives.
>>>>>>
>>>>>>                             You must put the jdbc driver into
>>>>>>                             /lib instead. Setting only the
>>>>>>                             parent-first pattern shouldn't affect
>>>>>>                             anything.
>>>>>>                             That is only relevant if something is
>>>>>>                             in both /lib and the user-jar,
>>>>>>                             telling Flink to prioritize what is
>>>>>>                             in lib.
>>>>>>
>>>>>>
>>>>>>
>>>>>>                             On 26/04/2022 15:35, John Smith wrote:
>>>>>>>                             So I
>>>>>>>                             put classloader.parent-first-patterns.additional:
>>>>>>>                             "org.apache.ignite." in the task
>>>>>>>                             config and so far I don't think I'm
>>>>>>>                             getting "java.lang.OutOfMemoryError:
>>>>>>>                             Metaspace" any more.
>>>>>>>
>>>>>>>                             Or it's too early to tell.
>>>>>>>
>>>>>>>                             Though now, the task managers are
>>>>>>>                             shutting down due to some
>>>>>>>                             other failures.
>>>>>>>
>>>>>>>                             So maybe because tasks were failing
>>>>>>>                             and reloading often the task manager
>>>>>>>                             was running out of Metaspace. But now
>>>>>>>                             maybe it's just cleanly shutting down.
>>>>>>>
>>>>>>>                             On Wed, Apr 20, 2022 at 11:35 AM
>>>>>>>                             John Smith <ja...@gmail.com>
>>>>>>>                             wrote:
>>>>>>>
>>>>>>>                                 Or I can put in the config to
>>>>>>>                                 treat org.apache.ignite. classes
>>>>>>>                                 as first class?
>>>>>>>
>>>>>>>                                 On Tue, Apr 19, 2022 at 10:18 PM
>>>>>>>                                 John Smith
>>>>>>>                                 <ja...@gmail.com> wrote:
>>>>>>>
>>>>>>>                                     Ok, so I loaded the dump
>>>>>>>                                     into Eclipse MAT and
>>>>>>>                                     followed:
>>>>>>>                                     https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>
>>>>>>>                                     - On the Histogram, I got
>>>>>>>                                     over 30 entries for:
>>>>>>>                                     ChildFirstClassLoader
>>>>>>>                                     - Then I clicked on one of
>>>>>>>                                     them "Merge Shortest
>>>>>>>                                     Path..." and picked "Exclude
>>>>>>>                                     all phantom/weak/soft
>>>>>>>                                     references"
>>>>>>>                                     - Which then gave me:
>>>>>>>                                     SqlDriverManager > Apache
>>>>>>>                                     Ignite JdbcThin Driver
>>>>>>>
>>>>>>>                                     So I'm guessing anything
>>>>>>>                                     JDBC based I should copy
>>>>>>>                                     into the task manager lib
>>>>>>>                                     folder and in my jobs make the
>>>>>>>                                     dependencies compile only?
>>>>>>>
>>>>>>>                                     On Tue, Apr 19, 2022 at
>>>>>>>                                     12:18 PM Yaroslav Tkachenko
>>>>>>>                                     <ya...@goldsky.io> wrote:
>>>>>>>
>>>>>>>                                         Also
>>>>>>>                                         https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>                                         might be helpful (has a
>>>>>>>                                         section on profiling, as
>>>>>>>                                         well as classloading).
>>>>>>>
>>>>>>>                                         On Tue, Apr 19, 2022 at
>>>>>>>                                         4:35 AM Chesnay Schepler
>>>>>>>                                         <ch...@apache.org> wrote:
>>>>>>>
>>>>>>>                                             We have a very rough
>>>>>>>                                             "guide" in the wiki
>>>>>>>                                             (it's just the
>>>>>>>                                             specific steps I
>>>>>>>                                             took to debug
>>>>>>>                                             another leak):
>>>>>>>                                             https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>
>>>>>>>                                             On 19/04/2022 12:01,
>>>>>>>                                             huweihua wrote:
>>>>>>>>                                             Hi, John
>>>>>>>>
>>>>>>>>                                             Sorry for the late
>>>>>>>>                                             reply. You can use
>>>>>>>>                                             MAT[1] to analyze
>>>>>>>>                                             the dump file.
>>>>>>>>                                             Check whether too
>>>>>>>>                                             many classes are
>>>>>>>>                                             loaded.
>>>>>>>>
>>>>>>>>                                             [1]
>>>>>>>>                                             https://www.eclipse.org/mat/
>>>>>>>>
>>>>>>>>>                                             On Apr 18, 2022, at
>>>>>>>>>                                             9:55 PM, John Smith
>>>>>>>>>                                             <ja...@gmail.com>
>>>>>>>>>                                             wrote:
>>>>>>>>>
>>>>>>>>>                                             Hi, can anyone
>>>>>>>>>                                             help with this? I
>>>>>>>>>                                             never looked at a
>>>>>>>>>                                             dump file before.
>>>>>>>>>
>>>>>>>>>                                             On Thu, Apr 14,
>>>>>>>>>                                             2022 at 11:59 AM
>>>>>>>>>                                             John Smith
>>>>>>>>>                                             <ja...@gmail.com>
>>>>>>>>>                                             wrote:
>>>>>>>>>
>>>>>>>>>                                                 Hi, so I have
>>>>>>>>>                                                 a dump file.
>>>>>>>>>                                                 What do I look
>>>>>>>>>                                                 for?
>>>>>>>>>
>>>>>>>>>                                                 On Thu, Mar
>>>>>>>>>                                                 31, 2022 at
>>>>>>>>>                                                 3:28 PM John
>>>>>>>>>                                                 Smith
>>>>>>>>>                                                 <ja...@gmail.com>
>>>>>>>>>                                                 wrote:
>>>>>>>>>
>>>>>>>>>                                                     Ok so if there's a leak, if I manually stop the job and
>>>>>>>>>                                                     restart it from the UI multiple times, I won't see the
>>>>>>>>>                                                     issue because the classes are unloaded correctly?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                                                     On Thu, Mar 31, 2022 at 9:20 AM huweihua
>>>>>>>>>                                                     <hu...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                                                         The difference is that manually canceling the job stops
>>>>>>>>>                                                         the JobMaster, but automatic failover keeps the JobMaster
>>>>>>>>>                                                         running. But looking on TaskManager, it doesn't make much
>>>>>>>>>                                                         difference
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>                                                         On Mar 31, 2022, at 4:01 AM, John Smith
>>>>>>>>>>                                                         <ja...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>                                                         Also if I manually cancel and restart the same job over
>>>>>>>>>>                                                         and over is it the same as if flink was restarting a job
>>>>>>>>>>                                                         due to failure?
>>>>>>>>>>
>>>>>>>>>>                                                         I.e: When I click "Cancel Job" on the UI is the job
>>>>>>>>>>                                                         completely unloaded vs when the job scheduler restarts a
>>>>>>>>>>                                                         job for whatever reason?
>>>>>>>>>>
>>>>>>>>>>                                                         Like this I'll stop and restart the job a few times or
>>>>>>>>>>                                                         maybe I can trick my job to fail and have the scheduler
>>>>>>>>>>                                                         restart it. Ok let me think about this...
>>>>>>>>>>
>>>>>>>>>>                                                         On Wed, Mar 30, 2022 at 10:24 AM 胡伟华
>>>>>>>>>>                                                         <hu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>>                                                             So if I run the same jobs in my dev env will I still be
>>>>>>>>>>>                                                             able to see the similar dump?
>>>>>>>>>>>
>>>>>>>>>>                                                             I think running the same job in dev should be
>>>>>>>>>>                                                             reproducible, maybe you can have a try.
>>>>>>>>>>
>>>>>>>>>>> If not, I would have to wait for a low-volume time to do it on
>>>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM memory,
>>>>>>>>>>> right? So if I have 10GB configured for the JVM, the dump will be a
>>>>>>>>>>> 10GB file?
>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the
>>>>>>>>>> size of the dump. You can use "jmap -dump:live" to dump only the
>>>>>>>>>> reachable objects; this will take only a brief pause.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I have 3 task managers (see config below). There is a total of 10 jobs
>>>>>>>>>>> with 25 slots being used. The jobs are 100% ETL, i.e. they load JSON,
>>>>>>>>>>> transform it and push it to JDBC; only 1 job of the 10 is pushing to
>>>>>>>>>>> an Apache Ignite cluster.
>>>>>>>>>>>
>>>>>>>>>>> For jmap: I know that it will pause the task manager. So if I run the
>>>>>>>>>>> same jobs in my dev env, will I still be able to see a similar dump? I
>>>>>>>>>>> assume so. If not, I would have to wait for a low-volume time to do it
>>>>>>>>>>> on production. Also, if I recall, the dump is as big as the JVM memory,
>>>>>>>>>>> right? So if I have 10GB configured for the JVM, the dump will be a
>>>>>>>>>>> 10GB file?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>
>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>
>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>
>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>
>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>
>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>     Hi, John
>>>>>>>>>>>
>>>>>>>>>>>     Could you tell us your application scenario? Is it a Flink session
>>>>>>>>>>>     cluster with a lot of jobs?
>>>>>>>>>>>
>>>>>>>>>>>     Maybe you can try to dump the memory with jmap and use tools such
>>>>>>>>>>>     as MAT to analyze whether there are abnormal classes and
>>>>>>>>>>>     classloaders.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>     > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>     >
>>>>>>>>>>>     > Hi, running 1.14.4
>>>>>>>>>>>     >
>>>>>>>>>>>     > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>>>>>     > Metaspace. The metaspace out-of-memory error has occurred. This
>>>>>>>>>>>     > can mean two things: either the job requires a larger size of
>>>>>>>>>>>     > JVM metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>     >
>>>>>>>>>>>     > I have 2GB of metaspace configured:
>>>>>>>>>>>     > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>     >
>>>>>>>>>>>     > But the task nodes still fail.
>>>>>>>>>>>     >
>>>>>>>>>>>     > When looking at the UI metrics, the metaspace starts low. Now I
>>>>>>>>>>>     > see 85% usage. It seems to be a class loading leak at this
>>>>>>>>>>>     > point; how can we debug this issue?
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

Oh, so I should copy the jars to the lib folder and
set classloader.parent-first-patterns.additional:
"org.apache.ignite.;com.microsoft.sqlserver.jdbc." to both the task
managers and job managers?

Also is my pattern correct?
"org.apache.ignite.;com.microsoft.sqlserver.jdbc."

Just to be sure: I'm running a standalone cluster using ZooKeeper, so I have
3 ZooKeeper nodes, 3 job managers and 3 task managers.
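
On the pattern question, here is a minimal sketch of how such a value could be checked, assuming (as the Flink configuration docs describe) that the setting is a semicolon-separated list and that each pattern is a simple prefix compared against the fully qualified class name. This is not Flink's own code: the class and method names are hypothetical and com.example.MyEtlJob is a made-up user class. As noted elsewhere in this thread, the pattern only matters when the driver jar is also in lib/.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

/**
 * Illustrative only: mimics prefix matching of parent-first patterns.
 * Not Flink's implementation; all names in this class are hypothetical.
 */
final class ParentFirstPatternCheck {

    /** Splits a semicolon-separated pattern string and drops blank entries. */
    static List<String> parsePatterns(String configValue) {
        return Arrays.stream(configValue.split(";"))
                .map(String::trim)
                .filter(p -> !p.isEmpty())
                .collect(Collectors.toList());
    }

    /** Returns true if the fully qualified class name starts with any pattern. */
    static boolean isParentFirst(String className, List<String> patterns) {
        return patterns.stream().anyMatch(className::startsWith);
    }

    public static void main(String[] args) {
        List<String> patterns =
                parsePatterns("org.apache.ignite.;com.microsoft.sqlserver.jdbc.");
        String[] classNames = {
            "org.apache.ignite.IgniteJdbcThinDriver",
            "com.microsoft.sqlserver.jdbc.SQLServerDriver",
            "com.example.MyEtlJob" // hypothetical user class, should not match
        };
        for (String name : classNames) {
            System.out.println(name + " -> parent-first: " + isParentFirst(name, patterns));
        }
    }
}
```

Under that assumption both driver classes match the value above, while classes from the job's own packages would still be loaded child-first.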


On Mon, May 2, 2022 at 2:57 AM Chesnay Schepler <ch...@apache.org> wrote:

> And you should make sure that it is set for both processes!
>
> On 02/05/2022 08:43, Chesnay Schepler wrote:
>
> The setting itself isn't taskmanager specific; it applies to both the job-
> and taskmanager process.
>
> On 02/05/2022 05:29, John Smith wrote:
>
> Also just to be sure this is a Task Manager setting right?
>
> On Thu, Apr 28, 2022 at 11:13 AM John Smith <ja...@gmail.com>
> wrote:
>
>> I assume you will take action on your side to track and fix the doc? :)
>>
>> On Thu, Apr 28, 2022 at 11:12 AM John Smith <ja...@gmail.com>
>> wrote:
>>
>>> Ok so to summarize...
>>>
>>> - Build my job jar and have the JDBC driver as a compile only
>>> dependency and copy the JDBC driver to flink lib folder.
>>>
>>> Or
>>>
>>> - Build my job jar and include JDBC driver in the shadow, plus copy the
>>> JDBC driver in the flink lib folder, plus  make an entry in config for
>>> classloader.parent-first-patterns-additional
>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>
>>>
>>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ch...@apache.org>
>>> wrote:
>>>
>>>> I think what I meant was "either add it to /lib, or [if it is already
>>>> in /lib but also bundled in the jar] add it to the parent-first patterns."
>>>>
>>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>>
>>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>>
>>>> On 28/04/2022 15:49, John Smith wrote:
>>>>
>>>> You sure?
>>>>
>>>>    -
>>>>
>>>>    *JDBC*: JDBC drivers leak references outside the user code
>>>>    classloader. To ensure that these classes are only loaded once you should
>>>>    either add the driver jars to Flink’s lib/ folder, or add the
>>>>    driver classes to the list of parent-first loaded class via
>>>>    classloader.parent-first-patterns-additional
>>>>    <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>    .
>>>>
>>>>    It says either or
>>>>
>>>>
>>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ch...@apache.org>
>>>> wrote:
>>>>
>>>>> You're misinterpreting the docs.
>>>>>
>>>>> The parent/child-first classloading controls where Flink looks for a
>>>>> class *first*, specifically whether we first load from /lib or the
>>>>> user-jar.
>>>>> It does not allow you to load something from the user-jar in the
>>>>> parent classloader. That's just not how it works.
>>>>>
>>>>> It must be in /lib.
>>>>>
>>>>> On 27/04/2022 04:59, John Smith wrote:
>>>>>
>>>>> Hi Chesnay as per the docs...
>>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>>
>>>>> You can either put the jars in task manager lib folder or use
>>>>> classloader.parent-first-patterns-additional
>>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>
>>>>> I prefer the latter like this: the dependency stays with the user-jar
>>>>> and not on the task manager.
>>>>>
>>>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Ok so I should put the Apache ignite and my Microsoft drivers in the
>>>>>> lib folders of my task managers?
>>>>>>
>>>>>> And then in my job jar only include them as compile time
>>>>>> dependencies?
>>>>>>
>>>>>>
>>>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ch...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>>>
>>>>>>> You have correctly identified your alternatives.
>>>>>>>
>>>>>>> You must put the jdbc driver into /lib instead. Setting only the
>>>>>>> parent-first pattern shouldn't affect anything.
>>>>>>> That is only relevant if something is both in /lib and in the
>>>>>>> user-jar, telling Flink to prioritize what is in lib.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>>>
>>>>>>> So I put classloader.parent-first-patterns.additional:
>>>>>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>>>>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>>>
>>>>>>> Or it's too early to tell.
>>>>>>>
>>>>>>> Though now, the task managers are shutting down due to some
>>>>>>> other failures.
>>>>>>>
>>>>>>> So maybe because tasks were failing and reloading often the task
>>>>>>> manager was running out of Metaspace. But now maybe it's just
>>>>>>> cleanly shutting down.
>>>>>>>
>>>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Or I can put in the config to treat org.apache.ignite. classes as
>>>>>>>> first class?
>>>>>>>>
>>>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ok, so I loaded the dump into Eclipse Mat and followed:
>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>
>>>>>>>>> - On the Histogram, I got over 30 entries for:
>>>>>>>>> ChildFirstClassLoader
>>>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and
>>>>>>>>> picked "Exclude all phantom/weak/soft references"
>>>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>>>>>>>> Driver
>>>>>>>>>
>>>>>>>>> So I'm guessing it's anything JDBC based. Should I copy those jars into the
>>>>>>>>> task manager lib folder and make them compile-only dependencies in my jobs?
>>>>>>>>>
>>>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>>>>>>> yaroslav@goldsky.io> wrote:
>>>>>>>>>
>>>>>>>>>> Also
>>>>>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>>>>>>>
>>>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <
>>>>>>>>>> chesnay@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> We have a very rough "guide" in the wiki (it's just the specific
>>>>>>>>>>> steps I took to debug another leak):
>>>>>>>>>>>
>>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>>
>>>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi, John
>>>>>>>>>>>
>>>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>>>>>>>>>>> file. Check whether there are too many loaded classes.
>>>>>>>>>>>
>>>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>>>
>>>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi, can anyone help with this? I never looked at a dump file
>>>>>>>>>>> before.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <
>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <
>>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Ok so if there's a leak, if I manually stop the job and
>>>>>>>>>>>>> restart it from the UI multiple times, I won't see the issue
>>>>>>>>>>>>> because the classes are unloaded correctly?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <
>>>>>>>>>>>>> huweihua.ckl@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running. But looking
>>>>>>>>>>>>>> on TaskManager, it doesn't make much difference
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Also if I manually cancel and restart the same job over and
>>>>>>>>>>>>>> over is it the same as if flink was restarting a job due to failure?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I.e: When I click "Cancel Job" on the UI is the job
>>>>>>>>>>>>>> completely unloaded vs when the job scheduler restarts a job for
>>>>>>>>>>>>>> whatever reason?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Like this I'll stop and restart the job a few times or maybe
>>>>>>>>>>>>>> I can trick my job to fail and have the scheduler restart it. Ok let me
>>>>>>>>>>>>>> think about this...
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able
>>>>>>>>>>>>>>> to see the similar dump?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I think running the same job in dev should be reproducible,
>>>>>>>>>>>>>>> maybe you can have a try.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>  If not I would have to wait at a low volume time to do it
>>>>>>>>>>>>>>> on production. Also if I recall the dump is as big as the JVM memory right
>>>>>>>>>>>>>>> so if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on
>>>>>>>>>>>>>>> the size of the dump. You can use "jmap -dump:live" to dump only the reachable
>>>>>>>>>>>>>>> objects, which will take a brief pause.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I have 3 task managers (see config below). There is total of
>>>>>>>>>>>>>>> 10 jobs with 25 slots being used.
>>>>>>>>>>>>>>> The jobs are 100% ETL I.e; They load Json, transform it and
>>>>>>>>>>>>>>> push it to JDBC, only 1 job of the 10 is pushing to Apache Ignite cluster.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> FOR JMAP. I know that it will pause the task manager. So if
>>>>>>>>>>>>>>> I run the same jobs in my dev env will I still be able to see the similar
>>>>>>>>>>>>>>> dump? I assume so. If not I would have to wait at a low volume time to do
>>>>>>>>>>>>>>> it on production. Also if I recall the dump is as big as the JVM memory
>>>>>>>>>>>>>>> right so if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>>>> high-availability.storageDir:
>>>>>>>>>>>>>>> file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>>>> state.checkpoints.dir:
>>>>>>>>>>>>>>> file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>>>> state.savepoints.dir:
>>>>>>>>>>>>>>> file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use
>>>>>>>>>>>>>>>> tools such as MAT to analyze whether there are abnormal classes and
>>>>>>>>>>>>>>>> classloaders
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Hi running 1.14.4
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > My task manager still fails with
>>>>>>>>>>>>>>>> java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
>>>>>>>>>>>>>>>> has occurred. This can mean two things: either the job requires a larger
>>>>>>>>>>>>>>>> size of JVM metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > I have 2GB of metaspace configed
>>>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low.
>>>>>>>>>>>>>>>> Now I see 85% usage. It seems to be a class loading leak at this point, how
>>>>>>>>>>>>>>>> can we debug this issue?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>
>>>>>
>>>>
>>>>
>
>

Re: How to debug Metaspace exception?

Posted by Chesnay Schepler <ch...@apache.org>.

And you should make sure that it is set for both processes!

On 02/05/2022 08:43, Chesnay Schepler wrote:
> The setting itself isn't taskmanager specific; it applies to both the 
> job- and taskmanager process.
>
> On 02/05/2022 05:29, John Smith wrote:
>> Also just to be sure this is a Task Manager setting right?
>>
>> On Thu, Apr 28, 2022 at 11:13 AM John Smith <ja...@gmail.com> 
>> wrote:
>>
>>     I assume you will take action on your side to track and fix the
>>     doc? :)
>>
>>     On Thu, Apr 28, 2022 at 11:12 AM John Smith
>>     <ja...@gmail.com> wrote:
>>
>>         Ok so to summarize...
>>
>>         - Build my job jar and have the JDBC driver as a compile only
>>         dependency and copy the JDBC driver to flink lib folder.
>>
>>         Or
>>
>>         - Build my job jar and include JDBC driver in the shadow,
>>         plus copy the JDBC driver in the flink lib folder, plus  make
>>         an entry in config for
>>         |classloader.parent-first-patterns-additional|
>>         <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>
>>
>>         On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
>>         <ch...@apache.org> wrote:
>>
>>             I think what I meant was "either add it to /lib, or [if
>>             it is already in /lib but also bundled in the jar] add it
>>             to the parent-first patterns."
>>
>>             On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>             Pretty sure, even though I seemingly documented it
>>>             incorrectly :)
>>>
>>>             On 28/04/2022 15:49, John Smith wrote:
>>>>             You sure?
>>>>
>>>>              *
>>>>
>>>>                 /JDBC/: JDBC drivers leak references outside the
>>>>                 user code classloader. To ensure that these classes
>>>>                 are only loaded once you should either add the
>>>>                 driver jars to Flink’s |lib/| folder, or add the
>>>>                 driver classes to the list of parent-first loaded
>>>>                 class via
>>>>                 |classloader.parent-first-patterns-additional|
>>>>                 <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>>>
>>>>                 It says either or
>>>>
>>>>
>>>>             On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler
>>>>             <ch...@apache.org> wrote:
>>>>
>>>>                 You're misinterpreting the docs.
>>>>
>>>>                 The parent/child-first classloading controls where
>>>>                 Flink looks for a class /first/, specifically
>>>>                 whether we first load from /lib or the user-jar.
>>>>                 It does not allow you to load something from the
>>>>                 user-jar in the parent classloader. That's just not
>>>>                 how it works.
>>>>
>>>>                 It must be in /lib.
>>>>
>>>>                 On 27/04/2022 04:59, John Smith wrote:
>>>>>                 Hi Chesnay as per the docs...
>>>>>                 https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>>
>>>>>                 You can either put the jars in task manager lib
>>>>>                 folder or use
>>>>>                 |classloader.parent-first-patterns-additional|
>>>>>                 <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>>
>>>>>                 I prefer the latter like this: the
>>>>>                 dependency stays with the user-jar and not on the
>>>>>                 task manager.
>>>>>
>>>>>                 On Tue, Apr 26, 2022 at 9:52 PM John Smith
>>>>>                 <ja...@gmail.com> wrote:
>>>>>
>>>>>                     Ok so I should put the Apache ignite and my
>>>>>                     Microsoft drivers in the lib folders of my
>>>>>                     task managers?
>>>>>
>>>>>                     And then in my job jar only include them as
>>>>>                     compile time dependencies?
>>>>>
>>>>>
>>>>>                     On Tue, Apr 26, 2022 at 10:42 AM Chesnay
>>>>>                     Schepler <ch...@apache.org> wrote:
>>>>>
>>>>>                         JDBC drivers are well-known for leaking
>>>>>                         classloaders unfortunately.
>>>>>
>>>>>                         You have correctly identified your
>>>>>                         alternatives.
>>>>>
>>>>>                         You must put the jdbc driver into /lib
>>>>>                         instead. Setting only the parent-first
>>>>>                         pattern shouldn't affect anything.
>>>>>                         That is only relevant if something is
>>>>>                         both in /lib and in the user-jar, telling
>>>>>                         Flink to prioritize what is in lib.
>>>>>
>>>>>
>>>>>
>>>>>                         On 26/04/2022 15:35, John Smith wrote:
>>>>>>                         So I
>>>>>>                         put classloader.parent-first-patterns.additional:
>>>>>>                         "org.apache.ignite." in the task config
>>>>>>                         and so far I don't think I'm getting
>>>>>>                         "java.lang.OutOfMemoryError: Metaspace"
>>>>>>                         any more.
>>>>>>
>>>>>>                         Or it's too early to tell.
>>>>>>
>>>>>>                         Though now, the task managers are
>>>>>>                         shutting down due to some other failures.
>>>>>>
>>>>>>                         So maybe because tasks were failing and
>>>>>>                         reloading often the task manager was
>>>>>>                         running out of Metaspace. But now maybe
>>>>>>                         it's just cleanly shutting down.
>>>>>>
>>>>>>                         On Wed, Apr 20, 2022 at 11:35 AM John
>>>>>>                         Smith <ja...@gmail.com> wrote:
>>>>>>
>>>>>>                             Or I can put in the config to treat
>>>>>>                             org.apache.ignite. classes as first
>>>>>>                             class?
>>>>>>
>>>>>>                             On Tue, Apr 19, 2022 at 10:18 PM John
>>>>>>                             Smith <ja...@gmail.com> wrote:
>>>>>>
>>>>>>                                 Ok, so I loaded the dump into
>>>>>>                                 Eclipse Mat and followed:
>>>>>>                                 https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>
>>>>>>                                 - On the Histogram, I got over 30
>>>>>>                                 entries for: ChildFirstClassLoader
>>>>>>                                 - Then I clicked on one of them
>>>>>>                                 "Merge Shortest Path..." and
>>>>>>                                 picked "Exclude all
>>>>>>                                 phantom/weak/soft references"
>>>>>>                                 - Which then gave me:
>>>>>>                                 SqlDriverManager > Apache Ignite
>>>>>>                                 JdbcThin Driver
>>>>>>
>>>>>>                                 So I'm guessing it's anything JDBC
>>>>>>                                 based. Should I copy those jars into
>>>>>>                                 the task manager lib folder and make
>>>>>>                                 them compile-only dependencies in
>>>>>>                                 my jobs?
>>>>>>
>>>>>>                                 On Tue, Apr 19, 2022 at 12:18 PM
>>>>>>                                 Yaroslav Tkachenko
>>>>>>                                 <ya...@goldsky.io> wrote:
>>>>>>
>>>>>>                                     Also
>>>>>>                                     https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>                                     might be helpful (has a
>>>>>>                                     section on profiling, as well
>>>>>>                                     as classloading).
>>>>>>
>>>>>>                                     On Tue, Apr 19, 2022 at 4:35
>>>>>>                                     AM Chesnay Schepler
>>>>>>                                     <ch...@apache.org> wrote:
>>>>>>
>>>>>>                                         We have a very rough
>>>>>>                                         "guide" in the wiki (it's
>>>>>>                                         just the specific steps I
>>>>>>                                         took to debug another leak):
>>>>>>                                         https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>
>>>>>>                                         On 19/04/2022 12:01,
>>>>>>                                         huweihua wrote:
>>>>>>>                                         Hi, John
>>>>>>>
>>>>>>>                                         Sorry for the late
>>>>>>>                                         reply. You can use
>>>>>>>                                         MAT[1] to analyze the
>>>>>>>                                         dump file. Check whether
>>>>>>>                                         there are too many loaded
>>>>>>>                                         classes.
>>>>>>>
>>>>>>>                                         [1]
>>>>>>>                                         https://www.eclipse.org/mat/
>>>>>>>
>>>>>>>>                                         On Apr 18, 2022, at 9:55 PM, John Smith
>>>>>>>>                                         <ja...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>                                         Hi, can anyone help
>>>>>>>>                                         with this? I never
>>>>>>>>                                         looked at a dump file
>>>>>>>>                                         before.
>>>>>>>>
>>>>>>>>                                         On Thu, Apr 14, 2022 at
>>>>>>>>                                         11:59 AM John Smith
>>>>>>>>                                         <ja...@gmail.com>
>>>>>>>>                                         wrote:
>>>>>>>>
>>>>>>>>                                             Hi, so I have a
>>>>>>>>                                             dump file. What do
>>>>>>>>                                             I look for?
>>>>>>>>
>>>>>>>>                                             On Thu, Mar 31,
>>>>>>>>                                             2022 at 3:28 PM
>>>>>>>>                                             John Smith
>>>>>>>>                                             <ja...@gmail.com>
>>>>>>>>                                             wrote:
>>>>>>>>
>>>>>>>>                                                 Ok so if
>>>>>>>>                                                 there's a leak,
>>>>>>>>                                                 if I
>>>>>>>>                                                 manually stop
>>>>>>>>                                                 the job and
>>>>>>>>                                                 restart it from
>>>>>>>>                                                 the UI multiple
>>>>>>>>                                                 times, I won't
>>>>>>>>                                                 see the issue
>>>>>>>>                                                 because because
>>>>>>>>                                                 the classes are
>>>>>>>>                                                 unloaded
>>>>>>>>                                                 correctly?
>>>>>>>>
>>>>>>>>
>>>>>>>>                                                 On Thu, Mar 31,
>>>>>>>>                                                 2022 at 9:20 AM
>>>>>>>>                                                 huweihua
>>>>>>>>                                                 <hu...@gmail.com>
>>>>>>>>                                                 wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>                                                     The
>>>>>>>>                                                     difference
>>>>>>>>                                                     is that
>>>>>>>>                                                     manually
>>>>>>>>                                                     canceling
>>>>>>>>                                                     the job
>>>>>>>>                                                     stops the
>>>>>>>>                                                     JobMaster,
>>>>>>>>                                                     but
>>>>>>>>                                                     automatic
>>>>>>>>                                                     failover
>>>>>>>>                                                     keeps the
>>>>>>>>                                                     JobMaster
>>>>>>>>                                                     running.
>>>>>>>>                                                     But looking
>>>>>>>>                                                     on
>>>>>>>>                                                     TaskManager,
>>>>>>>>                                                     it doesn't
>>>>>>>>                                                     make much
>>>>>>>>                                                     difference
>>>>>>>>
>>>>>>>>
>>>>>>>>>                                                     2022年3月31日
>>>>>>>>>                                                     上午4:01，John
>>>>>>>>>                                                     Smith
>>>>>>>>>                                                     <ja...@gmail.com>
>>>>>>>>>                                                     写道：
>>>>>>>>>
>>>>>>>>>                                                     Also if I
>>>>>>>>>                                                     manually
>>>>>>>>>                                                     cancel and
>>>>>>>>>                                                     restart
>>>>>>>>>                                                     the same
>>>>>>>>>                                                     job over
>>>>>>>>>                                                     and over
>>>>>>>>>                                                     is it the
>>>>>>>>>                                                     same as if
>>>>>>>>>                                                     flink was
>>>>>>>>>                                                     restarting
>>>>>>>>>                                                     a job due
>>>>>>>>>                                                     to failure?
>>>>>>>>>
>>>>>>>>>                                                     I.e: When
>>>>>>>>>                                                     I click
>>>>>>>>>                                                     "Cancel
>>>>>>>>>                                                     Job" on
>>>>>>>>>                                                     the UI is
>>>>>>>>>                                                     the job
>>>>>>>>>                                                     completely
>>>>>>>>>                                                     unloaded
>>>>>>>>>                                                     vs when
>>>>>>>>>                                                     the job
>>>>>>>>>                                                     scheduler
>>>>>>>>>                                                     restarts a
>>>>>>>>>                                                     job
>>>>>>>>>                                                     because if
>>>>>>>>>                                                     whatever
>>>>>>>>>                                                     reason?
>>>>>>>>>
>>>>>>>>>                                                     Lile this
>>>>>>>>>                                                     I'll stop
>>>>>>>>>                                                     and
>>>>>>>>>                                                     restart
>>>>>>>>>                                                     the job a
>>>>>>>>>                                                     few times
>>>>>>>>>                                                     or maybe I
>>>>>>>>>                                                     can trick
>>>>>>>>>                                                     my job to
>>>>>>>>>                                                     fail and
>>>>>>>>>                                                     have the
>>>>>>>>>                                                     scheduler
>>>>>>>>>                                                     restart
>>>>>>>>>                                                     it. Ok let
>>>>>>>>>                                                     me think
>>>>>>>>>                                                     about this...
>>>>>>>>>
>>>>>>>>>                                                     On Wed,
>>>>>>>>>                                                     Mar 30,
>>>>>>>>>                                                     2022 at
>>>>>>>>>                                                     10:24 AM
>>>>>>>>>                                                     胡伟华
>>>>>>>>>                                                     <hu...@gmail.com>
>>>>>>>>>                                                     wrote:
>>>>>>>>>
>>>>>>>>>> So if I run the same jobs in my dev env will I still be able to
>>>>>>>>>> see the similar dump?
>>>>>>>>> I think running the same job in dev should be reproducible, maybe
>>>>>>>>> you can have a try.
>>>>>>>>>
>>>>>>>>>> If not I would have to wait at a low volume time to do it on
>>>>>>>>>> production. Also if I recall the dump is as big as the JVM memory
>>>>>>>>>> right so if I have 10GB configured for the JVM the dump will be
>>>>>>>>>> 10GB file?
>>>>>>>>> Yes, JMAP will pause the JVM, the time of pause depends on the size
>>>>>>>>> to dump. You can use "jmap -dump:live" to dump only the reachable
>>>>>>>>> objects, this will take a brief pause
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I have 3 task managers (see config below). There is total of 10
>>>>>>>>>> jobs with 25 slots being used. The jobs are 100% ETL, i.e. they
>>>>>>>>>> load JSON, transform it and push it to JDBC; only 1 job of the 10
>>>>>>>>>> is pushing to Apache Ignite cluster.
>>>>>>>>>>
>>>>>>>>>> FOR JMAP. I know that it will pause the task manager. So if I run
>>>>>>>>>> the same jobs in my dev env will I still be able to see the similar
>>>>>>>>>> dump? I assume so. If not I would have to wait at a low volume time
>>>>>>>>>> to do it on production. Also if I recall the dump is as big as the
>>>>>>>>>> JVM memory right so if I have 10GB configured for the JVM the dump
>>>>>>>>>> will be 10GB file?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>
>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>
>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>
>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>
>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>
>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>     Hi, John
>>>>>>>>>>
>>>>>>>>>>     Could you tell us your application scenario? Is it a flink
>>>>>>>>>>     session cluster with a lot of jobs?
>>>>>>>>>>
>>>>>>>>>>     Maybe you can try to dump the memory with jmap and use tools
>>>>>>>>>>     such as MAT to analyze whether there are abnormal classes and
>>>>>>>>>>     classloaders
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>     > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>     >
>>>>>>>>>>     > Hi running 1.14.4
>>>>>>>>>>     >
>>>>>>>>>>     > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>>>>     > Metaspace. The metaspace out-of-memory error has occurred. This
>>>>>>>>>>     > can mean two things: either the job requires a larger size of
>>>>>>>>>>     > JVM metaspace to load classes or there is a class loading leak.
>>>>>>>>>>     >
>>>>>>>>>>     > I have 2GB of metaspace configured
>>>>>>>>>>     > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>     >
>>>>>>>>>>     > But the task nodes still fail.
>>>>>>>>>>     >
>>>>>>>>>>     > When looking at the UI metrics, the metaspace starts low. Now I
>>>>>>>>>>     > see 85% usage. It seems to be a class loading leak at this
>>>>>>>>>>     > point, how can we debug this issue?

Re: How to debug Metaspace exception?

Posted by Chesnay Schepler <ch...@apache.org>.

The setting itself isn't taskmanager specific; it applies to both the 
job- and taskmanager process.
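
For illustration, a minimal sketch of how the entry discussed further down this thread might look in flink-conf.yaml. It assumes the Ignite JDBC driver jar has already been copied into lib/ on every JobManager and TaskManager host. The pattern value is the one quoted below in this thread; the comments are annotations, not taken from anyone's actual config.

```yaml
# flink-conf.yaml (sketch; the same key is read by the JobManager and TaskManager processes)
# Assumption: the driver jar is already present in Flink's lib/ directory on every host.
# The pattern only changes lookup order for classes found in both lib/ and the user jar.
classloader.parent-first-patterns.additional: org.apache.ignite.
```

Without the jar in lib/, this entry alone should not be expected to change anything, as explained in the quoted messages below.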

On 02/05/2022 05:29, John Smith wrote:
> Also just to be sure this is a Task Manager setting right?
>
> On Thu, Apr 28, 2022 at 11:13 AM John Smith <ja...@gmail.com> 
> wrote:
>
>     I assume you will take action on your side to track and fix the
>     doc? :)
>
>     On Thu, Apr 28, 2022 at 11:12 AM John Smith
>     <ja...@gmail.com> wrote:
>
>         Ok so to summarize...
>
>         - Build my job jar and have the JDBC driver as a compile only
>         dependency and copy the JDBC driver to flink lib folder.
>
>         Or
>
>         - Build my job jar and include the JDBC driver in the shadow jar,
>         plus copy the JDBC driver into the Flink lib folder, plus make an
>         entry in the config for
>         |classloader.parent-first-patterns-additional|
>         <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>
>
>         On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler
>         <ch...@apache.org> wrote:
>
>             I think what I meant was "either add it to /lib, or [if it
>             is already in /lib but also bundled in the jar] add it to
>             the parent-first patterns."
>
>             On 28/04/2022 15:56, Chesnay Schepler wrote:
>>             Pretty sure, even though I seemingly documented it
>>             incorrectly :)
>>
>>             On 28/04/2022 15:49, John Smith wrote:
>>>             You sure?
>>>
>>>              *
>>>
>>>                 /JDBC/: JDBC drivers leak references outside the
>>>                 user code classloader. To ensure that these classes
>>>                 are only loaded once you should either add the
>>>                 driver jars to Flink’s |lib/| folder, or add the
>>>                 driver classes to the list of parent-first loaded
>>>                 class via
>>>                 |classloader.parent-first-patterns-additional|
>>>                 <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>>
>>>                 It says either or
>>>
>>>
>>>             On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler
>>>             <ch...@apache.org> wrote:
>>>
>>>                 You're misinterpreting the docs.
>>>
>>>                 The parent/child-first classloading controls where
>>>                 Flink looks for a class /first/, specifically
>>>                 whether we first load from /lib or the user-jar.
>>>                 It does not allow you to load something from the
>>>                 user-jar in the parent classloader. That's just not
>>>                 how it works.
>>>
>>>                 It must be in /lib.
>>>
>>>                 On 27/04/2022 04:59, John Smith wrote:
>>>>                 Hi Chesnay as per the docs...
>>>>                 https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>
>>>>                 You can either put the jars in task manager lib
>>>>                 folder or use
>>>>                 |classloader.parent-first-patterns-additional|
>>>>                 <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>
>>>>                 I prefer the latter like this: the dependency stays
>>>>                 with the user-jar and not on the task manager.
>>>>
>>>>                 On Tue, Apr 26, 2022 at 9:52 PM John Smith
>>>>                 <ja...@gmail.com> wrote:
>>>>
>>>>                     Ok so I should put the Apache Ignite and my
>>>>                     Microsoft drivers in the lib folders of my task
>>>>                     managers?
>>>>
>>>>                     And then in my job jar only include them as
>>>>                     compile time dependencies?
>>>>
>>>>
>>>>                     On Tue, Apr 26, 2022 at 10:42 AM Chesnay
>>>>                     Schepler <ch...@apache.org> wrote:
>>>>
>>>>                         JDBC drivers are well-known for leaking
>>>>                         classloaders unfortunately.
>>>>
>>>>                         You have correctly identified your
>>>>                         alternatives.
>>>>
>>>>                         You must put the jdbc driver into /lib
>>>>                         instead. Setting only the parent-first
>>>>                         pattern shouldn't affect anything.
>>>>                         That is only relevant if something is in
>>>>                         both /lib and the user-jar, telling
>>>>                         Flink to prioritize what is in lib.
>>>>
>>>>
>>>>
>>>>                         On 26/04/2022 15:35, John Smith wrote:
>>>>>                         So I
>>>>>                         put classloader.parent-first-patterns.additional:
>>>>>                         "org.apache.ignite." in the task config
>>>>>                         and so far I don't think I'm getting
>>>>>                         "java.lang.OutOfMemoryError: Metaspace"
>>>>>                         any more.
>>>>>
>>>>>                         Or it's too early to tell.
>>>>>
>>>>>                         Though now, the task managers are shutting
>>>>>                         down due to some other failures.
>>>>>
>>>>>                         So maybe because tasks were failing and
>>>>>                         reloading often the task manager was
>>>>>                         running out of Metaspace. But now maybe
>>>>>                         it's just cleanly shutting down.
>>>>>
>>>>>                         On Wed, Apr 20, 2022 at 11:35 AM John
>>>>>                         Smith <ja...@gmail.com> wrote:
>>>>>
>>>>>                             Or I can put in the config to treat
>>>>>                             org.apache.ignite. classes as first class?
>>>>>
>>>>>                             On Tue, Apr 19, 2022 at 10:18 PM John
>>>>>                             Smith <ja...@gmail.com> wrote:
>>>>>
>>>>>                                 Ok, so I loaded the dump into
>>>>>                                 Eclipse MAT and followed:
>>>>>                                 https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>
>>>>>                                 - On the Histogram, I got over 30
>>>>>                                 entries for: ChildFirstClassLoader
>>>>>                                 - Then I clicked on one of them
>>>>>                                 "Merge Shortest Path..." and
>>>>>                                 picked "Exclude all
>>>>>                                 phantom/weak/soft references"
>>>>>                                 - Which then gave me:
>>>>>                                 SqlDriverManager > Apache Ignite
>>>>>                                 JdbcThin Driver
>>>>>
>>>>>                                 So I'm guessing this applies to
>>>>>                                 anything JDBC based: I should copy
>>>>>                                 the drivers into the task manager
>>>>>                                 lib folder and have my jobs declare
>>>>>                                 them as compile-only dependencies?
>>>>>
>>>>>                                 On Tue, Apr 19, 2022 at 12:18 PM
>>>>>                                 Yaroslav Tkachenko
>>>>>                                 <ya...@goldsky.io> wrote:
>>>>>
>>>>>                                     Also
>>>>>                                     https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>                                     might be helpful (has a
>>>>>                                     section on profiling, as well
>>>>>                                     as classloading).
>>>>>
>>>>>                                     On Tue, Apr 19, 2022 at 4:35
>>>>>                                     AM Chesnay Schepler
>>>>>                                     <ch...@apache.org> wrote:
>>>>>
>>>>>                                         We have a very rough
>>>>>                                         "guide" in the wiki (it's
>>>>>                                         just the specific steps I
>>>>>                                         took to debug another leak):
>>>>>                                         https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>
>>>>>                                         On 19/04/2022 12:01,
>>>>>                                         huweihua wrote:
>>>>>>                                         Hi, John
>>>>>>
>>>>>>                                         Sorry for the late reply.
>>>>>>                                         You can use MAT[1] to
>>>>>>                                         analyze the dump file.
>>>>>>                                         Check whether there are too
>>>>>>                                         many loaded classes.
>>>>>>
>>>>>>                                         [1]
>>>>>>                                         https://www.eclipse.org/mat/
>>>>>>
>>>>>>>                                         On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>
>>>>>>>                                         Hi, can anyone help with
>>>>>>>                                         this? I never looked at
>>>>>>>                                         a dump file before.
>>>>>>>
>>>>>>>                                         On Thu, Apr 14, 2022 at 11:59 AM John Smith <ja...@gmail.com> wrote:
>>>>>>>
>>>>>>>                                             Hi, so I have a dump file. What do I look for?
>>>>>>>
>>>>>>>                                             On Thu, Mar 31, 2022 at 3:28 PM John Smith <ja...@gmail.com> wrote:
>>>>>>>
>>>>>>>                                                 Ok so if there's a leak, if I manually stop the job and
>>>>>>>                                                 restart it from the UI multiple times, I won't see the
>>>>>>>                                                 issue because the classes are unloaded correctly?
>>>>>>>
>>>>>>>
>>>>>>>                                                 On Thu, Mar 31, 2022 at 9:20 AM huweihua <hu...@gmail.com> wrote:
>>>>>>>
>>>>>>>
>>>>>>>                                                     The difference is that manually canceling the job stops
>>>>>>>                                                     the JobMaster, but automatic failover keeps the JobMaster
>>>>>>>                                                     running. From the TaskManager's point of view, though, it
>>>>>>>                                                     doesn't make much difference.
>>>>>>>
>>>>>>>
>>>>>>>>                                                     On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>                                                     Also if I manually cancel and restart the same job over
>>>>>>>>                                                     and over, is it the same as if Flink was restarting a job
>>>>>>>>                                                     due to failure?
>>>>>>>>
>>>>>>>>                                                     I.e.: when I click "Cancel Job" on the UI, is the job
>>>>>>>>                                                     completely unloaded, vs. when the job scheduler restarts
>>>>>>>>                                                     a job for whatever reason?
>>>>>>>>
>>>>>>>>                                                     Like this I'll stop and restart the job a few times, or
>>>>>>>>                                                     maybe I can trick my job to fail and have the scheduler
>>>>>>>>                                                     restart it. Ok let me think about this...
>>>>>>>>
>>>>>>>>                                                     On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>>                                                         So if I run the same jobs in my dev env will I still be
>>>>>>>>>                                                         able to see the similar dump?
>>>>>>>>                                                         I think running the same job in dev should reproduce it;
>>>>>>>>                                                         maybe you can give it a try.
>>>>>>>>
>>>>>>>>>                                                         If not I would have to wait at a low volume time to do
>>>>>>>>>                                                         it on production. Also, if I recall, the dump is as big
>>>>>>>>>                                                         as the JVM memory, right? So if I have 10GB configured
>>>>>>>>>                                                         for the JVM, the dump will be a 10GB file?
>>>>>>>>                                                         Yes, jmap will pause the JVM; the length of the pause
>>>>>>>>                                                         depends on the size of the dump. You can use
>>>>>>>>                                                         "jmap -dump:live" to dump only the reachable objects,
>>>>>>>>                                                         which will only take a brief pause.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>>                                                         On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>                                                         I have 3 task managers (see config below). There is a
>>>>>>>>>                                                         total of 10 jobs with 25 slots being used. The jobs are
>>>>>>>>>                                                         100% ETL, i.e. they load JSON, transform it and push it
>>>>>>>>>                                                         to JDBC; only 1 job of the 10 is pushing to an Apache
>>>>>>>>>                                                         Ignite cluster.
>>>>>>>>>
>>>>>>>>>                                                         For jmap: I know that it will pause the task manager. So
>>>>>>>>>                                                         if I run the same jobs in my dev env will I still be
>>>>>>>>>                                                         able to see the similar dump? I assume so. If not I
>>>>>>>>>                                                         would have to wait at a low volume time to do it on
>>>>>>>>>                                                         production. Also, if I recall, the dump is as big as the
>>>>>>>>>                                                         JVM memory, right? So if I have 10GB configured for the
>>>>>>>>>                                                         JVM, the dump will be a 10GB file?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                                                         # Operating system has 16GB total.
>>>>>>>>>                                                         env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>
>>>>>>>>>                                                         cluster.evenly-spread-out-slots: true
>>>>>>>>>
>>>>>>>>>                                                         taskmanager.memory.flink.size: 10240m
>>>>>>>>>                                                         taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>                                                         taskmanager.numberOfTaskSlots: 16
>>>>>>>>>                                                         parallelism.default: 1
>>>>>>>>>
>>>>>>>>>                                                         high-availability: zookeeper
>>>>>>>>>                                                         high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>                                                         high-availability.zookeeper.quorum: ...
>>>>>>>>>                                                         high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>                                                         high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>
>>>>>>>>>                                                         web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>
>>>>>>>>>                                                         state.backend: rocksdb
>>>>>>>>>                                                         state.backend.incremental: true
>>>>>>>>>                                                         state.checkpoints.dir:
>>>>>>>>>                                                         file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>                                                         state.savepoints.dir:
>>>>>>>>>                                                         file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>
>>>>>>>>>                                                         On
>>>>>>>>>                                                         Wed,
>>>>>>>>>                                                         Mar
>>>>>>>>>                                                         30,
>>>>>>>>>                                                         2022
>>>>>>>>>                                                         at
>>>>>>>>>                                                         2:16
>>>>>>>>>                                                         AM 胡伟华
>>>>>>>>>                                                         <hu...@gmail.com>
>>>>>>>>>                                                         wrote:
>>>>>>>>>
>>>>>>>>>                                                             Hi,
>>>>>>>>>                                                             John
>>>>>>>>>
>>>>>>>>>                                                             Could
>>>>>>>>>                                                             you
>>>>>>>>>                                                             tell
>>>>>>>>>                                                             us
>>>>>>>>>                                                             you
>>>>>>>>>                                                             application
>>>>>>>>>                                                             scenario?
>>>>>>>>>                                                             Is
>>>>>>>>>                                                             it
>>>>>>>>>                                                             a
>>>>>>>>>                                                             flink
>>>>>>>>>                                                             session
>>>>>>>>>                                                             cluster
>>>>>>>>>                                                             with
>>>>>>>>>                                                             a
>>>>>>>>>                                                             lot
>>>>>>>>>                                                             of
>>>>>>>>>                                                             jobs?
>>>>>>>>>
>>>>>>>>>                                                             Maybe
>>>>>>>>>                                                             you
>>>>>>>>>                                                             can
>>>>>>>>>                                                             try
>>>>>>>>>                                                             to
>>>>>>>>>                                                             dump
>>>>>>>>>                                                             the
>>>>>>>>>                                                             memory
>>>>>>>>>                                                             with
>>>>>>>>>                                                             jmap
>>>>>>>>>                                                             and
>>>>>>>>>                                                             use
>>>>>>>>>                                                             tools
>>>>>>>>>                                                             such
>>>>>>>>>                                                             as
>>>>>>>>>                                                             MAT
>>>>>>>>>                                                             to
>>>>>>>>>                                                             analyze
>>>>>>>>>                                                             whether
>>>>>>>>>                                                             there
>>>>>>>>>                                                             are
>>>>>>>>>                                                             abnormal
>>>>>>>>>                                                             classes
>>>>>>>>>                                                             and
>>>>>>>>>                                                             classloaders
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>                                                             >
>>>>>>>>>                                                             2022年3月30日
>>>>>>>>>                                                             上午6:09，John
>>>>>>>>>                                                             Smith
>>>>>>>>>                                                             <ja...@gmail.com>
>>>>>>>>>                                                             写道：
>>>>>>>>>                                                             >
>>>>>>>>>                                                             >
>>>>>>>>>                                                             Hi
>>>>>>>>>                                                             running
>>>>>>>>>                                                             1.14.4
>>>>>>>>>                                                             >
>>>>>>>>>                                                             >
>>>>>>>>>                                                             My
>>>>>>>>>                                                             tasks
>>>>>>>>>                                                             manager
>>>>>>>>>                                                             still
>>>>>>>>>                                                             fails
>>>>>>>>>                                                             with
>>>>>>>>>                                                             java.lang.OutOfMemoryError:
>>>>>>>>>                                                             Metaspace.
>>>>>>>>>                                                             The
>>>>>>>>>                                                             metaspace
>>>>>>>>>                                                             out-of-memory
>>>>>>>>>                                                             error
>>>>>>>>>                                                             has
>>>>>>>>>                                                             occurred.
>>>>>>>>>                                                             This
>>>>>>>>>                                                             can
>>>>>>>>>                                                             mean
>>>>>>>>>                                                             two
>>>>>>>>>                                                             things:
>>>>>>>>>                                                             either
>>>>>>>>>                                                             the
>>>>>>>>>                                                             job
>>>>>>>>>                                                             requires
>>>>>>>>>                                                             a
>>>>>>>>>                                                             larger
>>>>>>>>>                                                             size
>>>>>>>>>                                                             of
>>>>>>>>>                                                             JVM
>>>>>>>>>                                                             metaspace
>>>>>>>>>                                                             to
>>>>>>>>>                                                             load
>>>>>>>>>                                                             classes
>>>>>>>>>                                                             or
>>>>>>>>>                                                             there
>>>>>>>>>                                                             is
>>>>>>>>>                                                             a
>>>>>>>>>                                                             class
>>>>>>>>>                                                             loading
>>>>>>>>>                                                             leak.
>>>>>>>>>                                                             >
>>>>>>>>>                                                             >
>>>>>>>>>                                                             I
>>>>>>>>>                                                             have
>>>>>>>>>                                                             2GB
>>>>>>>>>                                                             of
>>>>>>>>>                                                             metaspace
>>>>>>>>>                                                             configed
>>>>>>>>>                                                             taskmanager.memory.jvm-metaspace.size:
>>>>>>>>>                                                             2048m
>>>>>>>>>                                                             >
>>>>>>>>>                                                             >
>>>>>>>>>                                                             But
>>>>>>>>>                                                             the
>>>>>>>>>                                                             task
>>>>>>>>>                                                             nodes
>>>>>>>>>                                                             still
>>>>>>>>>                                                             fail.
>>>>>>>>>                                                             >
>>>>>>>>>                                                             >
>>>>>>>>>                                                             When
>>>>>>>>>                                                             looking
>>>>>>>>>                                                             at
>>>>>>>>>                                                             the
>>>>>>>>>                                                             UI
>>>>>>>>>                                                             metrics,
>>>>>>>>>                                                             the
>>>>>>>>>                                                             metaspace
>>>>>>>>>                                                             starts
>>>>>>>>>                                                             low.
>>>>>>>>>                                                             Now
>>>>>>>>>                                                             I
>>>>>>>>>                                                             see
>>>>>>>>>                                                             85%
>>>>>>>>>                                                             usage.
>>>>>>>>>                                                             It
>>>>>>>>>                                                             seems
>>>>>>>>>                                                             to
>>>>>>>>>                                                             be
>>>>>>>>>                                                             a
>>>>>>>>>                                                             class
>>>>>>>>>                                                             loading
>>>>>>>>>                                                             leak
>>>>>>>>>                                                             at
>>>>>>>>>                                                             this
>>>>>>>>>                                                             point,
>>>>>>>>>                                                             how
>>>>>>>>>                                                             can
>>>>>>>>>                                                             we
>>>>>>>>>                                                             debug
>>>>>>>>>                                                             this
>>>>>>>>>                                                             issue?
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
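
A side note on the heap dump question quoted above: jmap attaches to a running process from the outside. As a rough sketch of the same "live objects only" idea, the JDK's HotSpotDiagnosticMXBean can write a heap dump from inside a JVM. The class name LiveHeapDump and the output path are made up for illustration. This dumps the JVM that runs the snippet, not a remote TaskManager, so it is mainly a way to produce a small dump for trying out MAT in a dev environment, not something proposed in the thread.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.File;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;

class LiveHeapDump {

    /**
     * Writes a heap dump of the JVM running this code.
     * live = true keeps only reachable objects, similar to "jmap -dump:live".
     */
    static File dump(String path, boolean live) throws IOException {
        File out = new File(path).getAbsoluteFile();
        // dumpHeap fails if the target file already exists, so clear it first.
        Files.deleteIfExists(out.toPath());
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        bean.dumpHeap(out.getPath(), live);
        return out;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output location; recent JDKs require the .hprof suffix.
        String path = System.getProperty("java.io.tmpdir") + File.separator + "live-heap.hprof";
        File file = dump(path, true);
        System.out.println(file + ": " + file.length() + " bytes");
    }
}
```

The resulting .hprof file can be opened in MAT in the same way as a file produced by jmap.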

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

Also, just to be sure: this is a Task Manager setting, right?

On Thu, Apr 28, 2022 at 11:13 AM John Smith <ja...@gmail.com> wrote:

> I assume you will take action on your side to track and fix the doc? :)
>
> On Thu, Apr 28, 2022 at 11:12 AM John Smith <ja...@gmail.com>
> wrote:
>
>> Ok so to summarize...
>>
>> - Build my job jar with the JDBC driver as a compile-only dependency,
>> and copy the JDBC driver to the Flink lib folder.
>>
>> Or
>>
>> - Build my job jar with the JDBC driver included in the shadow jar, copy
>> the JDBC driver to the Flink lib folder, and add an entry in the config
>> for classloader.parent-first-patterns.additional
>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>
>>
>> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ch...@apache.org>
>> wrote:
>>
>>> I think what I meant was "either add it to /lib, or [if it is already in
>>> /lib but also bundled in the jar] add it to the parent-first patterns."
>>>
>>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>>
>>> Pretty sure, even though I seemingly documented it incorrectly :)
>>>
>>> On 28/04/2022 15:49, John Smith wrote:
>>>
>>> You sure?
>>>
>>>    -
>>>
>>>    *JDBC*: JDBC drivers leak references outside the user code
>>>    classloader. To ensure that these classes are only loaded once you should
>>>    either add the driver jars to Flink’s lib/ folder, or add the driver
>>>    classes to the list of parent-first loaded class via
>>>    classloader.parent-first-patterns-additional
>>>    <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>    .
>>>
>>>    It says either or
>>>
>>>
>>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ch...@apache.org>
>>> wrote:
>>>
>>>> You're misinterpreting the docs.
>>>>
>>>> The parent/child-first classloading controls where Flink looks for a
>>>> class *first*, specifically whether we first load from /lib or the
>>>> user-jar.
>>>> It does not allow you to load something from the user-jar in the parent
>>>> classloader. That's just not how it works.
>>>>
>>>> It must be in /lib.
>>>>
>>>> On 27/04/2022 04:59, John Smith wrote:
>>>>
>>>> Hi Chesnay as per the docs...
>>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>>
>>>> You can either put the jars in task manager lib folder or use
>>>> classloader.parent-first-patterns-additional
>>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>>
>>>> I prefer the latter: that way the dependency stays with the user-jar
>>>> and not on the task manager.
>>>>
>>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok so I should put the Apache ignite and my Microsoft drivers in the
>>>>> lib folders of my task managers?
>>>>>
>>>>> And then in my job jar only include them as compile time dependencies?
>>>>>
>>>>>
>>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ch...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>>
>>>>>> You have correctly identified your alternatives.
>>>>>>
>>>>>> You must put the jdbc driver into /lib instead. Setting only the
>>>>>> parent-first pattern shouldn't affect anything.
>>>>>> That is only relevant if something is both in /lib and in the
>>>>>> user-jar, telling Flink to prioritize what is in lib.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>>
>>>>>> So I put classloader.parent-first-patterns.additional:
>>>>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>>>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>>
>>>>>> Or it's too early to tell.
>>>>>>
>>>>>> Though now, the task managers are shutting down due to some
>>>>>> other failures.
>>>>>>
>>>>>> So maybe, because tasks were failing and reloading often, the task
>>>>>> manager was running out of Metaspace. But now maybe it's just
>>>>>> cleanly shutting down.
>>>>>>
>>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Or I can put in the config to treat org.apache.ignite. classes as
>>>>>>> first class?
>>>>>>>
>>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>
>>>>>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>>>>>>>> "Exclude all phantom/weak/soft references"
>>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>>>>>>> Driver
>>>>>>>>
>>>>>>>> So I'm guessing that for anything JDBC-based, I should copy the driver
>>>>>>>> into the task manager lib folder and make it a compile-only dependency
>>>>>>>> in my jobs?
>>>>>>>>
>>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>>>>>> yaroslav@goldsky.io> wrote:
>>>>>>>>
>>>>>>>>> Also
>>>>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>>>>>>
>>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <
>>>>>>>>> chesnay@apache.org> wrote:
>>>>>>>>>
>>>>>>>>>> We have a very rough "guide" in the wiki (it's just the specific
>>>>>>>>>> steps I took to debug another leak):
>>>>>>>>>>
>>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>>
>>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, John
>>>>>>>>>>
>>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>>>>>>>>>> file. Check whether there are too many loaded classes.
>>>>>>>>>>
>>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>>
>>>>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Hi, can anyone help with this? I never looked at a dump file
>>>>>>>>>> before.
>>>>>>>>>>
>>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <
>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <
>>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Ok so if there's a leak, if I manually stop the job and restart
>>>>>>>>>>>> it from the UI multiple times, I won't see the issue because the
>>>>>>>>>>>> classes are unloaded correctly?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <
>>>>>>>>>>>> huweihua.ckl@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running. From the
>>>>>>>>>>>>> TaskManager's point of view, though, it doesn't make much difference.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Also if I manually cancel and restart the same job over and
>>>>>>>>>>>>> over is it the same as if flink was restarting a job due to failure?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I.e.: When I click "Cancel Job" on the UI, is the job completely
>>>>>>>>>>>>> unloaded, vs. when the job scheduler restarts a job for whatever
>>>>>>>>>>>>> reason?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Like this, I'll stop and restart the job a few times, or maybe I
>>>>>>>>>>>>> can trick my job into failing and have the scheduler restart it. Ok, let
>>>>>>>>>>>>> me think about this...
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able
>>>>>>>>>>>>>> to see the similar dump?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think running the same job in dev should be reproducible,
>>>>>>>>>>>>>> maybe you can have a try.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  If not, I would have to wait for a low-volume time to do it on
>>>>>>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM memory, right?
>>>>>>>>>>>>>> So if I have 10GB configured for the JVM, the dump will be a 10GB file?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on
>>>>>>>>>>>>>> the size of the dump. You can use "jmap -dump:live" to dump only the
>>>>>>>>>>>>>> reachable objects; this will take only a brief pause.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have 3 task managers (see config below). There is a total of
>>>>>>>>>>>>>> 10 jobs with 25 slots being used.
>>>>>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and
>>>>>>>>>>>>>> push it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite
>>>>>>>>>>>>>> cluster.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> For jmap: I know that it will pause the task manager. So if I
>>>>>>>>>>>>>> run the same jobs in my dev env, will I still be able to see a similar
>>>>>>>>>>>>>> dump? I assume so. If not, I would have to wait for a low-volume time to
>>>>>>>>>>>>>> do it on production. Also, if I recall, the dump is as big as the JVM
>>>>>>>>>>>>>> memory, right? So if I have 10GB configured for the JVM, the dump will be
>>>>>>>>>>>>>> a 10GB file?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools
>>>>>>>>>>>>>>> such as MAT to analyze whether there are abnormal classes and classloaders
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > Hi, running 1.14.4
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>>>>>>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can
>>>>>>>>>>>>>>> > mean two things: either the job requires a larger size of JVM
>>>>>>>>>>>>>>> > metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I see
>>>>>>>>>>>>>>> > 85% usage. It seems to be a class loading leak at this point. How
>>>>>>>>>>>>>>> > can we debug this issue?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>
>>>>
>>>
>>>
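
For readers wondering why the MAT path quoted above ends at the driver manager: java.sql.DriverManager lives in the JDK and keeps a static list of registered drivers, so a driver class loaded by a job's ChildFirstClassLoader stays strongly reachable after the job is gone, together with its class loader. Below is a minimal sketch that reports which class loader defined each registered driver. JdbcDriverOwners is a made-up name, and DriverManager.getDrivers() only returns drivers visible to the calling code, so this may show fewer entries than a heap dump would.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class JdbcDriverOwners {

    /** Returns one line per visible JDBC driver: driver class and defining class loader. */
    static List<String> describeRegisteredDrivers() {
        List<String> lines = new ArrayList<>();
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            // A null class loader means the class came from the bootstrap loader.
            ClassLoader loader = driver.getClass().getClassLoader();
            String owner = (loader == null) ? "bootstrap" : loader.getClass().getName();
            lines.add(driver.getClass().getName() + " <- " + owner);
        }
        return lines;
    }

    public static void main(String[] args) {
        // Prints nothing when no JDBC driver is on the classpath.
        for (String line : describeRegisteredDrivers()) {
            System.out.println(line);
        }
    }
}
```

With the driver jar only in Flink's lib/ folder, the loader shown would be expected to be the JVM's application class loader rather than a ChildFirstClassLoader, which matches the advice in the thread to keep JDBC drivers in /lib.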

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

I assume you will take action on your side to track and fix the doc? :)

On Thu, Apr 28, 2022 at 11:12 AM John Smith <ja...@gmail.com> wrote:

> Ok so to summarize...
>
> - Build my job jar with the JDBC driver as a compile-only dependency,
> and copy the JDBC driver to the Flink lib folder.
>
> Or
>
> - Build my job jar with the JDBC driver included in the shadow jar, copy
> the JDBC driver to the Flink lib folder, and add an entry in the config
> for classloader.parent-first-patterns.additional
> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>
>
> On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ch...@apache.org>
> wrote:
>
>> I think what I meant was "either add it to /lib, or [if it is already in
>> /lib but also bundled in the jar] add it to the parent-first patterns."
>>
>> On 28/04/2022 15:56, Chesnay Schepler wrote:
>>
>> Pretty sure, even though I seemingly documented it incorrectly :)
>>
>> On 28/04/2022 15:49, John Smith wrote:
>>
>> You sure?
>>
>>    -
>>
>>    *JDBC*: JDBC drivers leak references outside the user code
>>    classloader. To ensure that these classes are only loaded once you should
>>    either add the driver jars to Flink’s lib/ folder, or add the driver
>>    classes to the list of parent-first loaded class via
>>    classloader.parent-first-patterns-additional
>>    <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>    .
>>
>>    It says either or
>>
>>
>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ch...@apache.org>
>> wrote:
>>
>>> You're misinterpreting the docs.
>>>
>>> The parent/child-first classloading controls where Flink looks for a
>>> class *first*, specifically whether we first load from /lib or the
>>> user-jar.
>>> It does not allow you to load something from the user-jar in the parent
>>> classloader. That's just not how it works.
>>>
>>> It must be in /lib.
>>>
>>> On 27/04/2022 04:59, John Smith wrote:
>>>
>>> Hi Chesnay as per the docs...
>>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>
>>> You can either put the jars in task manager lib folder or use
>>> classloader.parent-first-patterns-additional
>>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>
>>> I prefer the latter: that way the dependency stays with the user-jar
>>> and not on the task manager.
>>>
>>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <ja...@gmail.com>
>>> wrote:
>>>
>>>> Ok so I should put the Apache ignite and my Microsoft drivers in the
>>>> lib folders of my task managers?
>>>>
>>>> And then in my job jar only include them as compile time dependencies?
>>>>
>>>>
>>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ch...@apache.org>
>>>> wrote:
>>>>
>>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>>
>>>>> You have correctly identified your alternatives.
>>>>>
>>>>> You must put the jdbc driver into /lib instead. Setting only the
>>>>> parent-first pattern shouldn't affect anything.
>>>>> That is only relevant if something is both in /lib and in the
>>>>> user-jar, telling Flink to prioritize what is in lib.
>>>>>
>>>>>
>>>>>
>>>>> On 26/04/2022 15:35, John Smith wrote:
>>>>>
>>>>> So I put classloader.parent-first-patterns.additional:
>>>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>>>
>>>>> Or it's too early to tell.
>>>>>
>>>>> Though now, the task managers are shutting down due to some
>>>>> other failures.
>>>>>
>>>>> So maybe, because tasks were failing and reloading often, the task
>>>>> manager was running out of Metaspace. But now maybe it's just
>>>>> cleanly shutting down.
>>>>>
>>>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Or I can put in the config to treat org.apache.ignite. classes as
>>>>>> first class?
>>>>>>
>>>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>
>>>>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>>>>>>> "Exclude all phantom/weak/soft references"
>>>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin
>>>>>>> Driver
>>>>>>>
>>>>>>> So I'm guessing that for anything JDBC-based, I should copy the driver
>>>>>>> into the task manager lib folder and make it a compile-only dependency
>>>>>>> in my jobs?
>>>>>>>
>>>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>>>>> yaroslav@goldsky.io> wrote:
>>>>>>>
>>>>>>>> Also
>>>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>>>>>
>>>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <
>>>>>>>> chesnay@apache.org> wrote:
>>>>>>>>
>>>>>>>>> We have a very rough "guide" in the wiki (it's just the specific
>>>>>>>>> steps I took to debug another leak):
>>>>>>>>>
>>>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>>>
>>>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>>>
>>>>>>>>> Hi, John
>>>>>>>>>
>>>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>>>>>>>>> file. Check whether there are too many loaded classes.
>>>>>>>>>
>>>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>>>
>>>>>>>>> On April 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> Hi, can anyone help with this? I have never looked at a dump file
>>>>>>>>> before.
>>>>>>>>>
>>>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <
>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <
>>>>>>>>>> java.dev.mtl@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Ok, so if there's a leak and I manually stop the job and restart
>>>>>>>>>>> it from the UI multiple times, I won't see the issue because the
>>>>>>>>>>> classes are unloaded correctly?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <hu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>>>> JobMaster, while automatic failover keeps the JobMaster running. From the
>>>>>>>>>>>> TaskManager's point of view, though, it doesn't make much difference.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On March 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Also, if I manually cancel and restart the same job over and
>>>>>>>>>>>> over, is it the same as if Flink was restarting a job due to failure?
>>>>>>>>>>>>
>>>>>>>>>>>> I.e., when I click "Cancel Job" in the UI, is the job completely
>>>>>>>>>>>> unloaded, compared to when the job scheduler restarts a job for whatever
>>>>>>>>>>>> reason?
>>>>>>>>>>>>
>>>>>>>>>>>> That way I could stop and restart the job a few times, or maybe I
>>>>>>>>>>>> can trick my job into failing and have the scheduler restart it. Ok, let me
>>>>>>>>>>>> think about this...
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able
>>>>>>>>>>>>> to see the similar dump?
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think the problem should be reproducible by running the same job in
>>>>>>>>>>>>> dev; maybe you can give it a try.
>>>>>>>>>>>>>
>>>>>>>>>>>>>  If not I would have to wait for a low-volume time to do it on
>>>>>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM memory, right? So
>>>>>>>>>>>>> if I have 10GB configured for the JVM, the dump will be a 10GB file?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the
>>>>>>>>>>>>> size of the dump. You can use "jmap -dump:live" to dump only the reachable
>>>>>>>>>>>>> objects; this will take only a brief pause.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On March 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I have 3 task managers (see config below). There is a total of
>>>>>>>>>>>>> 10 jobs with 25 slots being used.
>>>>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and
>>>>>>>>>>>>> push it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for jmap, I know that it will pause the task manager. So if I
>>>>>>>>>>>>> run the same jobs in my dev env, will I still be able to see a similar
>>>>>>>>>>>>> dump? I assume so. If not I would have to wait for a low-volume time to do
>>>>>>>>>>>>> it on production. Also, if I recall, the dump is as big as the JVM memory,
>>>>>>>>>>>>> right? So if I have 10GB configured for the JVM, the dump will be a 10GB file?
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>>>
>>>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>>>
>>>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>>>
>>>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>>>
>>>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>>>
>>>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools
>>>>>>>>>>>>>> such as MAT to analyze whether there are abnormal classes and classloaders.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> > On March 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hi, running 1.14.4
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > My task manager still fails with
>>>>>>>>>>>>>> > java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error
>>>>>>>>>>>>>> > has occurred. This can mean two things: either the job requires a larger
>>>>>>>>>>>>>> > size of JVM metaspace to load classes or there is a class loading leak.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low.
>>>>>>>>>>>>>> > Now I see 85% usage. It seems to be a class loading leak at this point. How
>>>>>>>>>>>>>> > can we debug this issue?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>
>>>
>>
>>

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

Ok so to summarize...

- Build my job jar with the JDBC driver as a compile-only dependency, and
copy the JDBC driver into Flink's lib folder.

Or

- Build my job jar with the JDBC driver bundled in the shadow jar, also copy
the JDBC driver into Flink's lib folder, and add an entry in the config for
classloader.parent-first-patterns.additional
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
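
One way to check which setup is actually in effect is to log, from inside
the job, which classloader each registered JDBC driver came from. This is
only a minimal sketch using plain JDK APIs; the class DriverLoaderReport and
its methods are hypothetical helpers, not part of Flink, and where to call
it (for example from a function's open() method) is an assumption.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not a Flink API): reports which classloader
// loaded each JDBC driver visible to the calling code.
final class DriverLoaderReport {

    // Name of the classloader implementation that defined the class.
    // JDK core classes have a null loader, reported here as "bootstrap".
    static String describeLoader(Class<?> cls) {
        ClassLoader loader = cls.getClassLoader();
        return loader == null ? "bootstrap" : loader.getClass().getName();
    }

    // One line per registered driver: "<driver class> <- <loader class>".
    static List<String> report() {
        List<String> lines = new ArrayList<>();
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            lines.add(driver.getClass().getName()
                    + " <- " + describeLoader(driver.getClass()));
        }
        return lines;
    }

    public static void main(String[] args) {
        report().forEach(System.out::println);
    }
}
```

If a driver line names a loader like the ChildFirstClassLoader seen in the
MAT histogram, the driver was loaded from the job jar. A driver served from
the lib folder should show the JVM's application classloader instead.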


On Thu, Apr 28, 2022 at 10:17 AM Chesnay Schepler <ch...@apache.org>
wrote:

> I think what I meant was "either add it to /lib, or [if it is already in
> /lib but also bundled in the jar] add it to the parent-first patterns."
>
> On 28/04/2022 15:56, Chesnay Schepler wrote:
>
> Pretty sure, even though I seemingly documented it incorrectly :)
>
> On 28/04/2022 15:49, John Smith wrote:
>
> You sure?
>
>    - *JDBC*: JDBC drivers leak references outside the user code
>    classloader. To ensure that these classes are only loaded once you should
>    either add the driver jars to Flink’s lib/ folder, or add the driver
>    classes to the list of parent-first loaded class via
>    classloader.parent-first-patterns-additional
>    <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>
>    It says either or
>
>
> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ch...@apache.org>
> wrote:
>
>> You're misinterpreting the docs.
>>
>> The parent/child-first classloading controls where Flink looks for a
>> class *first*, specifically whether we first load from /lib or the
>> user-jar.
>> It does not allow you to load something from the user-jar in the parent
>> classloader. That's just not how it works.
>>
>> It must be in /lib.
>>
>> On 27/04/2022 04:59, John Smith wrote:
>>
>> Hi Chesnay as per the docs...
>> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>
>> You can either put the jars in the task manager lib folder or use
>> classloader.parent-first-patterns.additional
>> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>
>> I prefer the latter because the dependency then stays with the user-jar and
>> not on the task manager.
>>
>> On Tue, Apr 26, 2022 at 9:52 PM John Smith <ja...@gmail.com>
>> wrote:
>>
>>> Ok, so I should put the Apache Ignite and my Microsoft drivers in the lib
>>> folders of my task managers?
>>>
>>> And then in my job jar only include them as compile-time dependencies?
>>>
>>>
>>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ch...@apache.org>
>>> wrote:
>>>
>>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>>
>>>> You have correctly identified your alternatives.
>>>>
>>>> You must put the JDBC driver into /lib instead. Setting only the
>>>> parent-first pattern shouldn't affect anything.
>>>> That is only relevant if something is both in /lib and in the user-jar,
>>>> telling Flink to prioritize what is in /lib.

Re: How to debug Metaspace exception?

Posted by Chesnay Schepler <ch...@apache.org>.

I think what I meant was "either add it to /lib, or [if it is already in 
/lib but also bundled in the jar] add it to the parent-first patterns."
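
A toy model of that lookup order, only to make the cases concrete. It is a
simplified sketch and not Flink's actual classloader code; the class
LookupOrderSketch, its enum, and the sets standing in for /lib and the
user-jar are all made up for illustration.

```java
import java.util.List;
import java.util.Set;

// Simplified model (an assumption, not Flink code): where does a class
// end up being loaded from, given the parent-first patterns and where
// the class is physically available?
final class LookupOrderSketch {

    enum Source { LIB, USER_JAR, NOT_FOUND }

    static Source resolve(String className,
                          List<String> parentFirstPatterns,
                          Set<String> libClasses,
                          Set<String> userJarClasses) {
        // A pattern matches when the class name starts with it.
        boolean parentFirst =
                parentFirstPatterns.stream().anyMatch(className::startsWith);
        boolean inLib = libClasses.contains(className);
        boolean inUserJar = userJarClasses.contains(className);

        if (parentFirst) {
            // Parent (/lib) is asked first; the user-jar is only a fallback.
            if (inLib) return Source.LIB;
            if (inUserJar) return Source.USER_JAR;
        } else {
            // Child-first: the user-jar wins whenever it has the class.
            if (inUserJar) return Source.USER_JAR;
            if (inLib) return Source.LIB;
        }
        return Source.NOT_FOUND;
    }

    public static void main(String[] args) {
        String driver = "org.apache.ignite.IgniteJdbcThinDriver";
        List<String> patterns = List.of("org.apache.ignite.");
        // Pattern set, but the driver exists only in the user-jar.
        System.out.println(resolve(driver, patterns, Set.of(), Set.of(driver)));
        // Pattern set, and the driver is in both places.
        System.out.println(resolve(driver, patterns, Set.of(driver), Set.of(driver)));
    }
}
```

In this model, setting the pattern while the driver exists only in the
user-jar still resolves to the user-jar, so the driver stays tied to the
user-code classloader. The pattern only changes the outcome when the driver
is present in both places.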

On 28/04/2022 15:56, Chesnay Schepler wrote:
> Pretty sure, even though I seemingly documented it incorrectly :)
>
> On 28/04/2022 15:49, John Smith wrote:
>> You sure?
>>
>>  * /JDBC/: JDBC drivers leak references outside the user code
>>     classloader. To ensure that these classes are only loaded once
>>     you should either add the driver jars to Flink’s |lib/| folder,
>>     or add the driver classes to the list of parent-first loaded
>>     class via |classloader.parent-first-patterns-additional|
>>     <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>>
>>     It says either or
>>
>>
>> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ch...@apache.org> 
>> wrote:
>>
>>     You're misinterpreting the docs.
>>
>>     The parent/child-first classloading controls where Flink looks
>>     for a class /first/, specifically whether we first load from /lib
>>     or the user-jar.
>>     It does not allow you to load something from the user-jar in the
>>     parent classloader. That's just not how it works.
>>
>>     It must be in /lib.
>>
>>     On 27/04/2022 04:59, John Smith wrote:
>>>     Hi Chesnay as per the docs...
>>>     https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>>
>>>     You can either put the jars in the task manager lib folder or use
>>>     |classloader.parent-first-patterns.additional|
>>>     <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>>
>>>     I prefer the latter because the dependency then stays with the
>>>     user-jar and not on the task manager.
>>>
>>>     On Tue, Apr 26, 2022 at 9:52 PM John Smith
>>>     <ja...@gmail.com> wrote:
>>>
>>>         Ok, so I should put the Apache Ignite and my Microsoft
>>>         drivers in the lib folders of my task managers?
>>>
>>>         And then in my job jar only include them as compile-time
>>>         dependencies?
>>>
>>>
>>>         On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler
>>>         <ch...@apache.org> wrote:
>>>
>>>             JDBC drivers are well-known for leaking classloaders
>>>             unfortunately.
>>>
>>>             You have correctly identified your alternatives.
>>>
>>>             You must put the JDBC driver into /lib instead. Setting
>>>             only the parent-first pattern shouldn't affect anything.
>>>             That is only relevant if something is both in /lib and
>>>             in the user-jar, telling Flink to prioritize what is in
>>>             /lib.
>>>
>>>
>>>
>>>             On 26/04/2022 15:35, John Smith wrote:
>>>>             So I put classloader.parent-first-patterns.additional:
>>>>             "org.apache.ignite." in the task config and so far I
>>>>             don't think I'm getting "java.lang.OutOfMemoryError:
>>>>             Metaspace" any more.
>>>>
>>>>             Or it's too early to tell.
>>>>
>>>>             Though now, the task managers are shutting down due to
>>>>             some other failures.
>>>>
>>>>             So maybe the task manager was running out of Metaspace
>>>>             because tasks were failing and reloading often. But now
>>>>             maybe it's just cleanly shutting down.
>>>>
>>>>             On Wed, Apr 20, 2022 at 11:35 AM John Smith
>>>>             <ja...@gmail.com> wrote:
>>>>
>>>>                 Or can I add a config entry so that
>>>>                 org.apache.ignite. classes are loaded parent-first?
>>>>
>>>>                 On Tue, Apr 19, 2022 at 10:18 PM John Smith
>>>>                 <ja...@gmail.com> wrote:
>>>>
>>>>                     Ok, so I loaded the dump into Eclipse MAT and
>>>>                     followed:
>>>>                     https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>
>>>>                     - On the Histogram, I got over 30 entries for:
>>>>                     ChildFirstClassLoader
>>>>                     - Then I clicked on one of them "Merge Shortest
>>>>                     Path..." and picked "Exclude all
>>>>                     phantom/weak/soft references"
>>>>                     - Which then gave me: SqlDriverManager > Apache
>>>>                     Ignite JdbcThin Driver
>>>>
>>>>                     So I'm guessing it applies to anything JDBC-based.
>>>>                     Should I copy the drivers into the task manager lib
>>>>                     folder and make them compile-only dependencies in
>>>>                     my jobs?
>>>>
>>>>                     On Tue, Apr 19, 2022 at 12:18 PM Yaroslav
>>>>                     Tkachenko <ya...@goldsky.io> wrote:
>>>>
>>>>                         Also
>>>>                         https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>                         might be helpful (has a section on
>>>>                         profiling, as well as classloading).
>>>>
>>>>                         On Tue, Apr 19, 2022 at 4:35 AM Chesnay
>>>>                         Schepler <ch...@apache.org> wrote:
>>>>
>>>>                             We have a very rough "guide" in the
>>>>                             wiki (it's just the specific steps I
>>>>                             took to debug another leak):
>>>>                             https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>
>>>>                             On 19/04/2022 12:01, huweihua wrote:
>>>>>                             Hi, John
>>>>>
>>>>>                             Sorry for the late reply. You can use
>>>>>                             MAT[1] to analyze the dump file. Check
>>>>>                             whether there are too many loaded classes.
>>>>>
>>>>>                             [1] https://www.eclipse.org/mat/
>>>>>
>>>>>>                             On April 18, 2022, at 9:55 PM, John Smith
>>>>>>                             <ja...@gmail.com> wrote:
>>>>>>
>>>>>>                             Hi, can anyone help with this? I have
>>>>>>                             never looked at a dump file before.
>>>>>>
>>>>>>                             On Thu, Apr 14, 2022 at 11:59 AM John
>>>>>>                             Smith <ja...@gmail.com> wrote:
>>>>>>
>>>>>>                                 Hi, so I have a dump file. What
>>>>>>                                 do I look for?
>>>>>>
>>>>>>                                 On Thu, Mar 31, 2022 at 3:28 PM
>>>>>>                                 John Smith
>>>>>>                                 <ja...@gmail.com> wrote:
>>>>>>
>>>>>>                                     Ok, so if there's a leak and I
>>>>>>                                     manually stop the job and
>>>>>>                                     restart it from the UI
>>>>>>                                     multiple times, I won't see
>>>>>>                                     the issue because the
>>>>>>                                     classes are unloaded correctly?
>>>>>>
>>>>>>
>>>>>>                                     On Thu, Mar 31, 2022 at 9:20
>>>>>>                                     AM huweihua
>>>>>>                                     <hu...@gmail.com> wrote:
>>>>>>
>>>>>>
>>>>>>                                         The difference is that
>>>>>>                                         manually canceling the
>>>>>>                                         job stops the JobMaster,
>>>>>>                                         while automatic failover
>>>>>>                                         keeps the JobMaster
>>>>>>                                         running. From the
>>>>>>                                         TaskManager's point of
>>>>>>                                         view, though, it doesn't
>>>>>>                                         make much difference.
>>>>>>
>>>>>>
>>>>>>>                                         On March 31, 2022, at 4:01 AM,
>>>>>>>                                         John Smith
>>>>>>>                                         <ja...@gmail.com>
>>>>>>>                                         wrote:
>>>>>>>
>>>>>>>                                         Also, if I manually
>>>>>>>                                         cancel and restart the
>>>>>>>                                         same job over and over,
>>>>>>>                                         is it the same as if
>>>>>>>                                         Flink was restarting a
>>>>>>>                                         job due to failure?
>>>>>>>
>>>>>>>                                         I.e., when I click
>>>>>>>                                         "Cancel Job" in the UI,
>>>>>>>                                         is the job completely
>>>>>>>                                         unloaded, compared to
>>>>>>>                                         when the job scheduler
>>>>>>>                                         restarts a job for
>>>>>>>                                         whatever reason?
>>>>>>>
>>>>>>>                                         That way I could stop and
>>>>>>>                                         restart the job a few
>>>>>>>                                         times, or maybe I can
>>>>>>>                                         trick my job into failing
>>>>>>>                                         and have the scheduler
>>>>>>>                                         restart it. Ok, let me
>>>>>>>                                         think about this...
>>>>>>>
>>>>>>>                                         On Wed, Mar 30, 2022 at
>>>>>>>                                         10:24 AM 胡伟华
>>>>>>>                                         <hu...@gmail.com>
>>>>>>>                                         wrote:
>>>>>>>
>>>>>>>>                                             So if I run the
>>>>>>>>                                             same jobs in my dev
>>>>>>>>                                             env will I still be
>>>>>>>>                                             able to see the
>>>>>>>>                                             similar dump?
>>>>>>>                                             I think the problem
>>>>>>>                                             should be reproducible
>>>>>>>                                             by running the same job
>>>>>>>                                             in dev; maybe you can
>>>>>>>                                             give it a try.
>>>>>>>
>>>>>>>>                                             If not I would
>>>>>>>>                                             have to wait for a
>>>>>>>>                                             low-volume time to
>>>>>>>>                                             do it on
>>>>>>>>                                             production. Also, if
>>>>>>>>                                             I recall, the dump
>>>>>>>>                                             is as big as the
>>>>>>>>                                             JVM memory, right? So
>>>>>>>>                                             if I have 10GB
>>>>>>>>                                             configured for the
>>>>>>>>                                             JVM, the dump will
>>>>>>>>                                             be a 10GB file?
>>>>>>>                                             Yes, jmap will pause
>>>>>>>                                             the JVM; the length of
>>>>>>>                                             the pause depends on
>>>>>>>                                             the size of the dump.
>>>>>>>                                             You can use "jmap
>>>>>>>                                             -dump:live" to dump
>>>>>>>                                             only the reachable
>>>>>>>                                             objects; this will
>>>>>>>                                             take only a brief pause.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>>                                             On March 30, 2022, at
>>>>>>>>                                             9:47 PM, John Smith
>>>>>>>>                                             <ja...@gmail.com>
>>>>>>>>                                             wrote:
>>>>>>>>
>>>>>>>>                                             I have 3 task
>>>>>>>>                                             managers (see
>>>>>>>>                                             config below).
>>>>>>>>                                             There is a total of
>>>>>>>>                                             10 jobs with 25
>>>>>>>>                                             slots being used.
>>>>>>>>                                             The jobs are 100%
>>>>>>>>                                             ETL, i.e. they load
>>>>>>>>                                             JSON, transform it
>>>>>>>>                                             and push it to
>>>>>>>>                                             JDBC; only 1 job of
>>>>>>>>                                             the 10 is pushing
>>>>>>>>                                             to an Apache Ignite
>>>>>>>>                                             cluster.
>>>>>>>>
>>>>>>>>                                             As for jmap, I know
>>>>>>>>                                             that it will pause
>>>>>>>>                                             the task manager.
>>>>>>>>                                             So if I run the
>>>>>>>>                                             same jobs in my dev
>>>>>>>>                                             env, will I still be
>>>>>>>>                                             able to see a
>>>>>>>>                                             similar dump? I
>>>>>>>>                                             assume so. If not I
>>>>>>>>                                             would have to wait
>>>>>>>>                                             for a low-volume
>>>>>>>>                                             time to do it on
>>>>>>>>                                             production. Also, if
>>>>>>>>                                             I recall, the dump
>>>>>>>>                                             is as big as the
>>>>>>>>                                             JVM memory, right? So
>>>>>>>>                                             if I have 10GB
>>>>>>>>                                             configured for the
>>>>>>>>                                             JVM, the dump will
>>>>>>>>                                             be a 10GB file?
>>>>>>>>
>>>>>>>>
>>>>>>>>                                             # Operating system has 16GB total.
>>>>>>>>                                             env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>
>>>>>>>>                                             cluster.evenly-spread-out-slots: true
>>>>>>>>
>>>>>>>>                                             taskmanager.memory.flink.size: 10240m
>>>>>>>>                                             taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>                                             taskmanager.numberOfTaskSlots: 16
>>>>>>>>                                             parallelism.default: 1
>>>>>>>>
>>>>>>>>                                             high-availability: zookeeper
>>>>>>>>                                             high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>                                             high-availability.zookeeper.quorum: ...
>>>>>>>>                                             high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>                                             high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>
>>>>>>>>                                             web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>
>>>>>>>>                                             state.backend: rocksdb
>>>>>>>>                                             state.backend.incremental: true
>>>>>>>>                                             state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>                                             state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>
>>>>>>>>                                             On Wed, Mar 30,
>>>>>>>>                                             2022 at 2:16 AM 胡伟华
>>>>>>>>                                             <hu...@gmail.com>
>>>>>>>>                                             wrote:
>>>>>>>>
>>>>>>>>                                                 Hi, John
>>>>>>>>
>>>>>>>>                                                 Could you tell
>>>>>>>>                                                 us your
>>>>>>>>                                                 application
>>>>>>>>                                                 scenario? Is it
>>>>>>>>                                                 a Flink session
>>>>>>>>                                                 cluster with a
>>>>>>>>                                                 lot of jobs?
>>>>>>>>
>>>>>>>>                                                 Maybe you can
>>>>>>>>                                                 try to dump the
>>>>>>>>                                                 memory with
>>>>>>>>                                                 jmap and use
>>>>>>>>                                                 tools such as
>>>>>>>>                                                 MAT to analyze
>>>>>>>>                                                 whether there
>>>>>>>>                                                 are abnormal
>>>>>>>>                                                 classes and
>>>>>>>>                                                 classloaders
>>>>>>>>
>>>>>>>>
>>>>>>>>                                                 > On Mar 30, 2022,
>>>>>>>>                                                 at 6:09 AM, John
>>>>>>>>                                                 Smith
>>>>>>>>                                                 <ja...@gmail.com>
>>>>>>>>                                                 wrote:
>>>>>>>>                                                 >
>>>>>>>>                                                 > Hi running 1.14.4
>>>>>>>>                                                 >
>>>>>>>>                                                 > My task
>>>>>>>>                                                 manager still
>>>>>>>>                                                 fails with
>>>>>>>>                                                 java.lang.OutOfMemoryError:
>>>>>>>>                                                 Metaspace. The
>>>>>>>>                                                 metaspace
>>>>>>>>                                                 out-of-memory
>>>>>>>>                                                 error has
>>>>>>>>                                                 occurred. This
>>>>>>>>                                                 can mean two
>>>>>>>>                                                 things: either
>>>>>>>>                                                 the job
>>>>>>>>                                                 requires a
>>>>>>>>                                                 larger size of
>>>>>>>>                                                 JVM metaspace
>>>>>>>>                                                 to load classes
>>>>>>>>                                                 or there is a
>>>>>>>>                                                 class loading leak.
>>>>>>>>                                                 >
>>>>>>>>                                                 > I have 2GB of
>>>>>>>>                                                 metaspace
>>>>>>>>                                                 configured:
>>>>>>>>                                                 taskmanager.memory.jvm-metaspace.size:
>>>>>>>>                                                 2048m
>>>>>>>>                                                 >
>>>>>>>>                                                 > But the task
>>>>>>>>                                                 nodes still fail.
>>>>>>>>                                                 >
>>>>>>>>                                                 > When looking
>>>>>>>>                                                 at the UI
>>>>>>>>                                                 metrics, the
>>>>>>>>                                                 metaspace
>>>>>>>>                                                 starts low. Now
>>>>>>>>                                                 I see 85%
>>>>>>>>                                                 usage. It seems
>>>>>>>>                                                 to be a class
>>>>>>>>                                                 loading leak at
>>>>>>>>                                                 this point, how
>>>>>>>>                                                 can we debug
>>>>>>>>                                                 this issue?
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to debug Metaspace exception?

Posted by Chesnay Schepler <ch...@apache.org>.

Pretty sure, even though I seemingly documented it incorrectly :)
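
For anyone landing on this thread later, here is a rough sketch of what "it must be in /lib" amounts to on a standalone cluster. The helper name, its arguments and the paths in the example are placeholders made up for illustration, not anything shipped with Flink. The job jar would then declare the driver as a compile-only (provided) dependency so it is not bundled a second time.

```shell
#!/bin/sh
# Copy JDBC driver jars into Flink's lib/ directory so they are loaded once by
# the parent classloader rather than once per job submission.
# Hypothetical helper, usage: install_driver_jars <flink_home> <jar>...
install_driver_jars() {
    flink_home="$1"
    shift
    # Refuse to continue if this does not look like a Flink installation.
    if [ ! -d "$flink_home/lib" ]; then
        echo "no lib/ directory under $flink_home" >&2
        return 1
    fi
    for jar in "$@"; do
        cp "$jar" "$flink_home/lib/" || return 1
    done
}

# Example with placeholder paths (assumption: Flink is installed in /opt/flink):
#   install_driver_jars /opt/flink /path/to/your-jdbc-driver.jar
```

The copy has to be repeated on every host that runs the affected tasks, and the processes need a restart afterwards, since lib/ is only read at startup.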

On 28/04/2022 15:49, John Smith wrote:
> You sure?
>
>   * /JDBC/: JDBC drivers leak references outside the user code
>     classloader. To ensure that these classes are only loaded once you
>     should either add the driver jars to Flink’s |lib/| folder, or add
>     the driver classes to the list of parent-first loaded class via
>     |classloader.parent-first-patterns-additional|
>     <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.
>
>     It says either or
>
>
> On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ch...@apache.org> 
> wrote:
>
>     You're misinterpreting the docs.
>
>     The parent/child-first classloading controls where Flink looks for
>     a class /first/, specifically whether we first load from /lib or
>     the user-jar.
>     It does not allow you to load something from the user-jar in the
>     parent classloader. That's just not how it works.
>
>     It must be in /lib.
>
>     On 27/04/2022 04:59, John Smith wrote:
>>     Hi Chesnay as per the docs...
>>     https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>>
>>     You can either put the jars in task manager lib folder or use
>>     |classloader.parent-first-patterns-additional|
>>     <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>>
>>     I prefer the latter like this: the dependency stays with the
>>     user-jar and not on the task manager.
>>
>>     On Tue, Apr 26, 2022 at 9:52 PM John Smith
>>     <ja...@gmail.com> wrote:
>>
>>         Ok so I should put the Apache ignite and my Microsoft drivers
>>         in the lib folders of my task managers?
>>
>>         And then in my job jar only include them as compile time
>>         dependencies?
>>
>>
>>         On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler
>>         <ch...@apache.org> wrote:
>>
>>             JDBC drivers are well-known for leaking classloaders
>>             unfortunately.
>>
>>             You have correctly identified your alternatives.
>>
>>             You must put the jdbc driver into /lib instead. Setting
>>             only the parent-first pattern shouldn't affect anything.
>>             That is only relevant if something is in both /lib and
>>             the user-jar, telling Flink to prioritize what is in lib.
>>
>>
>>
>>             On 26/04/2022 15:35, John Smith wrote:
>>>             So I put classloader.parent-first-patterns.additional:
>>>             "org.apache.ignite." in the task config and so far I
>>>             don't think I'm getting "java.lang.OutOfMemoryError:
>>>             Metaspace" any more.
>>>
>>>             Or it's too early to tell.
>>>
>>>             Though now, the task managers are shutting down due to
>>>             some other failures.
>>>
>>>             So maybe because tasks were failing and reloading often
>>>             the task manager was running out of Metaspace. But now
>>>             maybe it's just cleanly shutting down.
>>>
>>>             On Wed, Apr 20, 2022 at 11:35 AM John Smith
>>>             <ja...@gmail.com> wrote:
>>>
>>>                 Or I can put in the config to treat
>>>                 org.apache.ignite. classes as first class?
>>>
>>>                 On Tue, Apr 19, 2022 at 10:18 PM John Smith
>>>                 <ja...@gmail.com> wrote:
>>>
>>>                     Ok, so I loaded the dump into Eclipse MAT and
>>>                     followed:
>>>                     https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>
>>>                     - On the Histogram, I got over 30 entries for:
>>>                     ChildFirstClassLoader
>>>                     - Then I clicked on one of them "Merge Shortest
>>>                     Path..." and picked "Exclude all
>>>                     phantom/weak/soft references"
>>>                     - Which then gave me: SqlDriverManager > Apache
>>>                     Ignite JdbcThin Driver
>>>
>>>                     So I'm guessing this applies to anything JDBC-based.
>>>                     Should I copy the drivers into the task manager lib
>>>                     folder and have my jobs declare them as compile-only?
>>>
>>>                     On Tue, Apr 19, 2022 at 12:18 PM Yaroslav
>>>                     Tkachenko <ya...@goldsky.io> wrote:
>>>
>>>                         Also
>>>                         https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>                         might be helpful (has a section on
>>>                         profiling, as well as classloading).
>>>
>>>                         On Tue, Apr 19, 2022 at 4:35 AM Chesnay
>>>                         Schepler <ch...@apache.org> wrote:
>>>
>>>                             We have a very rough "guide" in the wiki
>>>                             (it's just the specific steps I took to
>>>                             debug another leak):
>>>                             https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>
>>>                             On 19/04/2022 12:01, huweihua wrote:
>>>>                             Hi, John
>>>>
>>>>                             Sorry for the late reply. You can use
>>>>                             MAT[1] to analyze the dump file. Check
>>>>                             whether there are too many loaded classes.
>>>>
>>>>                             [1] https://www.eclipse.org/mat/
>>>>
>>>>>                             On Apr 18, 2022, at 9:55 PM, John Smith
>>>>>                             <ja...@gmail.com> wrote:
>>>>>
>>>>>                             Hi, can anyone help with this? I never
>>>>>                             looked at a dump file before.
>>>>>
>>>>>                             On Thu, Apr 14, 2022 at 11:59 AM John
>>>>>                             Smith <ja...@gmail.com> wrote:
>>>>>
>>>>>                                 Hi, so I have a dump file. What do
>>>>>                                 I look for?
>>>>>
>>>>>                                 On Thu, Mar 31, 2022 at 3:28 PM
>>>>>                                 John Smith
>>>>>                                 <ja...@gmail.com> wrote:
>>>>>
>>>>>                                     Ok so if there's a leak, if I
>>>>>                                     manually stop the job and
>>>>>                                     restart it from the UI
>>>>>                                     multiple times, I won't see
>>>>>                                     the issue because the
>>>>>                                     classes are unloaded correctly?
>>>>>
>>>>>
>>>>>                                     On Thu, Mar 31, 2022 at 9:20
>>>>>                                     AM huweihua
>>>>>                                     <hu...@gmail.com> wrote:
>>>>>
>>>>>
>>>>>                                         The difference is that
>>>>>                                         manually canceling the job
>>>>>                                         stops the JobMaster, but
>>>>>                                         automatic failover keeps
>>>>>                                         the JobMaster running. But
>>>>>                                         looking on TaskManager, it
>>>>>                                         doesn't make much difference
>>>>>
>>>>>
>>>>>>                                         On Mar 31, 2022, at 4:01 AM, John
>>>>>>                                         Smith
>>>>>>                                         <ja...@gmail.com>
>>>>>>                                         wrote:
>>>>>>
>>>>>>                                         Also if I manually cancel
>>>>>>                                         and restart the same job
>>>>>>                                         over and over is it the
>>>>>>                                         same as if flink was
>>>>>>                                         restarting a job due to
>>>>>>                                         failure?
>>>>>>
>>>>>>                                         I.e: When I click "Cancel
>>>>>>                                         Job" on the UI is the job
>>>>>>                                         completely unloaded vs
>>>>>>                                         when the job scheduler
>>>>>>                                         restarts a job for
>>>>>>                                         whatever reason?
>>>>>>
>>>>>>                                         Like this I'll stop and
>>>>>>                                         restart the job a few
>>>>>>                                         times or maybe I can
>>>>>>                                         trick my job to fail and
>>>>>>                                         have the scheduler
>>>>>>                                         restart it. Ok let me
>>>>>>                                         think about this...
>>>>>>
>>>>>>                                         On Wed, Mar 30, 2022 at
>>>>>>                                         10:24 AM 胡伟华
>>>>>>                                         <hu...@gmail.com>
>>>>>>                                         wrote:
>>>>>>
>>>>>>>                                             So if I run the same
>>>>>>>                                             jobs in my dev env
>>>>>>>                                             will I still be able
>>>>>>>                                             to see the similar
>>>>>>>                                             dump?
>>>>>>                                             I think running the
>>>>>>                                             same job in dev
>>>>>>                                             should be
>>>>>>                                             reproducible, maybe
>>>>>>                                             you can have a try.
>>>>>>
>>>>>>>                                              If not I would have
>>>>>>>                                             to wait at a low
>>>>>>>                                             volume time to do it
>>>>>>>                                             on production. Also,
>>>>>>>                                             if I recall, the dump
>>>>>>>                                             is as big as the JVM
>>>>>>>                                             memory, right? So if I
>>>>>>>                                             have 10GB configured
>>>>>>>                                             for the JVM the dump
>>>>>>>                                             will be a 10GB file?
>>>>>>                                             Yes, jmap will pause
>>>>>>                                             the JVM; the length of
>>>>>>                                             the pause depends on the
>>>>>>                                             size of the dump. You can
>>>>>>                                             use "jmap -dump:live"
>>>>>>                                             to dump only the
>>>>>>                                             reachable objects;
>>>>>>                                             this will only take a
>>>>>>                                             brief pause.
>>>>>>
>>>>>>
>>>>>>
>>>>>>>                                             On Mar 30, 2022, at
>>>>>>>                                             9:47 PM, John Smith
>>>>>>>                                             <ja...@gmail.com>
>>>>>>>                                             wrote:
>>>>>>>
>>>>>>>                                             I have 3 task
>>>>>>>                                             managers (see config
>>>>>>>                                             below). There is a
>>>>>>>                                             total of 10 jobs
>>>>>>>                                             with 25 slots being
>>>>>>>                                             used.
>>>>>>>                                             The jobs are 100%
>>>>>>>                                             ETL, i.e. they load
>>>>>>>                                             JSON, transform it
>>>>>>>                                             and push it to JDBC,
>>>>>>>                                             only 1 job of the 10
>>>>>>>                                             is pushing to Apache
>>>>>>>                                             Ignite cluster.
>>>>>>>
>>>>>>>                                             FOR JMAP. I know
>>>>>>>                                             that it will pause
>>>>>>>                                             the task manager. So
>>>>>>>                                             if I run the same
>>>>>>>                                             jobs in my dev env
>>>>>>>                                             will I still be able
>>>>>>>                                             to see the similar
>>>>>>>                                             dump? I assume so.
>>>>>>>                                             If not I would have
>>>>>>>                                             to wait at a low
>>>>>>>                                             volume time to do it
>>>>>>>                                             on production. Also,
>>>>>>>                                             if I recall, the dump
>>>>>>>                                             is as big as the JVM
>>>>>>>                                             memory, right? So if I
>>>>>>>                                             have 10GB configured
>>>>>>>                                             for the JVM the dump
>>>>>>>                                             will be a 10GB file?
>>>>>>>
>>>>>>>
>>>>>>>                                             # Operating system has 16GB total.
>>>>>>>                                             env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>
>>>>>>>                                             cluster.evenly-spread-out-slots: true
>>>>>>>
>>>>>>>                                             taskmanager.memory.flink.size: 10240m
>>>>>>>                                             taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>                                             taskmanager.numberOfTaskSlots: 16
>>>>>>>                                             parallelism.default: 1
>>>>>>>
>>>>>>>                                             high-availability: zookeeper
>>>>>>>                                             high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>                                             high-availability.zookeeper.quorum: ...
>>>>>>>                                             high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>                                             high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>
>>>>>>>                                             web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>
>>>>>>>                                             state.backend: rocksdb
>>>>>>>                                             state.backend.incremental: true
>>>>>>>                                             state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>                                             state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>
>>>>>>>                                             On Wed, Mar 30, 2022
>>>>>>>                                             at 2:16 AM 胡伟华
>>>>>>>                                             <hu...@gmail.com>
>>>>>>>                                             wrote:
>>>>>>>
>>>>>>>                                                 Hi, John
>>>>>>>
>>>>>>>                                                 Could you tell
>>>>>>>                                                 us your
>>>>>>>                                                 application
>>>>>>>                                                 scenario? Is it
>>>>>>>                                                 a Flink session
>>>>>>>                                                 cluster with a
>>>>>>>                                                 lot of jobs?
>>>>>>>
>>>>>>>                                                 Maybe you can
>>>>>>>                                                 try to dump the
>>>>>>>                                                 memory with jmap
>>>>>>>                                                 and use tools
>>>>>>>                                                 such as MAT to
>>>>>>>                                                 analyze whether
>>>>>>>                                                 there are
>>>>>>>                                                 abnormal classes
>>>>>>>                                                 and classloaders
>>>>>>>
>>>>>>>
>>>>>>>                                                 > On Mar 30, 2022,
>>>>>>>                                                 at 6:09 AM, John
>>>>>>>                                                 Smith
>>>>>>>                                                 <ja...@gmail.com>
>>>>>>>                                                 wrote:
>>>>>>>                                                 >
>>>>>>>                                                 > Hi running 1.14.4
>>>>>>>                                                 >
>>>>>>>                                                 > My task
>>>>>>>                                                 manager still
>>>>>>>                                                 fails with
>>>>>>>                                                 java.lang.OutOfMemoryError:
>>>>>>>                                                 Metaspace. The
>>>>>>>                                                 metaspace
>>>>>>>                                                 out-of-memory
>>>>>>>                                                 error has
>>>>>>>                                                 occurred. This
>>>>>>>                                                 can mean two
>>>>>>>                                                 things: either
>>>>>>>                                                 the job requires
>>>>>>>                                                 a larger size of
>>>>>>>                                                 JVM metaspace to
>>>>>>>                                                 load classes or
>>>>>>>                                                 there is a class
>>>>>>>                                                 loading leak.
>>>>>>>                                                 >
>>>>>>>                                                 > I have 2GB of
>>>>>>>                                                 metaspace
>>>>>>>                                                 configured:
>>>>>>>                                                 taskmanager.memory.jvm-metaspace.size:
>>>>>>>                                                 2048m
>>>>>>>                                                 >
>>>>>>>                                                 > But the task
>>>>>>>                                                 nodes still fail.
>>>>>>>                                                 >
>>>>>>>                                                 > When looking
>>>>>>>                                                 at the UI
>>>>>>>                                                 metrics, the
>>>>>>>                                                 metaspace starts
>>>>>>>                                                 low. Now I see
>>>>>>>                                                 85% usage. It
>>>>>>>                                                 seems to be a
>>>>>>>                                                 class loading
>>>>>>>                                                 leak at this
>>>>>>>                                                 point, how can
>>>>>>>                                                 we debug this issue?
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

You sure?

   - *JDBC*: JDBC drivers leak references outside the user code classloader.
     To ensure that these classes are only loaded once you should either add the
     driver jars to Flink’s lib/ folder, or add the driver classes to the
     list of parent-first loaded class via
     classloader.parent-first-patterns-additional
     <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>.

   It says either or


On Wed, Apr 27, 2022 at 3:44 AM Chesnay Schepler <ch...@apache.org> wrote:

> You're misinterpreting the docs.
>
> The parent/child-first classloading controls where Flink looks for a class
> *first*, specifically whether we first load from /lib or the user-jar.
> It does not allow you to load something from the user-jar in the parent
> classloader. That's just not how it works.
>
> It must be in /lib.
>
> On 27/04/2022 04:59, John Smith wrote:
>
> Hi Chesnay as per the docs...
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>
> You can either put the jars in task manager lib folder or use
> classloader.parent-first-patterns-additional
> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>
> I prefer the latter like this: the dependency stays with the user-jar and
> not on the task manager.
>
> On Tue, Apr 26, 2022 at 9:52 PM John Smith <ja...@gmail.com> wrote:
>
>> Ok so I should put the Apache ignite and my Microsoft drivers in the lib
>> folders of my task managers?
>>
>> And then in my job jar only include them as compile time dependencies?
>>
>>
>> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ch...@apache.org>
>> wrote:
>>
>>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>>
>>> You have correctly identified your alternatives.
>>>
>>> You must put the jdbc driver into /lib instead. Setting only the
>>> parent-first pattern shouldn't affect anything.
>>> That is only relevant if something is in both /lib and the user-jar,
>>> telling Flink to prioritize what is in lib.
>>>
>>>
>>>
>>> On 26/04/2022 15:35, John Smith wrote:
>>>
>>> So I put classloader.parent-first-patterns.additional:
>>> "org.apache.ignite." in the task config and so far I don't think I'm
>>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>>
>>> Or it's too early to tell.
>>>
>>> Though now, the task managers are shutting down due to some
>>> other failures.
>>>
>>> So maybe because tasks were failing and reloading often the task manager
>>> was running out of Metaspace. But now maybe it's just cleanly shutting down.
>>>
>>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <ja...@gmail.com>
>>> wrote:
>>>
>>>> Or I can put in the config to treat org.apache.ignite. classes as first
>>>> class?
>>>>
>>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>
>>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>>>>> "Exclude all phantom/weak/soft references"
>>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>>>
>>>>> So i'm guessing anything JDBC based. I should copy into the task
>>>>> manager libs folder and my jobs make the dependencies as compile only?
>>>>>
>>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>>> yaroslav@goldsky.io> wrote:
>>>>>
>>>>>> Also
>>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>>>
>>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ch...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> We have a very rough "guide" in the wiki (it's just the specific
>>>>>>> steps I took to debug another leak):
>>>>>>>
>>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>>
>>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>>
>>>>>>> Hi, John
>>>>>>>
>>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>>>>>>> file. Check whether there are too many loaded classes.
>>>>>>>
>>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>>
>>>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>
>>>>>>> Hi, can anyone help with this? I have never looked at a dump file before.
>>>>>>>
>>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>>
>>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <ja...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Ok, so if there's a leak and I manually stop the job and restart it
>>>>>>>>> from the UI multiple times, I won't see the issue because the
>>>>>>>>> classes are unloaded correctly?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <hu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running. From the
>>>>>>>>>> TaskManager's side, though, it doesn't make much difference.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Also, if I manually cancel and restart the same job over and over,
>>>>>>>>>> is it the same as if Flink was restarting a job due to failure?
>>>>>>>>>>
>>>>>>>>>> I.e., when I click "Cancel Job" on the UI, is the job completely
>>>>>>>>>> unloaded, versus when the job scheduler restarts a job for whatever
>>>>>>>>>> reason?
>>>>>>>>>>
>>>>>>>>>> That way I can stop and restart the job a few times, or maybe I
>>>>>>>>>> can trick my job into failing and have the scheduler restart it. Ok, let
>>>>>>>>>> me think about this...
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> So if I run the same jobs in my dev env will I still be able to
>>>>>>>>>>> see a similar dump?
>>>>>>>>>>>
>>>>>>>>>>> I think running the same job in dev should reproduce it; maybe
>>>>>>>>>>> you can give it a try.
>>>>>>>>>>>
>>>>>>>>>>> If not I would have to wait for a low-volume time to do it on
>>>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM memory, right? So
>>>>>>>>>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>>>>>>
>>>>>>>>>>> Yes, jmap will pause the JVM, and the length of the pause depends on the
>>>>>>>>>>> size of the dump. You can use "jmap -dump:live" to dump only the reachable
>>>>>>>>>>> objects, which should keep the pause brief.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I have 3 task managers (see config below). There is a total of 10
>>>>>>>>>>> jobs with 25 slots being used.
>>>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push
>>>>>>>>>>> it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>>>>>>>
>>>>>>>>>>> For jmap: I know that it will pause the task manager. So if I
>>>>>>>>>>> run the same jobs in my dev env will I still be able to see a similar
>>>>>>>>>>> dump? I assume so. If not I would have to wait for a low-volume time to do
>>>>>>>>>>> it on production. Also, if I recall, the dump is as big as the JVM memory,
>>>>>>>>>>> right? So if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>>
>>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>>
>>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>>
>>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>>
>>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>>
>>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi, John
>>>>>>>>>>>>
>>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>>
>>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools
>>>>>>>>>>>> such as MAT to analyze whether there are abnormal classes and classloaders.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Hi, I'm running 1.14.4.
>>>>>>>>>>>> >
>>>>>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>>>>>> Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>>>>>>>>>> two things: either the job requires a larger size of JVM metaspace to load
>>>>>>>>>>>> classes or there is a class loading leak.
>>>>>>>>>>>> >
>>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>>> >
>>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>>> >
>>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now
>>>>>>>>>>>> I see 85% usage. It seems to be a class loading leak at this point. How can
>>>>>>>>>>>> we debug this issue?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>>
>>>
>

Re: How to debug Metaspace exception?

Posted by Chesnay Schepler <ch...@apache.org>.

You're misinterpreting the docs.

The parent/child-first classloading controls where Flink looks for a 
class /first/, specifically whether we first load from /lib or the user-jar.
It does not allow you to load something from the user-jar in the parent 
classloader. That's just not how it works.

It must be in /lib.
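
To illustrate the lookup order described above, here is a minimal sketch in
Java. It is a toy model, not Flink's actual classloader code: the class
LookupOrderSketch, the resolve method, and the sets standing in for /lib and
the user-jar are all made up for illustration. It only models which location
is consulted first.

```java
import java.util.List;
import java.util.Set;

/** Toy model of class lookup order; NOT Flink's real classloader code. */
public class LookupOrderSketch {

    /**
     * Returns the location that would serve the class: "lib", "user-jar",
     * or "not found". A parent-first pattern only flips the search order.
     */
    public static String resolve(String className,
                                 Set<String> lib,
                                 Set<String> userJar,
                                 List<String> parentFirstPatterns) {
        boolean parentFirst =
                parentFirstPatterns.stream().anyMatch(className::startsWith);
        Set<String> first = parentFirst ? lib : userJar;
        Set<String> second = parentFirst ? userJar : lib;
        if (first.contains(className)) {
            return parentFirst ? "lib" : "user-jar";
        }
        if (second.contains(className)) {
            return parentFirst ? "user-jar" : "lib";
        }
        return "not found";
    }

    public static void main(String[] args) {
        // The class name is used purely as a string here.
        String driver = "org.apache.ignite.IgniteJdbcThinDriver";
        List<String> patterns = List.of("org.apache.ignite.");

        // Driver only in the user-jar: the pattern cannot move it to the parent.
        System.out.println(resolve(driver, Set.of(), Set.of(driver), patterns));
        // Driver in both places: the pattern makes /lib win.
        System.out.println(resolve(driver, Set.of(driver), Set.of(driver), patterns));
        // Driver in both places, no pattern: child-first, so the user-jar wins.
        System.out.println(resolve(driver, Set.of(driver), Set.of(driver), List.of()));
    }
}
```

In the first case the pattern changes nothing: the class is still served
from the user-jar, and therefore by the user-code classloader.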

On 27/04/2022 04:59, John Smith wrote:
> Hi Chesnay as per the docs... 
> https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/
>
> You can either put the jars in the task manager lib folder or use
> classloader.parent-first-patterns.additional
> <https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>
>
> I prefer the latter because then the dependency stays with the user-jar
> and not on the task manager.

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

Hi Chesnay, as per the docs...
https://nightlies.apache.org/flink/flink-docs-master/docs/ops/debugging/debugging_classloading/

You can either put the jars in the task manager lib folder or use
classloader.parent-first-patterns.additional
<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#classloader-parent-first-patterns-additional>

I prefer the latter because then the dependency stays with the user-jar and
not on the task manager.
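
For context on why the driver's location matters at all: java.sql.DriverManager
keeps registered drivers in a static registry, so a driver registered from the
user-jar stays reachable, together with the classloader that defined it, until
something deregisters it. The sketch below shows only that mechanism. The
class DriverRegistrySketch, the NoopDriver stand-in, and the helper methods
are hypothetical, and deregistering by classloader is an assumption made for
illustration, not something recommended in this thread.

```java
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Collections;
import java.util.Properties;
import java.util.logging.Logger;

/** Sketch of how DriverManager's static registry keeps drivers reachable. */
public class DriverRegistrySketch {

    /** Hypothetical do-nothing driver standing in for a real JDBC driver. */
    public static class NoopDriver implements Driver {
        @Override public Connection connect(String url, Properties info) { return null; }
        @Override public boolean acceptsURL(String url) { return false; }
        @Override public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
            return new DriverPropertyInfo[0];
        }
        @Override public int getMajorVersion() { return 1; }
        @Override public int getMinorVersion() { return 0; }
        @Override public boolean jdbcCompliant() { return false; }
        @Override public Logger getParentLogger() throws SQLFeatureNotSupportedException {
            throw new SQLFeatureNotSupportedException("not used in this sketch");
        }
    }

    /** True if the given driver instance is currently held by DriverManager. */
    public static boolean isRegistered(Driver driver) {
        return Collections.list(DriverManager.getDrivers()).contains(driver);
    }

    /** Deregisters every visible driver whose class was defined by the given loader. */
    public static int deregisterDriversOf(ClassLoader loader) throws SQLException {
        int removed = 0;
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            if (driver.getClass().getClassLoader() == loader) {
                DriverManager.deregisterDriver(driver);
                removed++;
            }
        }
        return removed;
    }

    /** Registers a driver, then reports whether it was held and later released. */
    public static boolean demo() {
        try {
            Driver driver = new NoopDriver();
            DriverManager.registerDriver(driver);
            boolean heldBefore = isRegistered(driver);
            deregisterDriversOf(NoopDriver.class.getClassLoader());
            boolean heldAfter = isRegistered(driver);
            return heldBefore && !heldAfter;
        } catch (SQLException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("held, then released: " + demo());
    }
}
```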

On Tue, Apr 26, 2022 at 9:52 PM John Smith <ja...@gmail.com> wrote:

> Ok so I should put the Apache ignite and my Microsoft drivers in the lib
> folders of my task managers?
>
> And then in my job jar only include them as compile time dependencies?
>
>
> On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ch...@apache.org>
> wrote:
>
>> JDBC drivers are well-known for leaking classloaders unfortunately.
>>
>> You have correctly identified your alternatives.
>>
>> You must put the jdbc driver into /lib instead. Setting only the
>> parent-first pattern shouldn't affect anything.
>> That is only relevant if something is in both in /lib and the user-jar,
>> telling Flink to prioritize what is in lib.
>>
>>
>>
>> On 26/04/2022 15:35, John Smith wrote:
>>
>> So I put classloader.parent-first-patterns.additional:
>> "org.apache.ignite." in the task config and so far I don't think I'm
>> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>>
>> Or it's too early to tell.
>>
>> Though now, the task managers are shutting down due to some
>> other failures.
>>
>> So maybe because tasks were failing and reloading often the task manager
>> was running out of Metspace. But now maybe it's just cleanly shutting down.
>>
>> On Wed, Apr 20, 2022 at 11:35 AM John Smith <ja...@gmail.com>
>> wrote:
>>
>>> Or I can put in the config to treat org.apache.ignite. classes as first
>>> class?
>>>
>>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <ja...@gmail.com>
>>> wrote:
>>>
>>>> Ok, so I loaded the dump into Eclipse Mat and followed:
>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>
>>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>>> - Then I clicked on one of them "Merge Shortest Path..." and picked
>>>> "Exclude all phantom/weak/soft references"
>>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>>
>>>> So i'm guessing anything JDBC based. I should copy into the task
>>>> manager libs folder and my jobs make the dependencies as compile only?
>>>>
>>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <
>>>> yaroslav@goldsky.io> wrote:
>>>>
>>>>> Also
>>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>>
>>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ch...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> We have a very rough "guide" in the wiki (it's just the specific
>>>>>> steps I took to debug another leak):
>>>>>>
>>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>>
>>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>>
>>>>>> Hi, John
>>>>>>
>>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump
>>>>>> file. Check whether have too many loaded classes.
>>>>>>
>>>>>> [1] https://www.eclipse.org/mat/
>>>>>>
>>>>>> 2022年4月18日 下午9:55，John Smith <ja...@gmail.com> 写道：
>>>>>>
>>>>>> Hi, can anyone help with this? I never looked at a dump file before.
>>>>>>
>>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>>
>>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <ja...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Ok so if there's a leak, if I manually stop the job and restart it
>>>>>>>> from the UI multiple times, I won't see the issue because because the
>>>>>>>> classes are unloaded correctly?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <hu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running. But looking
>>>>>>>>> on TaskManager, it doesn't make much difference
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2022年3月31日 上午4:01，John Smith <ja...@gmail.com> 写道：
>>>>>>>>>
>>>>>>>>> Also if I manually cancel and restart the same job over and over
>>>>>>>>> is it the same as if flink was restarting a job due to failure?
>>>>>>>>>
>>>>>>>>> I.e: When I click "Cancel Job" on the UI is the job completely
>>>>>>>>> unloaded vs when the job scheduler restarts a job because if whatever
>>>>>>>>> reason?
>>>>>>>>>
>>>>>>>>> Lile this I'll stop and restart the job a few times or maybe I can
>>>>>>>>> trick my job to fail and have the scheduler restart it. Ok let me think
>>>>>>>>> about this...
>>>>>>>>>
>>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> So if I run the same jobs in my dev env will I still be able to
>>>>>>>>>> see the similar dump?
>>>>>>>>>>
>>>>>>>>>> I think running the same job in dev should be reproducible, maybe
>>>>>>>>>> you can have a try.
>>>>>>>>>>
>>>>>>>>>>  If not I would have to wait at a low volume time to do it on
>>>>>>>>>> production. Aldo if I recall the dump is as big as the JVM memory right so
>>>>>>>>>> if I have 10GB configed for the JVM the dump will be 10GB file?
>>>>>>>>>>
>>>>>>>>>> Yes, JMAP will pause the JVM, the time of pause depends on the
>>>>>>>>>> size to dump. you can use "jmap -dump:live" to dump only the reachable
>>>>>>>>>> objects, this will take a brief pause
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> 2022年3月30日 下午9:47，John Smith <ja...@gmail.com> 写道：
>>>>>>>>>>
>>>>>>>>>> I have 3 task managers (see config below). There is total of 10
>>>>>>>>>> jobs with 25 slots being used.
>>>>>>>>>> The jobs are 100% ETL I.e; They load Json, transform it and push
>>>>>>>>>> it to JDBC, only 1 job of the 10 is pushing to Apache Ignite cluster.
>>>>>>>>>>
>>>>>>>>>> FOR JMAP. I know that it will pause the task manager. So if I run
>>>>>>>>>> the same jobs in my dev env will I still be able to see the similar dump? I
>>>>>>>>>> I assume so. If not I would have to wait at a low volume time to do it on
>>>>>>>>>> production. Aldo if I recall the dump is as big as the JVM memory right so
>>>>>>>>>> if I have 10GB configed for the JVM the dump will be 10GB file?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>>
>>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>>
>>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>>> parallelism.default: 1
>>>>>>>>>>
>>>>>>>>>> high-availability: zookeeper
>>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>>
>>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>>
>>>>>>>>>> state.backend: rocksdb
>>>>>>>>>> state.backend.incremental: true
>>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>>
>>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi, John
>>>>>>>>>>>
>>>>>>>>>>> Could you tell us your application scenario? Is it a Flink
>>>>>>>>>>> session cluster with a lot of jobs?
>>>>>>>>>>>
>>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools
>>>>>>>>>>> such as MAT to analyze whether there are abnormal classes and classloaders.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>>> >
>>>>>>>>>>> > Hi, I'm running 1.14.4.
>>>>>>>>>>> >
>>>>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>>>>>>>>> > two things: either the job requires a larger size of JVM metaspace to load
>>>>>>>>>>> > classes or there is a class loading leak.
>>>>>>>>>>> >
>>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>>> >
>>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>>> >
>>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now
>>>>>>>>>>> > I see 85% usage. It seems to be a class loading leak at this point. How can
>>>>>>>>>>> > we debug this issue?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>>
>>

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

Ok, so I should put the Apache Ignite and my Microsoft drivers in the lib
folders of my task managers?

And then in my job jar only include them as compile-time dependencies?


On Tue, Apr 26, 2022 at 10:42 AM Chesnay Schepler <ch...@apache.org>
wrote:

> JDBC drivers are well-known for leaking classloaders unfortunately.
>
> You have correctly identified your alternatives.
>
> You must put the JDBC driver into /lib instead. Setting only the
> parent-first pattern shouldn't affect anything.
> That is only relevant if something is in both /lib and the user-jar,
> telling Flink to prioritize what is in /lib.
>
>
>
> On 26/04/2022 15:35, John Smith wrote:
>
> So I put classloader.parent-first-patterns.additional:
> "org.apache.ignite." in the task config and so far I don't think I'm
> getting "java.lang.OutOfMemoryError: Metaspace" any more.
>
> Or it's too early to tell.
>
> Though now, the task managers are shutting down due to some other failures.
>
> So maybe because tasks were failing and reloading often, the task manager
> was running out of Metaspace. But now maybe it's just cleanly shutting down.
>
> On Wed, Apr 20, 2022 at 11:35 AM John Smith <ja...@gmail.com>
> wrote:
>
>> Or can I put in the config to treat org.apache.ignite. classes as
>> parent-first?
>>
>> On Tue, Apr 19, 2022 at 10:18 PM John Smith <ja...@gmail.com>
>> wrote:
>>
>>> Ok, so I loaded the dump into Eclipse MAT and followed:
>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>
>>> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
>>> - Then on one of them I clicked "Merge Shortest Path..." and picked
>>> "Exclude all phantom/weak/soft references"
>>> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>>>
>>> So I'm guessing it's anything JDBC-based. Should I copy the drivers into the
>>> task manager's lib folder and have my jobs declare the dependencies as
>>> compile-only?
>>>
>>> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <ya...@goldsky.io>
>>> wrote:
>>>
>>>> Also
>>>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>>>> might be helpful (has a section on profiling, as well as classloading).
>>>>
>>>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ch...@apache.org>
>>>> wrote:
>>>>
>>>>> We have a very rough "guide" in the wiki (it's just the specific steps
>>>>> I took to debug another leak):
>>>>>
>>>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>>>
>>>>> On 19/04/2022 12:01, huweihua wrote:
>>>>>
>>>>> Hi, John
>>>>>
>>>>> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
>>>>> Check whether there are too many loaded classes.
>>>>>
>>>>> [1] https://www.eclipse.org/mat/
>>>>>
>>>>> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>>>>
>>>>> Hi, can anyone help with this? I've never looked at a dump file before.
>>>>>
>>>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <ja...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hi, so I have a dump file. What do I look for?
>>>>>>
>>>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <ja...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Ok, so if there's a leak and I manually stop the job and restart it
>>>>>>> from the UI multiple times, I won't see the issue because the
>>>>>>> classes are unloaded correctly?
>>>>>>>
>>>>>>>
>>>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <hu...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> The difference is that manually canceling the job stops the
>>>>>>>> JobMaster, but automatic failover keeps the JobMaster running. From the
>>>>>>>> TaskManager's point of view, though, it doesn't make much difference.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>
>>>>>>>> Also, if I manually cancel and restart the same job over and over, is
>>>>>>>> it the same as if Flink was restarting a job due to failure?
>>>>>>>>
>>>>>>>> I.e., when I click "Cancel Job" on the UI, is the job completely
>>>>>>>> unloaded, vs. when the job scheduler restarts a job for whatever
>>>>>>>> reason?
>>>>>>>>
>>>>>>>> Like this, I'll stop and restart the job a few times, or maybe I can
>>>>>>>> trick my job to fail and have the scheduler restart it. Ok, let me think
>>>>>>>> about this...
>>>>>>>>
>>>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> So if I run the same jobs in my dev env will I still be able to
>>>>>>>>> see a similar dump?
>>>>>>>>>
>>>>>>>>> I think the issue should be reproducible by running the same job in
>>>>>>>>> dev; maybe you can give it a try.
>>>>>>>>>
>>>>>>>>> If not, I would have to wait for a low-volume time to do it in
>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM memory, right?
>>>>>>>>> So if I have 10GB configured for the JVM, the dump will be a 10GB file?
>>>>>>>>>
>>>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the
>>>>>>>>> size of the dump. You can use "jmap -dump:live" to dump only the reachable
>>>>>>>>> objects, which keeps the pause brief.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>> I have 3 task managers (see config below). There are a total of 10
>>>>>>>>> jobs with 25 slots being used.
>>>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push
>>>>>>>>> it to JDBC; only 1 of the 10 jobs pushes to an Apache Ignite cluster.
>>>>>>>>>
>>>>>>>>> About jmap: I know that it will pause the task manager. So if I run
>>>>>>>>> the same jobs in my dev env, will I still be able to see a similar dump?
>>>>>>>>> I assume so. If not, I would have to wait for a low-volume time to do it in
>>>>>>>>> production. Also, if I recall, the dump is as big as the JVM memory, right?
>>>>>>>>> So if I have 10GB configured for the JVM, the dump will be a 10GB file?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> # Operating system has 16GB total.
>>>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>>>
>>>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>>>
>>>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>>>> parallelism.default: 1
>>>>>>>>>
>>>>>>>>> high-availability: zookeeper
>>>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>>>
>>>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>>>
>>>>>>>>> state.backend: rocksdb
>>>>>>>>> state.backend.incremental: true
>>>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>>>
>>>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi, John
>>>>>>>>>>
>>>>>>>>>> Could you tell us your application scenario? Is it a Flink session
>>>>>>>>>> cluster with a lot of jobs?
>>>>>>>>>>
>>>>>>>>>> Maybe you can try to dump the memory with jmap and use tools such
>>>>>>>>>> as MAT to analyze whether there are abnormal classes and classloaders.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Hi, I'm running 1.14.4.
>>>>>>>>>> >
>>>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>>>> > Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>>>>>>>> > two things: either the job requires a larger size of JVM metaspace to load
>>>>>>>>>> > classes or there is a class loading leak.
>>>>>>>>>> >
>>>>>>>>>> > I have 2GB of metaspace configured:
>>>>>>>>>> > taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>>>> >
>>>>>>>>>> > But the task nodes still fail.
>>>>>>>>>> >
>>>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I
>>>>>>>>>> > see 85% usage. It seems to be a class loading leak at this point. How can
>>>>>>>>>> > we debug this issue?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>>
>

Re: How to debug Metaspace exception?

Posted by Chesnay Schepler <ch...@apache.org>.

JDBC drivers are well-known for leaking classloaders unfortunately.

You have correctly identified your alternatives.

You must put the JDBC driver into /lib instead. Setting only the
parent-first pattern shouldn't affect anything.
That is only relevant if something is in both /lib and the user-jar,
telling Flink to prioritize what is in /lib.
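
For context, here is a minimal Java sketch of why a driver shipped in the user jar can pin a classloader. java.sql.DriverManager is loaded by the JVM itself and holds on to every registered driver, so a driver class loaded by a job's ChildFirstClassLoader keeps that loader reachable after the job stops. The class DriverLoaderReport and its method are hypothetical names used only for illustration. Note also that DriverManager.getDrivers() only returns drivers visible to the caller's classloader, so the output is a hint rather than a complete picture.

```java
import java.sql.Driver;
import java.sql.DriverManager;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper: reports which classloader loaded each registered JDBC driver. */
final class DriverLoaderReport {

    private DriverLoaderReport() {}

    /** Returns one "driverClass -> classloaderClass" line per driver visible to the caller. */
    static List<String> describeDrivers() {
        List<String> lines = new ArrayList<>();
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            ClassLoader loader = driver.getClass().getClassLoader();
            // A null loader means the bootstrap classloader loaded the driver class.
            String loaderName = (loader == null) ? "bootstrap" : loader.getClass().getName();
            lines.add(driver.getClass().getName() + " -> " + loaderName);
        }
        return lines;
    }

    public static void main(String[] args) {
        describeDrivers().forEach(System.out::println);
    }
}
```

If a line names ChildFirstClassLoader as the loader, the driver came from the job jar. With the driver jar in /lib, the expectation is that the line would name the JVM's application classloader instead, which lives as long as the TaskManager process.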




Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

So I put classloader.parent-first-patterns.additional: "org.apache.ignite."
in the task config and so far I don't think I'm getting
"java.lang.OutOfMemoryError: Metaspace" any more.

Or it's too early to tell.

Though now, the task managers are shutting down due to some other failures.

So maybe because tasks were failing and reloading often, the task manager
was running out of Metaspace. But now maybe it's just cleanly shutting down.
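
One way to judge whether it is "too early to tell" is to watch Metaspace across several job restarts instead of waiting for the next OutOfMemoryError. Below is a minimal Java sketch, assuming a HotSpot JVM where the memory pool is named "Metaspace". MetaspaceProbe is a hypothetical name, and the code reads the JVM it runs in, so it would have to run inside the TaskManager process to say anything about the TaskManager.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper: reads the Metaspace memory pool of the JVM it runs in. */
final class MetaspaceProbe {

    private MetaspaceProbe() {}

    /** Looks up current Metaspace usage, or null if no pool with that name exists. */
    private static MemoryUsage metaspaceUsage() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot names this pool exactly "Metaspace".
            if ("Metaspace".equals(pool.getName())) {
                return pool.getUsage();
            }
        }
        return null;
    }

    /** Returns used Metaspace bytes, or -1 if the pool is not available. */
    static long usedBytes() {
        MemoryUsage usage = metaspaceUsage();
        return (usage == null) ? -1L : usage.getUsed();
    }

    /** Returns the Metaspace limit in bytes, or -1 if undefined or unavailable. */
    static long maxBytes() {
        MemoryUsage usage = metaspaceUsage();
        return (usage == null) ? -1L : usage.getMax();
    }

    public static void main(String[] args) {
        System.out.println("Metaspace used=" + usedBytes() + " max=" + maxBytes());
    }
}
```

If the used value settles at roughly the same level after each restart, classes are probably being unloaded; a step up on every restart would point to a leak. Unloading only happens during garbage collection, so a single reading right after a restart can be misleading.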


Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

Or can I put in the config to treat org.apache.ignite. classes as
parent-first?

On Tue, Apr 19, 2022 at 10:18 PM John Smith <ja...@gmail.com> wrote:

> Ok, so I loaded the dump into Eclipse MAT and followed:
> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>
> - On the Histogram, I got over 30 entries for: ChildFirstClassLoader
> - Then on one of them I clicked "Merge Shortest Path..." and picked
> "Exclude all phantom/weak/soft references"
> - Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver
>
> So I'm guessing it's anything JDBC-based. Should I copy the drivers into the
> task manager's lib folder and have my jobs declare the dependencies as
> compile-only?
>
> On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <ya...@goldsky.io>
> wrote:
>
>> Also
>> https://shopify.engineering/optimizing-apache-flink-applications-tips
>> might be helpful (has a section on profiling, as well as classloading).
>>
>> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ch...@apache.org>
>> wrote:
>>
>>> We have a very rough "guide" in the wiki (it's just the specific steps I
>>> took to debug another leak):
>>>
>>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>>
>>> On 19/04/2022 12:01, huweihua wrote:
>>>
>>> Hi, John
>>>
>>> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
>>> Check whether there are too many loaded classes.
>>>
>>> [1] https://www.eclipse.org/mat/
>>>
>>> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>>
>>> Hi, can anyone help with this? I've never looked at a dump file before.
>>>
>>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <ja...@gmail.com>
>>> wrote:
>>>
>>>> Hi, so I have a dump file. What do I look for?
>>>>
>>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <ja...@gmail.com>
>>>> wrote:
>>>>
>>>>> Ok, so if there's a leak and I manually stop the job and restart it
>>>>> from the UI multiple times, I won't see the issue because the
>>>>> classes are unloaded correctly?
>>>>>
>>>>>
>>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <hu...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> The difference is that manually canceling the job stops the
>>>>>> JobMaster, but automatic failover keeps the JobMaster running. From the
>>>>>> TaskManager's point of view, though, it doesn't make much difference.
>>>>>>
>>>>>>
>>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>
>>>>>> Also, if I manually cancel and restart the same job over and over, is
>>>>>> it the same as if Flink was restarting a job due to failure?
>>>>>>
>>>>>> I.e., when I click "Cancel Job" on the UI, is the job completely
>>>>>> unloaded, vs. when the job scheduler restarts a job for whatever
>>>>>> reason?
>>>>>>
>>>>>> Like this, I'll stop and restart the job a few times, or maybe I can
>>>>>> trick my job to fail and have the scheduler restart it. Ok, let me think
>>>>>> about this...
>>>>>>
>>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com> wrote:
>>>>>>
>>>>>>> So if I run the same jobs in my dev env will I still be able to see
>>>>>>> the similar dump?
>>>>>>>
>>>>>>> I think running the same job in dev should be reproducible, maybe
>>>>>>> you can have a try.
>>>>>>>
>>>>>>> If not I would have to wait at a low volume time to do it on
>>>>>>> production. Also if I recall the dump is as big as the JVM memory right so
>>>>>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>>
>>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the size
>>>>>>> of the dump. You can use "jmap -dump:live" to dump only the reachable
>>>>>>> objects, which causes only a brief pause.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>>
>>>>>>> I have 3 task managers (see config below). There is a total of 10 jobs
>>>>>>> with 25 slots being used.
>>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it
>>>>>>> to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>>>
>>>>>>> For jmap: I know that it will pause the task manager. So if I run
>>>>>>> the same jobs in my dev env will I still be able to see the similar dump? I
>>>>>>> assume so. If not I would have to wait at a low volume time to do it on
>>>>>>> production. Also if I recall the dump is as big as the JVM memory right so
>>>>>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>>
>>>>>>>
>>>>>>> # Operating system has 16GB total.
>>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>>
>>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>>
>>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>>> parallelism.default: 1
>>>>>>>
>>>>>>> high-availability: zookeeper
>>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>>> high-availability.zookeeper.quorum: ...
>>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>>
>>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>>
>>>>>>> state.backend: rocksdb
>>>>>>> state.backend.incremental: true
>>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>>
>>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi, John
>>>>>>>>
>>>>>>>> Could you tell us your application scenario? Is it a Flink session
>>>>>>>> cluster with a lot of jobs?
>>>>>>>>
>>>>>>>> Maybe you can try to dump the memory with jmap and use tools such
>>>>>>>> as MAT to analyze whether there are abnormal classes and classloaders
>>>>>>>>
>>>>>>>>
>>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>>> >
>>>>>>>> > Hi running 1.14.4
>>>>>>>> >
>>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>>> Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>>>>>> two things: either the job requires a larger size of JVM metaspace to load
>>>>>>>> classes or there is a class loading leak.
>>>>>>>> >
>>>>>>>> > I have 2GB of metaspace configured
>>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>>> >
>>>>>>>> > But the task nodes still fail.
>>>>>>>> >
>>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I
>>>>>>>> see 85% usage. It seems to be a class loading leak at this point, how can
>>>>>>>> we debug this issue?
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>
>>>

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

Ok, so I loaded the dump into Eclipse MAT and followed:
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

- On the Histogram, I got over 30 entries for: ChildFirstClassLoader
- Then I clicked on one of them, chose "Merge Shortest Path..." and picked
"Exclude all phantom/weak/soft references"
- Which then gave me: SqlDriverManager > Apache Ignite JdbcThin Driver

So I'm guessing the leak comes from anything JDBC-based. Should I copy the
JDBC drivers into the task manager's lib folder and have my jobs declare
those dependencies as compile-only?
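
For context on why a JDBC driver can pin a job's ChildFirstClassLoader:
java.sql.DriverManager keeps registered drivers in a JVM-wide static
registry, so a driver class loaded from the job jar stays reachable after the
job stops, and so does its class loader. Flink's classloading documentation
suggests placing JDBC driver jars in Flink's lib/ folder so that they are
loaded only once. The sketch below illustrates a different, complementary
idea and is an assumption, not something confirmed in this thread:
deregistering the drivers that were loaded by a given class loader.
JdbcDriverCleanup and NoOpDriver are hypothetical names; only the java.sql
calls are real JDK API.

```java
import java.sql.Connection;
import java.sql.Driver;
import java.sql.DriverManager;
import java.sql.DriverPropertyInfo;
import java.sql.SQLException;
import java.sql.SQLFeatureNotSupportedException;
import java.util.Collections;
import java.util.Properties;
import java.util.logging.Logger;

/** Hypothetical helper for illustration; not part of Flink or Ignite. */
final class JdbcDriverCleanup {

    private JdbcDriverCleanup() {}

    /**
     * Deregisters every JDBC driver whose class was loaded by the given
     * class loader and returns how many drivers were removed.
     */
    static int deregisterDriversLoadedBy(ClassLoader loader) throws SQLException {
        int removed = 0;
        // getDrivers() only lists drivers visible to the caller's class loader.
        for (Driver driver : Collections.list(DriverManager.getDrivers())) {
            if (driver.getClass().getClassLoader() == loader) {
                DriverManager.deregisterDriver(driver);
                removed++;
            }
        }
        return removed;
    }

    /** Stand-in driver used only to demonstrate the helper; it never connects. */
    static final class NoOpDriver implements Driver {
        @Override public Connection connect(String url, Properties info) { return null; }
        @Override public boolean acceptsURL(String url) { return false; }
        @Override public DriverPropertyInfo[] getPropertyInfo(String url, Properties info) {
            return new DriverPropertyInfo[0];
        }
        @Override public int getMajorVersion() { return 0; }
        @Override public int getMinorVersion() { return 1; }
        @Override public boolean jdbcCompliant() { return false; }
        @Override public Logger getParentLogger() throws SQLFeatureNotSupportedException {
            throw new SQLFeatureNotSupportedException("no logging in this stub");
        }
    }
}
```

In a job this might be called from the close() method of the function that
opened the JDBC connection, passing the class loader of the job's own
classes. Whether that is sufficient for the Ignite thin driver has not been
verified here.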

On Tue, Apr 19, 2022 at 12:18 PM Yaroslav Tkachenko <ya...@goldsky.io>
wrote:

> Also https://shopify.engineering/optimizing-apache-flink-applications-tips
> might be helpful (has a section on profiling, as well as classloading).
>
> On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ch...@apache.org>
> wrote:
>
>> We have a very rough "guide" in the wiki (it's just the specific steps I
>> took to debug another leak):
>>
>> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>>
>> On 19/04/2022 12:01, huweihua wrote:
>>
>> Hi, John
>>
>> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
>> Check whether there are too many loaded classes.
>>
>> [1] https://www.eclipse.org/mat/
>>
>> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>
>> Hi, can anyone help with this? I never looked at a dump file before.
>>
>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <ja...@gmail.com>
>> wrote:
>>
>>> Hi, so I have a dump file. What do I look for?
>>>
>>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <ja...@gmail.com>
>>> wrote:
>>>
>>>> Ok so if there's a leak, if I manually stop the job and restart it from
>>>> the UI multiple times, I won't see the issue because the classes
>>>> are unloaded correctly?
>>>>
>>>>
>>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <hu...@gmail.com>
>>>> wrote:
>>>>
>>>>>
>>>>> The difference is that manually canceling the job stops the JobMaster,
>>>>> but automatic failover keeps the JobMaster running. But looking on
>>>>> TaskManager, it doesn't make much difference
>>>>>
>>>>>
>>>>> On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>>>
>>>>> Also if I manually cancel and restart the same job over and over is it
>>>>> the same as if Flink was restarting a job due to failure?
>>>>>
>>>>> I.e.: When I click "Cancel Job" on the UI is the job completely
>>>>> unloaded vs when the job scheduler restarts a job for whatever
>>>>> reason?
>>>>>
>>>>> Like this I'll stop and restart the job a few times or maybe I can
>>>>> trick my job to fail and have the scheduler restart it. Ok let me think
>>>>> about this...
>>>>>
>>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com> wrote:
>>>>>
>>>>>> So if I run the same jobs in my dev env will I still be able to see
>>>>>> the similar dump?
>>>>>>
>>>>>> I think running the same job in dev should be reproducible, maybe you
>>>>>> can have a try.
>>>>>>
>>>>>> If not I would have to wait at a low volume time to do it on
>>>>>> production. Also if I recall the dump is as big as the JVM memory right so
>>>>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>
>>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the size
>>>>>> of the dump. You can use "jmap -dump:live" to dump only the reachable
>>>>>> objects, which causes only a brief pause.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>>
>>>>>> I have 3 task managers (see config below). There is a total of 10 jobs
>>>>>> with 25 slots being used.
>>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it
>>>>>> to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>>
>>>>>> For jmap: I know that it will pause the task manager. So if I run the
>>>>>> same jobs in my dev env will I still be able to see the similar dump? I
>>>>>> assume so. If not I would have to wait at a low volume time to do it on
>>>>>> production. Also if I recall the dump is as big as the JVM memory right so
>>>>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>>
>>>>>>
>>>>>> # Operating system has 16GB total.
>>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>>
>>>>>> cluster.evenly-spread-out-slots: true
>>>>>>
>>>>>> taskmanager.memory.flink.size: 10240m
>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>> taskmanager.numberOfTaskSlots: 16
>>>>>> parallelism.default: 1
>>>>>>
>>>>>> high-availability: zookeeper
>>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>>> high-availability.zookeeper.quorum: ...
>>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>>
>>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>>
>>>>>> state.backend: rocksdb
>>>>>> state.backend.incremental: true
>>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>>
>>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi, John
>>>>>>>
>>>>>>> Could you tell us your application scenario? Is it a Flink session
>>>>>>> cluster with a lot of jobs?
>>>>>>>
>>>>>>> Maybe you can try to dump the memory with jmap and use tools such as
>>>>>>> MAT to analyze whether there are abnormal classes and classloaders
>>>>>>>
>>>>>>>
>>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>>> >
>>>>>>> > Hi running 1.14.4
>>>>>>> >
>>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>>> Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>>>>> two things: either the job requires a larger size of JVM metaspace to load
>>>>>>> classes or there is a class loading leak.
>>>>>>> >
>>>>>>> > I have 2GB of metaspace configured
>>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>>> >
>>>>>>> > But the task nodes still fail.
>>>>>>> >
>>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I
>>>>>>> see 85% usage. It seems to be a class loading leak at this point, how can
>>>>>>> we debug this issue?
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>
>>

Re: How to debug Metaspace exception?

Posted by Yaroslav Tkachenko <ya...@goldsky.io>.

Also https://shopify.engineering/optimizing-apache-flink-applications-tips
might be helpful (has a section on profiling, as well as classloading).

On Tue, Apr 19, 2022 at 4:35 AM Chesnay Schepler <ch...@apache.org> wrote:

> We have a very rough "guide" in the wiki (it's just the specific steps I
> took to debug another leak):
>
> https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks
>
> On 19/04/2022 12:01, huweihua wrote:
>
> Hi, John
>
> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
> Check whether there are too many loaded classes.
>
> [1] https://www.eclipse.org/mat/
>
> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>
> Hi, can anyone help with this? I never looked at a dump file before.
>
> On Thu, Apr 14, 2022 at 11:59 AM John Smith <ja...@gmail.com>
> wrote:
>
>> Hi, so I have a dump file. What do I look for?
>>
>> On Thu, Mar 31, 2022 at 3:28 PM John Smith <ja...@gmail.com>
>> wrote:
>>
>>> Ok so if there's a leak, if I manually stop the job and restart it from
>>> the UI multiple times, I won't see the issue because the classes
>>> are unloaded correctly?
>>>
>>>
>>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <hu...@gmail.com> wrote:
>>>
>>>>
>>>> The difference is that manually canceling the job stops the JobMaster,
>>>> but automatic failover keeps the JobMaster running. But looking on
>>>> TaskManager, it doesn't make much difference
>>>>
>>>>
>>>> On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>>
>>>> Also if I manually cancel and restart the same job over and over is it
>>>> the same as if Flink was restarting a job due to failure?
>>>>
>>>> I.e.: When I click "Cancel Job" on the UI is the job completely unloaded
>>>> vs when the job scheduler restarts a job for whatever reason?
>>>>
>>>> Like this I'll stop and restart the job a few times or maybe I can
>>>> trick my job to fail and have the scheduler restart it. Ok let me think
>>>> about this...
>>>>
>>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com> wrote:
>>>>
>>>>> So if I run the same jobs in my dev env will I still be able to see
>>>>> the similar dump?
>>>>>
>>>>> I think running the same job in dev should be reproducible, maybe you
>>>>> can have a try.
>>>>>
>>>>> If not I would have to wait at a low volume time to do it on
>>>>> production. Also if I recall the dump is as big as the JVM memory right so
>>>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>
>>>>> Yes, jmap will pause the JVM; the length of the pause depends on the size
>>>>> of the dump. You can use "jmap -dump:live" to dump only the reachable
>>>>> objects, which causes only a brief pause.
>>>>>
>>>>>
>>>>>
>>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>>
>>>>> I have 3 task managers (see config below). There is a total of 10 jobs
>>>>> with 25 slots being used.
>>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it to
>>>>> JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>>
>>>>> For jmap: I know that it will pause the task manager. So if I run the
>>>>> same jobs in my dev env will I still be able to see the similar dump? I
>>>>> assume so. If not I would have to wait at a low volume time to do it on
>>>>> production. Also if I recall the dump is as big as the JVM memory right so
>>>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>>
>>>>>
>>>>> # Operating system has 16GB total.
>>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>>
>>>>> cluster.evenly-spread-out-slots: true
>>>>>
>>>>> taskmanager.memory.flink.size: 10240m
>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>> taskmanager.numberOfTaskSlots: 16
>>>>> parallelism.default: 1
>>>>>
>>>>> high-availability: zookeeper
>>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>>> high-availability.zookeeper.quorum: ...
>>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>>
>>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>>
>>>>> state.backend: rocksdb
>>>>> state.backend.incremental: true
>>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>>
>>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com> wrote:
>>>>>
>>>>>> Hi, John
>>>>>>
>>>>>> Could you tell us your application scenario? Is it a Flink session
>>>>>> cluster with a lot of jobs?
>>>>>>
>>>>>> Maybe you can try to dump the memory with jmap and use tools such as
>>>>>> MAT to analyze whether there are abnormal classes and classloaders
>>>>>>
>>>>>>
>>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>>> >
>>>>>> > Hi running 1.14.4
>>>>>> >
>>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>>> Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>>>> two things: either the job requires a larger size of JVM metaspace to load
>>>>>> classes or there is a class loading leak.
>>>>>> >
>>>>>> > I have 2GB of metaspace configured
>>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>>> >
>>>>>> > But the task nodes still fail.
>>>>>> >
>>>>>> > When looking at the UI metrics, the metaspace starts low. Now I see
>>>>>> 85% usage. It seems to be a class loading leak at this point, how can we
>>>>>> debug this issue?
>>>>>>
>>>>>>
>>>>>
>>>>
>
>

Re: How to debug Metaspace exception?

Posted by Chesnay Schepler <ch...@apache.org>.

We have a very rough "guide" in the wiki (it's just the specific steps I 
took to debug another leak):
https://cwiki.apache.org/confluence/display/FLINK/Debugging+ClassLoader+leaks

On 19/04/2022 12:01, huweihua wrote:
> Hi, John
>
> Sorry for the late reply. You can use MAT[1] to analyze the dump file.
> Check whether there are too many loaded classes.
>
> [1] https://www.eclipse.org/mat/
>
>> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
>>
>> Hi, can anyone help with this? I never looked at a dump file before.
>>
>> On Thu, Apr 14, 2022 at 11:59 AM John Smith <ja...@gmail.com> 
>> wrote:
>>
>>     Hi, so I have a dump file. What do I look for?
>>
>>     On Thu, Mar 31, 2022 at 3:28 PM John Smith
>>     <ja...@gmail.com> wrote:
>>
>>         Ok so if there's a leak, if I manually stop the job and
>>         restart it from the UI multiple times, I won't see the issue
>>         because the classes are unloaded correctly?
>>
>>
>>         On Thu, Mar 31, 2022 at 9:20 AM huweihua
>>         <hu...@gmail.com> wrote:
>>
>>
>>             The difference is that manually canceling the job stops
>>             the JobMaster, but automatic failover keeps the JobMaster
>>             running. But looking on TaskManager, it doesn't make much
>>             difference
>>
>>
>>>             On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com>
>>>             wrote:
>>>
>>>             Also if I manually cancel and restart the same job over
>>>             and over is it the same as if Flink was restarting a job
>>>             due to failure?
>>>
>>>             I.e.: When I click "Cancel Job" on the UI is the job
>>>             completely unloaded vs when the job scheduler restarts a
>>>             job for whatever reason?
>>>
>>>             Like this I'll stop and restart the job a few times or
>>>             maybe I can trick my job to fail and have the scheduler
>>>             restart it. Ok let me think about this...
>>>
>>>             On Wed, Mar 30, 2022 at 10:24 AM 胡伟华
>>>             <hu...@gmail.com> wrote:
>>>
>>>>                 So if I run the same jobs in my dev env will I
>>>>                 still be able to see the similar dump?
>>>                 I think running the same job in dev should be
>>>                 reproducible, maybe you can have a try.
>>>
>>>>                 If not I would have to wait at a low volume time
>>>>                 to do it on production. Also if I recall the dump
>>>>                 is as big as the JVM memory right so if I have 10GB
>>>>                 configured for the JVM the dump will be a 10GB file?
>>>                 Yes, jmap will pause the JVM; the length of the pause
>>>                 depends on the size of the dump. You can use
>>>                 "jmap -dump:live" to dump only the reachable objects,
>>>                 which causes only a brief pause.
>>>
>>>
>>>
>>>>                 On Mar 30, 2022, at 9:47 PM, John Smith
>>>>                 <ja...@gmail.com> wrote:
>>>>
>>>>                 I have 3 task managers (see config below). There is
>>>>                 a total of 10 jobs with 25 slots being used.
>>>>                 The jobs are 100% ETL, i.e. they load JSON,
>>>>                 transform it and push it to JDBC; only 1 job of the
>>>>                 10 is pushing to an Apache Ignite cluster.
>>>>
>>>>                 For jmap: I know that it will pause the task
>>>>                 manager. So if I run the same jobs in my dev env
>>>>                 will I still be able to see the similar dump? I
>>>>                 assume so. If not I would have to wait at a low
>>>>                 volume time to do it on production. Also if I
>>>>                 recall the dump is as big as the JVM memory right
>>>>                 so if I have 10GB configured for the JVM the dump
>>>>                 will be a 10GB file?
>>>>
>>>>
>>>>                 # Operating system has 16GB total.
>>>>                 env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>
>>>>                 cluster.evenly-spread-out-slots: true
>>>>
>>>>                 taskmanager.memory.flink.size: 10240m
>>>>                 taskmanager.memory.jvm-metaspace.size: 2048m
>>>>                 taskmanager.numberOfTaskSlots: 16
>>>>                 parallelism.default: 1
>>>>
>>>>                 high-availability: zookeeper
>>>>                 high-availability.storageDir:
>>>>                 file:///mnt/flink/ha/flink_1_14/
>>>>                 high-availability.zookeeper.quorum: ...
>>>>                 high-availability.zookeeper.path.root: /flink_1_14
>>>>                 high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>
>>>>                 web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>
>>>>                 state.backend: rocksdb
>>>>                 state.backend.incremental: true
>>>>                 state.checkpoints.dir:
>>>>                 file:///mnt/flink/checkpoints/flink_1_14
>>>>                 state.savepoints.dir:
>>>>                 file:///mnt/flink/savepoints/flink_1_14
>>>>
>>>>                 On Wed, Mar 30, 2022 at 2:16 AM 胡伟华
>>>>                 <hu...@gmail.com> wrote:
>>>>
>>>>                     Hi, John
>>>>
>>>>                     Could you tell us your application scenario? Is
>>>>                     it a Flink session cluster with a lot of jobs?
>>>>
>>>>                     Maybe you can try to dump the memory with jmap
>>>>                     and use tools such as MAT to analyze whether
>>>>                     there are abnormal classes and classloaders
>>>>
>>>>
>>>>                     > On Mar 30, 2022, at 6:09 AM, John Smith
>>>>                     <ja...@gmail.com> wrote:
>>>>                     >
>>>>                     > Hi running 1.14.4
>>>>                     >
>>>>                     > My task manager still fails with
>>>>                     java.lang.OutOfMemoryError: Metaspace. The
>>>>                     metaspace out-of-memory error has occurred.
>>>>                     This can mean two things: either the job
>>>>                     requires a larger size of JVM metaspace to load
>>>>                     classes or there is a class loading leak.
>>>>                     >
>>>>                     > I have 2GB of metaspace configured
>>>>                     taskmanager.memory.jvm-metaspace.size: 2048m
>>>>                     >
>>>>                     > But the task nodes still fail.
>>>>                     >
>>>>                     > When looking at the UI metrics, the metaspace
>>>>                     starts low. Now I see 85% usage. It seems to be
>>>>                     a class loading leak at this point, how can we
>>>>                     debug this issue?
>>>>
>>>
>>
>

Re: How to debug Metaspace exception?

Posted by huweihua <hu...@gmail.com>.

Hi, John

Sorry for the late reply. You can use MAT[1] to analyze the dump file. Check whether there are too many loaded classes.

[1] https://www.eclipse.org/mat/
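
As a lightweight complement to a heap dump, the number of loaded classes and
the Metaspace usage can also be read from inside a JVM through the standard
java.lang.management beans. A minimal sketch, assuming a HotSpot JVM where
the memory pool is named "Metaspace"; MetaspaceSnapshot is a hypothetical
helper name, not part of Flink or MAT, and it only reports on the JVM it
runs in:

```java
import java.lang.management.ClassLoadingMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryPoolMXBean;
import java.lang.management.MemoryUsage;

/** Hypothetical helper for illustration; not part of Flink. */
final class MetaspaceSnapshot {

    private MetaspaceSnapshot() {}

    /** Number of classes currently loaded in this JVM. */
    static int loadedClassCount() {
        return ManagementFactory.getClassLoadingMXBean().getLoadedClassCount();
    }

    /** Bytes used by the pool named "Metaspace", or -1 if no such pool is found. */
    static long metaspaceUsedBytes() {
        for (MemoryPoolMXBean pool : ManagementFactory.getMemoryPoolMXBeans()) {
            // Assumption: HotSpot names this pool "Metaspace"; other JVMs may differ.
            if ("Metaspace".equals(pool.getName())) {
                MemoryUsage usage = pool.getUsage();
                return usage == null ? -1L : usage.getUsed();
            }
        }
        return -1L;
    }

    /** One-line summary of class loading counters and Metaspace usage. */
    static String describe() {
        ClassLoadingMXBean classes = ManagementFactory.getClassLoadingMXBean();
        return String.format(
                "loaded=%d totalLoaded=%d unloaded=%d metaspaceUsedBytes=%d",
                classes.getLoadedClassCount(),
                classes.getTotalLoadedClassCount(),
                classes.getUnloadedClassCount(),
                metaspaceUsedBytes());
    }

    public static void main(String[] args) {
        System.out.println(describe());
    }
}
```

If the loaded-class count keeps climbing across job restarts while the
unloaded count stays flat, that would be consistent with a class loading
leak; the heap dump in MAT is still needed to find what holds the class
loaders.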

> On Apr 18, 2022, at 9:55 PM, John Smith <ja...@gmail.com> wrote:
> 
> Hi, can anyone help with this? I never looked at a dump file before.
> 
> On Thu, Apr 14, 2022 at 11:59 AM John Smith <java.dev.mtl@gmail.com> wrote:
> Hi, so I have a dump file. What do I look for?
> 
> On Thu, Mar 31, 2022 at 3:28 PM John Smith <java.dev.mtl@gmail.com> wrote:
> Ok so if there's a leak, if I manually stop the job and restart it from the UI multiple times, I won't see the issue because the classes are unloaded correctly?
> 
> 
> On Thu, Mar 31, 2022 at 9:20 AM huweihua <huweihua.ckl@gmail.com> wrote:
> 
> The difference is that manually canceling the job stops the JobMaster, but automatic failover keeps the JobMaster running. But looking on TaskManager, it doesn't make much difference
> 
> 
>> On Mar 31, 2022, at 4:01 AM, John Smith <java.dev.mtl@gmail.com> wrote:
>> 
>> Also if I manually cancel and restart the same job over and over is it the same as if Flink was restarting a job due to failure?
>> 
>> I.e.: When I click "Cancel Job" on the UI is the job completely unloaded vs when the job scheduler restarts a job for whatever reason?
>> 
>> Like this I'll stop and restart the job a few times or maybe I can trick my job to fail and have the scheduler restart it. Ok let me think about this...
>> 
>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <huweihua.ckl@gmail.com> wrote:
>>> So if I run the same jobs in my dev env will I still be able to see the similar dump? 
>> I think running the same job in dev should be reproducible, maybe you can have a try.
>> 
>>> If not I would have to wait at a low volume time to do it on production. Also if I recall the dump is as big as the JVM memory right so if I have 10GB configured for the JVM the dump will be a 10GB file?
>> 
>> Yes, jmap will pause the JVM; the length of the pause depends on the size of the dump. You can use "jmap -dump:live" to dump only the reachable objects, which causes only a brief pause.
>> 
>> 
>> 
>>> On Mar 30, 2022, at 9:47 PM, John Smith <java.dev.mtl@gmail.com> wrote:
>>> 
>>> I have 3 task managers (see config below). There is a total of 10 jobs with 25 slots being used.
>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it to JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>> 
>>> For jmap: I know that it will pause the task manager. So if I run the same jobs in my dev env will I still be able to see the similar dump? I assume so. If not I would have to wait at a low volume time to do it on production. Also if I recall the dump is as big as the JVM memory right so if I have 10GB configured for the JVM the dump will be a 10GB file?
>>> 
>>> 
>>> # Operating system has 16GB total.
>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>> 
>>> cluster.evenly-spread-out-slots: true
>>> 
>>> taskmanager.memory.flink.size: 10240m
>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>> taskmanager.numberOfTaskSlots: 16
>>> parallelism.default: 1
>>> 
>>> high-availability: zookeeper
>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>> high-availability.zookeeper.quorum: ...
>>> high-availability.zookeeper.path.root: /flink_1_14
>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>> 
>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>> 
>>> state.backend: rocksdb
>>> state.backend.incremental: true
>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>> 
>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <huweihua.ckl@gmail.com> wrote:
>>> Hi, John
>>> 
>>> Could you tell us your application scenario? Is it a Flink session cluster with a lot of jobs?
>>> 
>>> Maybe you can try to dump the memory with jmap and use tools such as MAT to analyze whether there are abnormal classes and classloaders
>>> 
>>> 
>>> > On Mar 30, 2022, at 6:09 AM, John Smith <java.dev.mtl@gmail.com> wrote:
>>> > 
>>> > Hi running 1.14.4
>>> > 
>>> > My task manager still fails with java.lang.OutOfMemoryError: Metaspace. The metaspace out-of-memory error has occurred. This can mean two things: either the job requires a larger size of JVM metaspace to load classes or there is a class loading leak.
>>> > 
>>> > I have 2GB of metaspace configured taskmanager.memory.jvm-metaspace.size: 2048m
>>> > 
>>> > But the task nodes still fail.
>>> > 
>>> > When looking at the UI metrics, the metaspace starts low. Now I see 85% usage. It seems to be a class loading leak at this point, how can we debug this issue?
>>> 
>> 
>

Re: How to debug Metaspace exception?

Posted by John Smith <ja...@gmail.com>.

Hi, can anyone help with this? I have never looked at a dump file before.

On Thu, Apr 14, 2022 at 11:59 AM John Smith <ja...@gmail.com> wrote:

> Hi, so I have a dump file. What do I look for?
>
> On Thu, Mar 31, 2022 at 3:28 PM John Smith <ja...@gmail.com> wrote:
>
>> Ok so if there's a leak, if I manually stop the job and restart it from
>> the UI multiple times, I won't see the issue because the classes
>> are unloaded correctly?
>>
>>
>> On Thu, Mar 31, 2022 at 9:20 AM huweihua <hu...@gmail.com> wrote:
>>
>>>
>>> The difference is that manually canceling the job stops the JobMaster,
>>> but automatic failover keeps the JobMaster running. But looking on
>>> TaskManager, it doesn't make much difference
>>>
>>>
>>> On Mar 31, 2022, at 4:01 AM, John Smith <ja...@gmail.com> wrote:
>>>
>>> Also if I manually cancel and restart the same job over and over is it
>>> the same as if Flink was restarting a job due to failure?
>>>
>>> I.e.: When I click "Cancel Job" on the UI is the job completely unloaded
>>> vs when the job scheduler restarts a job for whatever reason?
>>>
>>> Like this I'll stop and restart the job a few times or maybe I can trick
>>> my job to fail and have the scheduler restart it. Ok let me think about
>>> this...
>>>
>>> On Wed, Mar 30, 2022 at 10:24 AM 胡伟华 <hu...@gmail.com> wrote:
>>>
>>>> So if I run the same jobs in my dev env will I still be able to see the
>>>> similar dump?
>>>>
>>>> I think running the same job in dev should be reproducible, maybe you
>>>> can have a try.
>>>>
>>>> If not I would have to wait at a low volume time to do it on
>>>> production. Also if I recall the dump is as big as the JVM memory right so
>>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>
>>>> Yes, jmap will pause the JVM; the length of the pause depends on the size
>>>> of the dump. You can use "jmap -dump:live" to dump only the reachable
>>>> objects, which causes only a brief pause.
>>>>
>>>>
>>>>
>>>> On Mar 30, 2022, at 9:47 PM, John Smith <ja...@gmail.com> wrote:
>>>>
>>>> I have 3 task managers (see config below). There is a total of 10 jobs
>>>> with 25 slots being used.
>>>> The jobs are 100% ETL, i.e. they load JSON, transform it and push it to
>>>> JDBC; only 1 job of the 10 is pushing to an Apache Ignite cluster.
>>>>
>>>> For jmap: I know that it will pause the task manager. So if I run the
>>>> same jobs in my dev env will I still be able to see the similar dump? I
>>>> assume so. If not I would have to wait at a low volume time to do it on
>>>> production. Also if I recall the dump is as big as the JVM memory right so
>>>> if I have 10GB configured for the JVM the dump will be a 10GB file?
>>>>
>>>>
>>>> # Operating system has 16GB total.
>>>> env.ssh.opts: -l flink -oStrictHostKeyChecking=no
>>>>
>>>> cluster.evenly-spread-out-slots: true
>>>>
>>>> taskmanager.memory.flink.size: 10240m
>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>> taskmanager.numberOfTaskSlots: 16
>>>> parallelism.default: 1
>>>>
>>>> high-availability: zookeeper
>>>> high-availability.storageDir: file:///mnt/flink/ha/flink_1_14/
>>>> high-availability.zookeeper.quorum: ...
>>>> high-availability.zookeeper.path.root: /flink_1_14
>>>> high-availability.cluster-id: /flink_1_14_cluster_0001
>>>>
>>>> web.upload.dir: /mnt/flink/uploads/flink_1_14
>>>>
>>>> state.backend: rocksdb
>>>> state.backend.incremental: true
>>>> state.checkpoints.dir: file:///mnt/flink/checkpoints/flink_1_14
>>>> state.savepoints.dir: file:///mnt/flink/savepoints/flink_1_14
>>>>
>>>> On Wed, Mar 30, 2022 at 2:16 AM 胡伟华 <hu...@gmail.com> wrote:
>>>>
>>>>> Hi, John
>>>>>
>>>>> Could you tell us your application scenario? Is it a Flink session
>>>>> cluster with a lot of jobs?
>>>>>
>>>>> Maybe you can try to dump the memory with jmap and use tools such as
>>>>> MAT to analyze whether there are abnormal classes and classloaders
>>>>>
>>>>>
>>>>> > On Mar 30, 2022, at 6:09 AM, John Smith <ja...@gmail.com> wrote:
>>>>> >
>>>>> > Hi running 1.14.4
>>>>> >
>>>>> > My task manager still fails with java.lang.OutOfMemoryError:
>>>>> Metaspace. The metaspace out-of-memory error has occurred. This can mean
>>>>> two things: either the job requires a larger size of JVM metaspace to load
>>>>> classes or there is a class loading leak.
>>>>> >
>>>>> > I have 2GB of metaspace configured
>>>>> taskmanager.memory.jvm-metaspace.size: 2048m
>>>>> >
>>>>> > But the task nodes still fail.
>>>>> >
>>>>> > When looking at the UI metrics, the metaspace starts low. Now I see
>>>>> 85% usage. It seems to be a class loading leak at this point, how can we
>>>>> debug this issue?
>>>>>
>>>>>
>>>>
>>>