Posted to user@flink.apache.org by John Smith <ja...@gmail.com> on 2020/02/24 23:49:43 UTC

MaxMetaspace default may be too low?

Hi, I just upgraded to 1.10 and started deploying my jobs. Eventually the
task nodes started shutting down with OutOfMemoryError: Metaspace.

I looked at the logs, and the task managers are started with:
-XX:MaxMetaspaceSize=100663296

So I configured: taskmanager.memory.jvm-metaspace.size: 256m

It seems to be OK for now. What are your thoughts? Should I try 512m, or
is that too much?
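
For reference, 100663296 bytes works out to 96m. The only change I made was
this one line in flink-conf.yaml (256m is just the value that happened to
work for my jobs, not a recommendation):

taskmanager.memory.jvm-metaspace.size: 256m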

Re: MaxMetaspace default may be too low?

Posted by Xintong Song <to...@gmail.com>.
I'm sorry that you had a bad experience with the migration and
configuration. I believe the change to limit the metaspace size is already
documented in various places, but maybe it's not obvious enough, which led
to your confusion. Let's continue the discussion of how to improve that in
the JIRA ticket you opened.

Regarding how most people run their jobs, it depends on various factors
and is therefore hard to describe. Narrowing it down to the metaspace
footprint, it really depends on how many classes are loaded, i.e., how many
libraries are used and classes defined in UDFs, and how many tasks from
different jobs co-exist in the same TM process.

According to our testing before the release, the current default value
works for all the e2e tests and for our test jobs with simple UDFs (without
custom libraries) in single-job clusters. We did observe problems with
large multi-slot TMs concurrently running different jobs. However, such
cases usually require changing several configuration options anyway
(process.size/flink.size, number of slots, etc.), and we think it makes
sense for metaspace to be one of them.
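
To make that concrete, a large multi-slot TM usually ends up with several
of these options tuned together in flink-conf.yaml, roughly along these
lines (the values below are only placeholders to show which knobs are
involved, not recommendations):

taskmanager.memory.process.size: 16g
taskmanager.numberOfTaskSlots: 8
taskmanager.memory.jvm-metaspace.size: 512m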

Thank you~

Xintong Song



Re: MaxMetaspace default may be too low?

Posted by John Smith <ja...@gmail.com>.
OK, maybe it can be documented?

So, just trying to understand: how do most people run their jobs? Do they
run fewer tasks, but tasks that use a lot of direct or mapped memory? Like
a small JVM heap but huge state outside the JVM?
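
Something like this is what I'm picturing, if I'm guessing the right keys
from the new memory model (purely a sketch, not a config I've tried):

taskmanager.memory.task.heap.size: 1g
taskmanager.memory.managed.size: 8g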

I also opened this issue:
https://issues.apache.org/jira/browse/FLINK-16278 so we can maybe get it
documented.

Re: MaxMetaspace default may be too low?

Posted by Xintong Song <to...@gmail.com>.
In that case, I think the default metaspace size is too small for your
setup. The default configuration is not intended for such large task
managers.

In Flink 1.8 we did not set the JVM '-XX:MaxMetaspaceSize' parameter,
which means you had an 'unlimited' metaspace. We changed that in Flink 1.10
to have stricter control over the overall memory usage of Flink processes.
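
In 1.10 the limit is derived from taskmanager.memory.jvm-metaspace.size,
which the startup scripts turn into the JVM flag you saw in your logs. For
example, your own override corresponds to (256m = 268435456 bytes):

taskmanager.memory.jvm-metaspace.size: 256m  ->  -XX:MaxMetaspaceSize=268435456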

Thank you~

Xintong Song



Re: MaxMetaspace default may be too low?

Posted by John Smith <ja...@gmail.com>.
I would also like to add that the same exact jobs were running perfectly
fine on Flink 1.8.

Re: MaxMetaspace default may be too low?

Posted by John Smith <ja...@gmail.com>.
Right after job execution. Basically as soon as I deployed a 5th job. At 4
jobs it was OK; at 5 jobs it would take 1-2 minutes max and the node would
just shut off.
So far, with MaxMetaspaceSize at 256m, it's been stable. My task nodes have
16GB, and the memory config is as follows:
taskmanager.memory.flink.size: 12g
taskmanager.memory.jvm-metaspace.size: 256m

100% of the jobs right now are ETL with checkpoints and NO state:
Kafka -----> JSON Transform -----> DB
or
Kafka -----> DB lookup (to a small local cache) -----> JSON Transform -----> Apache Ignite

None of the jobs are related.
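
By my rough math the TM process should still fit on the node, if I'm
reading the 1.10 memory model right (the ~1g overhead figure is just the
default cap as I understand it):

total process ≈ flink.size + jvm-metaspace + jvm-overhead
              ≈ 12g + 256m + ~1g
              ≈ 13.25g, which leaves headroom on a 16GB box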

Re: MaxMetaspace default may be too low?

Posted by Xintong Song <to...@gmail.com>.
Hi John,

The default metaspace size is intended to work for a major proportion of
jobs. We are aware that for some jobs that need to load lots of classes,
the default value might not be large enough. However, a larger default
value means that for other jobs, which do not load many classes, the
overall memory requirement might be unnecessarily high. (Imagine a task
manager with the default total memory of 1.5GB, where 512m of it is
reserved for metaspace.)

Another possible problem is a metaspace leak. When you say "eventually
task nodes started shutting down with OutOfMemoryError: Metaspace", does
this problem happen shortly after the job execution starts, or does it
happen after the job has been running for a while? Does the metaspace
footprint keep growing, or does it become stable after the initial growth?
If the metaspace keeps growing over time, that is usually an indicator of a
metaspace memory leak.
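
If it helps, a simple way to answer that is to watch the metaspace columns
reported by plain jstat from the JDK against the TaskManager process, for
example (the interval is arbitrary):

jstat -gc <taskmanager-pid> 10s

and keep an eye on the MU (metaspace used) and MC (metaspace capacity)
columns over time.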

Thank you~

Xintong Song


