Posted to user@flink.apache.org by Alexandre Montecucco <al...@grabtaxi.com> on 2022/02/25 12:14:36 UTC

Pods are OOMKilled with RocksDB backend after a few checkpoints

Hi all,

I am trying to reduce the memory usage of a Flink app.
There is about 25+ GB of state when persisted to a checkpoint/savepoint,
and a fair amount of short-lived objects, as incoming traffic is fairly
high. So far, I have 8 TMs with 20 GB each on Flink 1.12. I would like to
reduce the amount of memory I give them, since the state will keep
growing. I start my application from an existing savepoint.

Given that CPU is not really an issue, I switched to the RocksDB backend,
so that state is serialized and supposedly much more compact in memory.
I am setting taskmanager.memory.process.size=20000m and
taskmanager.memory.managed.size=6000m
(and have tried other values ranging from 3000m to 10000m).
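
For reference, this is roughly the configuration I am running, written as
flink-conf.yaml entries (I am paraphrasing from memory, so treat the exact
keys and values as illustrative rather than a copy of my config):

    state.backend: rocksdb
    state.backend.rocksdb.memory.managed: true
    taskmanager.memory.process.size: 20000m
    taskmanager.memory.managed.size: 6000m
    # checkpoint interval; I tested both 2 and 5 minutes
    execution.checkpointing.interval: 120s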

The issue I observed is that the task manager pod memory increases during
each checkpoint, and the 4th checkpoint fails because most of the pods are
OOMKilled. There is no Java exception in the logs, so I really suspect it
is simply RocksDB using more memory than it is allocated.
I explicitly set state.backend.rocksdb.memory.managed=true to be sure.
I tried checkpoint intervals of 2 minutes and 5 minutes, and it always
seems to fail during the 4th checkpoint.

I tried incremental checkpoints, and after 30 checkpoints there is no sign
of failure so far.

I tried with a few GB of overhead memory, but that only delays the issue a
bit longer.
The heap usage graph looks as expected in all cases: the heap goes back to
a few hundred MB after GC, as the only long-lived state is off-heap. The
Xmx heap is about 12 GB, but peak usage is at most 6 GB.


Am I misconfiguring anything that could explain the OOMKilled pods?

Also, what is the best single metric to monitor RocksDB memory usage? (I
tried estimate-live-data-size and size-all-mem-tables, but I am not fully
sure yet about their exact meaning.)
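
For completeness, I enabled those two native metrics with what I believe
are the corresponding flink-conf.yaml switches:

    state.backend.rocksdb.metrics.estimate-live-data-size: true
    state.backend.rocksdb.metrics.size-all-mem-tables: true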

Best,
Alex



Re: Pods are OOMKilled with RocksDB backend after a few checkpoints

Posted by Alexandre Montecucco <al...@grabtaxi.com>.
Hello Yun and David,
Thank you for your answers.
Yes, I am using the official image, and I configured it with extra
overhead compared to Flink's default values.
Also, the problem does not appear only with RocksDB: pod free memory keeps
going down with the heap backend too, so no matter how much overhead I
give, it will eventually crash.

I will move to Flink 1.14 (I was initially waiting for 1.15) and will try
using jemalloc + jeprof to understand more, and I will report my findings
here.

Thank you.
Best,
Alex


Re: Pods are OOMKilled with RocksDB backend after a few checkpoints

Posted by David Morávek <dm...@apache.org>.
As far as I remember, there were still some bits of RocksDB memory usage
we were not able to control in 1.12. I think this has been fixed in 1.14
(there was a fairly long discussion, because it required bumping the
RocksDB version, which introduced a small performance regression).

Can you try switching to 1.14 by any chance?

Best.
D.


Re: Pods are OOMKilled with RocksDB backend after a few checkpoints

Posted by Yun Tang <my...@live.com>.
Hi Alexandre,

Did you use Flink's official image? One possible reason is that you did not use the official image, and the image you used does not use the jemalloc or tcmalloc library to allocate memory at the OS level.
Flink switched to jemalloc as the default memory allocator starting from Flink 1.12.0 [1].

If you used the official image, you can use jemalloc + jeprof to debug the off-heap memory allocation and see what is leaking.
There are many write-ups about this approach, such as [2].

Last but not least, please make sure to leave enough overhead memory so that you can observe the unexpected additional memory usage.


[1] https://issues.apache.org/jira/browse/FLINK-19125
[2] https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-get-to-the-bottom-of-a-memory-leak/
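
To make that concrete, a rough sketch of what the profiling could look like on a TaskManager container. The exact jemalloc options, paths and process name are only examples, and the jemalloc build in your image must have profiling support compiled in:

    # check that jemalloc is actually the allocator used by the running TM
    grep -i jemalloc /proc/$(pgrep -f TaskManagerRunner)/maps

    # enable heap profiling before the TaskManager JVM starts
    export MALLOC_CONF="prof:true,lg_prof_interval:30,prof_prefix:/tmp/jeprof"

    # once some profiles have been dumped, turn them into a report
    jeprof --text $(which java) /tmp/jeprof.*.heap > /tmp/jeprof-report.txt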

Best
Yun Tang



Re: Pods are OOMKilled with RocksDB backend after a few checkpoints

Posted by Alexandre Montecucco <al...@grabtaxi.com>.
Hello Yun Tang,

Over the weekend I did a few more tests:
- RocksDB backend
- RocksDB backend with incremental checkpoints
- RocksDB backend, incremental checkpoints and heap timers
- heap backend
I also tried:
- increasing the overhead memory
- switching the HTTP library
- upgrading to the latest 1.12 patch release (1.12.7)
(the config toggles I used for these variants are sketched below)
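
Roughly, the backend variants map to these flink-conf.yaml toggles (I am
writing them from memory, so take the exact keys as approximate):

    # heap backend vs RocksDB backend
    state.backend: filesystem        # or: rocksdb
    # incremental checkpoints (RocksDB only)
    state.backend.incremental: true
    # keep timers on the heap instead of in RocksDB
    state.backend.rocksdb.timer-service.factory: HEAP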

In all scenarios, I have observed the same pattern:
- heap memory stays within a good range
- no JVM exception
- total K8s pod memory keeps increasing (very quickly with RocksDB, over a
few days with the heap backend) and they all eventually get OOMKilled

Because of this, I am really confused about what the root cause could be.
My current conclusions:
- the issue is in off-heap memory, otherwise I would get a JVM exception
- my app state and timers should not be the problem, otherwise I would get
an OutOfMemoryError when using the heap backend
- it feels like RocksDB is only a catalyst, as it also crashes with the
heap backend, just more slowly.

Heap backend (I am not fully sure why the pod free memory drops and
recovers during each checkpoint).
Pod free memory goes down over time.
[image: image.png]

RocksDB backend with incremental checkpoints. Heap usage is a lot lower,
as expected.
Pod free memory decreases even faster.
[image: image.png]

My app logic is rather simple (roughly Kafka -> flatMap -> keyBy -> keyed
process function with state and timers -> async operator making HTTP calls
-> Kafka; a stripped-down sketch is below), and this is the first time I
have observed this behaviour in a Flink app.
I am not sure what to try next or where to look.
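
The sketch below only shows the shape of the job; the Kafka setup, key
extraction, state, timer TTL and the HTTP call are simplified placeholders
rather than my real code:

    import java.util.Collections;
    import java.util.Properties;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;

    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.datastream.AsyncDataStream;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
    import org.apache.flink.streaming.api.functions.async.ResultFuture;
    import org.apache.flink.streaming.api.functions.async.RichAsyncFunction;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer;
    import org.apache.flink.util.Collector;

    public class PipelineSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.enableCheckpointing(120_000); // 2 min, one of the intervals I tested

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "kafka:9092"); // placeholder

            DataStream<String> enriched = env
                .addSource(new FlinkKafkaConsumer<>("events-in", new SimpleStringSchema(), props))
                .flatMap((FlatMapFunction<String, String>) (raw, out) -> out.collect(raw))
                .returns(String.class)
                .keyBy(value -> value.split(",")[0])   // key extraction is a placeholder
                .process(new StatefulFn());

            AsyncDataStream
                .unorderedWait(enriched, new HttpEnrichFn(), 5, TimeUnit.SECONDS, 100)
                .addSink(new FlinkKafkaProducer<>("events-out", new SimpleStringSchema(), props));

            env.execute("pipeline-sketch");
        }

        /** Keyed state plus a processing-time timer per key, similar in shape to my job. */
        static class StatefulFn extends KeyedProcessFunction<String, String, String> {
            private transient ValueState<String> lastSeen;

            @Override
            public void open(Configuration parameters) {
                lastSeen = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("last-seen", String.class));
            }

            @Override
            public void processElement(String value, Context ctx, Collector<String> out) throws Exception {
                lastSeen.update(value);
                // expire the key's state after 30 minutes of inactivity (placeholder TTL)
                ctx.timerService().registerProcessingTimeTimer(
                    ctx.timerService().currentProcessingTime() + 30 * 60 * 1000L);
                out.collect(value);
            }

            @Override
            public void onTimer(long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
                lastSeen.clear();
            }
        }

        /** Async operator making an HTTP call; the call itself is stubbed out here. */
        static class HttpEnrichFn extends RichAsyncFunction<String, String> {
            @Override
            public void asyncInvoke(String input, ResultFuture<String> resultFuture) {
                CompletableFuture
                    .supplyAsync(() -> input + ",enriched")   // stand-in for the real HTTP client call
                    .thenAccept(r -> resultFuture.complete(Collections.singleton(r)));
            }
        }
    }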


Re: Pods are OOMKilled with RocksDB backend after a few checkpoints

Posted by Yun Tang <my...@live.com>.
Hi Alex,

Since the current default checkpoint policy of the RocksDB state backend is still a full snapshot, it actually writes data in the savepoint format.
Such a snapshot scans the whole RocksDB instance with iterators to write the data out, and some intermediate data is kept in memory.

I think you could use incremental checkpoints for the RocksDB state backend, which is also what we want to make the default checkpoint policy in Flink in the future.
For the overhead memory, you can configure taskmanager.memory.jvm-overhead.min<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#taskmanager-memory-jvm-overhead-min> and taskmanager.memory.jvm-overhead.max<https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#taskmanager-memory-jvm-overhead-max> [1] to limit the overhead memory. The task off-heap memory does not take effect for RocksDB.

If you want to watch the memory usage: since managed memory is enabled by default, all RocksDB instances within one slot share the same block cache [2], so you can try state.backend.rocksdb.metrics.block-cache-usage [3].
Please keep in mind that all RocksDB instances within one slot will report the same block cache usage.

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/memory/mem_setup_tm/#detailed-memory-model
[2] https://github.com/facebook/rocksdb/wiki/Write-Buffer-Manager#cost-memory-used-in-memtable-to-block-cache
[3] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#state-backend-rocksdb-metrics-block-cache-usage
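
A minimal flink-conf.yaml sketch of the settings above; the overhead values are only placeholders to adjust for your setup:

    # incremental checkpoints for the RocksDB state backend
    state.backend.incremental: true
    # explicitly bound the JVM overhead
    taskmanager.memory.jvm-overhead.min: 1g
    taskmanager.memory.jvm-overhead.max: 2g
    # expose the shared block cache usage as a metric
    state.backend.rocksdb.metrics.block-cache-usage: true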


Best
Yun Tang

