Posted to user@flink.apache.org by Kevin Lam <ke...@shopify.com> on 2021/09/30 15:37:30 UTC

RocksDB: Spike in Memory Usage Post Restart

Hi all,

We're debugging an issue with OOMs that occur on our jobs shortly after a
restore from checkpoint. Our application is running on Kubernetes and uses
RocksDB as its state backend.

We reproduced the issue on a small cluster of 2 task managers. If we kill a
single task manager, we notice that after restoring from checkpoint, the
untouched task manager has an elevated memory footprint (see the blue line
for the surviving task manager):

[image: image.png]
If we then kill the newest TM (yellow line) again, the surviving task
manager gets OOM-killed after the restore.

We looked at the OOMKiller Report and it seems that the memory is not
coming from the JVM but we're unsure of the source. It seems like something
is allocating native memory that the JVM is not aware of.

We're suspicious of RocksDB. Has anyone seen this kind of issue before? Is
it possible there's some kind of memory pressure or memory leak coming from
RocksDB that only presents itself when a job is restarted? Perhaps
something isn't cleaned up?

Any help would be appreciated.

Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Yun Tang <my...@live.com>.
Hi Yaroslav,

I don't think disabling the block cache is the correct way to manage the memory usage of RocksDB. Moreover, it cannot actually limit the memory usage.

RocksDB has two main parts of memory usage per column family: the write path, used by the write buffers, and the read path, used to cache index/filter/data blocks.
Apart from the memory model of RocksDB itself, Flink does not limit the number of states within one operator (which means it does not limit the number of column families in one RocksDB instance) and also allows one slot to contain several operators (which means it does not limit the number of RocksDB instances in one slot). For these reasons, disabling managed memory and the block cache is not the correct way to control memory. You might not hit the OOM problem in some jobs, but you cannot ensure that all jobs behave well.
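
If the goal is to bound RocksDB's memory, the supported way is to keep managed memory enabled, so that all RocksDB instances in a slot share one block cache and write buffer manager, or to give an explicit per-slot budget. A rough sketch (the keys are the standard Flink options, the values below are only illustrative):

state.backend.rocksdb.memory.managed: true
# or, alternatively, an explicit budget per slot:
# state.backend.rocksdb.memory.fixed-per-slot: 512mb
state.backend.rocksdb.memory.write-buffer-ratio: 0.5
state.backend.rocksdb.memory.high-prio-pool-ratio: 0.1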

I think we should still focus on finding what actually takes the additional memory. First of all, give yourself some additional headroom via taskmanager.memory.jvm-overhead.max and taskmanager.memory.jvm-overhead.min. Then use a re-built jemalloc with profiling enabled [1] and pass it into the container's running environment [2]. This debugging phase is common for native development; Apache Doris also has a guide for it, but using tcmalloc [3]. We can then analyze the prof call stacks to figure out what the native memory is used for.
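
For example, forwarding the jemalloc profiling settings into the container via the Flink configuration could look roughly like this (the keys come from [2]; the library path and profiling values are only illustrative, and the jemalloc build must have profiling enabled):

containerized.taskmanager.env.LD_PRELOAD: /usr/lib/x86_64-linux-gnu/libjemalloc.so
containerized.taskmanager.env.MALLOC_CONF: prof:true,lg_prof_interval:30,lg_prof_sample:17,prof_prefix:/tmp/jeprof.out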

[1] https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-get-to-the-bottom-of-a-memory-leak/
[2] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#forwarding-environment-variables
[3] https://doris.apache.org/master/en/developer-guide/debug-tool.html#memory

Best
Yun Tang

________________________________
From: Yaroslav Tkachenko <ya...@shopify.com>
Sent: Monday, October 11, 2021 3:04
To: Yun Tang <my...@live.com>
Cc: Ammon Diether <ad...@gmail.com>; Kevin Lam <ke...@shopify.com>; Fabian Paul <fa...@ververica.com>; user <us...@flink.apache.org>
Subject: Re: RocksDB: Spike in Memory Usage Post Restart

A quick update on this, we were able to fix the memory leak by disabling block cache in RocksDB with:

state.backend.rocksdb.options-factory: xxx.NoBlockCacheRocksDbOptionsFactory
state.backend.rocksdb.memory.managed: false

Where NoBlockCacheRocksDbOptionsFactory essentially does:

val blockBasedTableConfig = new BlockBasedTableConfig()
blockBasedTableConfig.setNoBlockCache(true)
// Needed in order to disable block cache
blockBasedTableConfig.setCacheIndexAndFilterBlocks(false)
blockBasedTableConfig.setCacheIndexAndFilterBlocksWithHighPriority(false)
blockBasedTableConfig.setPinL0FilterAndIndexBlocksInCache(false)

We did NOT see big performance degradation when running on SSDs.
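
For reference, the full factory is roughly the following. This is a sketch against Flink's RocksDBOptionsFactory interface; only the column family options are touched, and the class/package placement is our own:

import java.util.Collection

import org.apache.flink.contrib.streaming.state.RocksDBOptionsFactory
import org.rocksdb.{BlockBasedTableConfig, ColumnFamilyOptions, DBOptions}

class NoBlockCacheRocksDbOptionsFactory extends RocksDBOptionsFactory {

  // Leave the DBOptions that Flink has already prepared untouched.
  override def createDBOptions(
      currentOptions: DBOptions,
      handlesToClose: Collection[AutoCloseable]): DBOptions = currentOptions

  // Replace the table format config so that no block cache is created.
  override def createColumnOptions(
      currentOptions: ColumnFamilyOptions,
      handlesToClose: Collection[AutoCloseable]): ColumnFamilyOptions = {
    val blockBasedTableConfig = new BlockBasedTableConfig()
    blockBasedTableConfig.setNoBlockCache(true)
    // Needed in order to disable the block cache: nothing may be cached or pinned.
    blockBasedTableConfig.setCacheIndexAndFilterBlocks(false)
    blockBasedTableConfig.setCacheIndexAndFilterBlocksWithHighPriority(false)
    blockBasedTableConfig.setPinL0FilterAndIndexBlocksInCache(false)
    currentOptions.setTableFormatConfig(blockBasedTableConfig)
  }
}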

On Fri, Oct 8, 2021 at 3:41 AM Yun Tang <my...@live.com>> wrote:
Hi Kevin,

Sorry for jumping in late, as we were on vacation.

Since you already referred to the doc https://erikwramner.files.wordpress.com/2017/10/native-memory-leaks-in-java.pdf, have you figured out the offending native calls via jemalloc and jeprof?

From my experience, there are two general kinds of native memory leak (leaving aside stack or static memory, which should not consume much in general cases, and mmap'ed memory usage):

  1.  Malloc'ed memory is not freed in time, or is never freed. This can be diagnosed with jemalloc or tcmalloc by forwarding the related flags to Flink's container environment [1]. You can increase taskmanager.memory.jvm-overhead.max [2] and taskmanager.memory.jvm-overhead.min [3] to leave enough headroom while you figure out what occupies the memory. We have observed that unzipping configuration files too frequently can also consume a lot of native memory.
  2.  The native code has freed the memory, but the underlying memory allocator does not return it to the OS in time. The default allocator in glibc does not behave well here compared with jemalloc and tcmalloc, which is why we try to change the default memory allocator to jemalloc [4]. You can use 'pmap' or the cgroup info to check whether the running process has loaded jemalloc.so, i.e. whether the intended memory allocator is in use.

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#forwarding-environment-variables
[2] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#taskmanager-memory-jvm-overhead-max
[3] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#taskmanager-memory-jvm-overhead-min
[4] https://issues.apache.org/jira/browse/FLINK-19125

Best
Yun Tang






________________________________
From: Ammon Diether <ad...@gmail.com>>
Sent: Thursday, October 7, 2021 12:39
To: Kevin Lam <ke...@shopify.com>>
Cc: Fabian Paul <fa...@ververica.com>>; user <us...@flink.apache.org>>
Subject: Re: RocksDB: Spike in Memory Usage Post Restart

I don't mean to derail or take away from this thread, only to second that I am seeing the same behavior. We are using Flink Stateful Functions 3.0.0 on Flink 1.12.1 in a Kubernetes environment.

In the graph, a little after 15:00 a few of the task managers (5 of 64) were moved to a different node and restarted (a Kubernetes scale-down was the reason). The new task managers spun up, but the long-running task managers' memory unexpectedly goes up.
[image.png]

On Wed, Oct 6, 2021 at 9:52 AM Kevin Lam <ke...@shopify.com>> wrote:
Hi Fabian,

Yes I can tell you a bit more about the job we are seeing the problem with. I'll simplify things a bit but this captures the essence:

1. Input datastreams are from a few Kafka sources that we intend to join.
2. We wrap the datastreams we want to join into a common container class and key them on the join key.
3. Union and process the datastreams with a KeyedProcessFunction which holds the latest value seen for each source in ValueStates, and emits an output that is a function of the stored ValueStates each time a new value comes in.
4. We have to support arbitrarily late arriving data, so we don't window, and just keep everything in ValueState.
5. The state we want to support is very large, on the order of several TBs.

On Wed, Oct 6, 2021 at 10:50 AM Fabian Paul <fa...@ververica.com>> wrote:
Hi Kevin,

Since you are seeing the problem across multiple Flink versions, with both the default RocksDB configuration and a custom one, it might be related to something else. A lot of different components can allocate direct memory, e.g. some filesystem implementations, the connectors, or a user gRPC dependency.


Can you tell us a bit more about the job you are seeing the problem with?

Best,
Fabian


Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Yaroslav Tkachenko <ya...@shopify.com>.
A quick update on this, we were able to fix the memory leak by disabling
block cache in RocksDB with:

state.backend.rocksdb.options-factory: xxx.NoBlockCacheRocksDbOptionsFactory
state.backend.rocksdb.memory.managed: false

Where NoBlockCacheRocksDbOptionsFactory essentially does:

val blockBasedTableConfig = new BlockBasedTableConfig()
blockBasedTableConfig.setNoBlockCache(true)
// Needed in order to disable block cache
blockBasedTableConfig.setCacheIndexAndFilterBlocks(false)
blockBasedTableConfig.setCacheIndexAndFilterBlocksWithHighPriority(false)
blockBasedTableConfig.setPinL0FilterAndIndexBlocksInCache(false)

We did NOT see big performance degradation when running on SSDs.

On Fri, Oct 8, 2021 at 3:41 AM Yun Tang <my...@live.com> wrote:

> Hi Kevin,
>
> Sorry for jumping in late, as we were on vacation.
>
> Since you already referred to the doc
> https://erikwramner.files.wordpress.com/2017/10/native-memory-leaks-in-java.pdf,
> have you figured out the offending native calls via jemalloc and jeprof?
>
> From my experience, there are two general kinds of native memory leak
> (leaving aside stack or static memory, which should not consume much in
> general cases, and mmap'ed memory usage):
>
>    1. Malloc'ed memory is not freed in time, or is never freed. This can
>    be diagnosed with jemalloc or tcmalloc by forwarding the related flags
>    to Flink's container environment [1]. You can increase
>    taskmanager.memory.jvm-overhead.max [2]
>    and taskmanager.memory.jvm-overhead.min [3] to leave enough headroom
>    while you figure out what occupies the memory. We have observed that
>    unzipping configuration files too frequently can also consume a lot of
>    native memory.
>    2. The native code has freed the memory, but the underlying memory
>    allocator does not return it to the OS in time. The default allocator in
>    glibc does not behave well here compared with jemalloc and tcmalloc,
>    which is why we try to change the default memory allocator to jemalloc
>    [4]. You can use 'pmap' or the cgroup info to check whether the running
>    process has loaded jemalloc.so, i.e. whether the intended memory
>    allocator is in use.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#forwarding-environment-variables
>
> [2]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#taskmanager-memory-jvm-overhead-max
> [3]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#taskmanager-memory-jvm-overhead-min
> [4] https://issues.apache.org/jira/browse/FLINK-19125
>
> Best
> Yun Tang
>
>
>
>
>
>
> ------------------------------
> *From:* Ammon Diether <ad...@gmail.com>
> *Sent:* Thursday, October 7, 2021 12:39
> *To:* Kevin Lam <ke...@shopify.com>
> *Cc:* Fabian Paul <fa...@ververica.com>; user <us...@flink.apache.org>
> *Subject:* Re: RocksDB: Spike in Memory Usage Post Restart
>
> I don't mean to derail or take away from this thread, only to second that
> I am seeing the same behavior. We are using Flink Stateful Functions 3.0.0
> on Flink 1.12.1 in a Kubernetes environment.
>
> In the graph, a little after 15:00 a few of the task managers (5 of 64)
> were moved to a different node and restarted (a Kubernetes scale-down was
> the reason). The new task managers spun up, but the long-running task
> managers' memory unexpectedly goes up.
> [image: image.png]
>
> On Wed, Oct 6, 2021 at 9:52 AM Kevin Lam <ke...@shopify.com> wrote:
>
> Hi Fabian,
>
> Yes I can tell you a bit more about the job we are seeing the problem
> with. I'll simplify things a bit but this captures the essence:
>
> 1. Input datastreams are from a few kafka sources that we intend to join.
> 2. We wrap the datastreams we want to join into a common container class
> and key them on the join key.
> 3. Union and process the datastreams with a KeyedProcessFunction which
> holds the latest value seen for each source in ValueStates, and emits an
> output that is the function of the stored ValueStates each time a new value
> comes in.
> 4. We have to support arbitrarily late arriving data, so we don't window,
> and just keep everything in ValueState.
> 5. The state we want to support is very large, on the order of several
> TBs.
>
> On Wed, Oct 6, 2021 at 10:50 AM Fabian Paul <fa...@ververica.com>
> wrote:
>
> Hi Kevin,
>
> Since you are seeing the problem across multiple Flink versions and with
> the default RocksDb and custom configuration it might be related
>  to something else. A lot of different components can allocate direct
> memory i.e. some filesystem implementations, the connectors or some user
> grpc dependency.
>
>
> Can you tell us a bit more about the job you are seeing the problem with?
>
> Best,
> Fabian
>
>

Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Yun Tang <my...@live.com>.
Hi Kevin,

Sorry for jumping in late, as we were on vacation.

Since you already referred to the doc https://erikwramner.files.wordpress.com/2017/10/native-memory-leaks-in-java.pdf, have you figured out the offending native calls via jemalloc and jeprof?

From my experience, there are two general kinds of native memory leak (leaving aside stack or static memory, which should not consume much in general cases, and mmap'ed memory usage):

  1.  Malloc'ed memory is not freed in time, or is never freed. This can be diagnosed with jemalloc or tcmalloc by forwarding the related flags to Flink's container environment [1]. You can increase taskmanager.memory.jvm-overhead.max [2] and taskmanager.memory.jvm-overhead.min [3] to leave enough headroom while you figure out what occupies the memory. We have observed that unzipping configuration files too frequently can also consume a lot of native memory.
  2.  The native code has freed the memory, but the underlying memory allocator does not return it to the OS in time. The default allocator in glibc does not behave well here compared with jemalloc and tcmalloc, which is why we try to change the default memory allocator to jemalloc [4]. You can use 'pmap' or the cgroup info to check whether the running process has loaded jemalloc.so, i.e. whether the intended memory allocator is in use.

[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#forwarding-environment-variables
[2] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#taskmanager-memory-jvm-overhead-max
[3] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/config/#taskmanager-memory-jvm-overhead-min
[4] https://issues.apache.org/jira/browse/FLINK-19125

Best
Yun Tang






________________________________
From: Ammon Diether <ad...@gmail.com>
Sent: Thursday, October 7, 2021 12:39
To: Kevin Lam <ke...@shopify.com>
Cc: Fabian Paul <fa...@ververica.com>; user <us...@flink.apache.org>
Subject: Re: RocksDB: Spike in Memory Usage Post Restart

I don't mean to derail or take away from this thread, only to second that I am seeing the same behavior. We are using Flink Stateful Functions 3.0.0 on Flink 1.12.1 in a Kubernetes environment.

In the graph, a little after 15:00 a few of the task managers (5 of 64) were moved to a different node and restarted (a Kubernetes scale-down was the reason). The new task managers spun up, but the long-running task managers' memory unexpectedly goes up.
[image.png]

On Wed, Oct 6, 2021 at 9:52 AM Kevin Lam <ke...@shopify.com>> wrote:
Hi Fabian,

Yes I can tell you a bit more about the job we are seeing the problem with. I'll simplify things a bit but this captures the essence:

1. Input datastreams are from a few kafka sources that we intend to join.
2. We wrap the datastreams we want to join into a common container class and key them on the join key.
3. Union and process the datastreams with a KeyedProcessFunction which holds the latest value seen for each source in ValueStates, and emits an output that is the function of the stored ValueStates each time a new value comes in.
4. We have to support arbitrarily late arriving data, so we don't window, and just keep everything in ValueState.
5. The state we want to support is very large, on the order of several TBs.

On Wed, Oct 6, 2021 at 10:50 AM Fabian Paul <fa...@ververica.com>> wrote:
Hi Kevin,

Since you are seeing the problem across multiple Flink versions and with the default RocksDb and custom configuration it might be related
 to something else. A lot of different components can allocate direct memory i.e. some filesystem implementations, the connectors or some user grpc dependency.


Can you tell us a bit more about the job you are seeing the problem with?

Best,
Fabian


Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Yaroslav Tkachenko <ya...@shopify.com>.
And we also found this amazing issue
https://github.com/facebook/rocksdb/issues/4112 that makes me wonder why we
don't see more complaints :)

A similar issue that was closed, but not resolved
https://issues.apache.org/jira/browse/FLINK-21986

On Wed, Oct 6, 2021 at 9:39 PM Ammon Diether <ad...@gmail.com> wrote:

> I don't mean to derail or take away from this thread, only to second that
> I am seeing the same behavior. We are using Flink Stateful Functions 3.0.0
> on Flink 1.12.1 in a Kubernetes environment.
>
> In the graph, a little after 15:00 a few of the task managers (5 of 64)
> were moved to a different node and restarted (a Kubernetes scale-down was
> the reason). The new task managers spun up, but the long-running task
> managers' memory unexpectedly goes up.
> [image: image.png]
>
> On Wed, Oct 6, 2021 at 9:52 AM Kevin Lam <ke...@shopify.com> wrote:
>
>> Hi Fabian,
>>
>> Yes I can tell you a bit more about the job we are seeing the problem
>> with. I'll simplify things a bit but this captures the essence:
>>
>> 1. Input datastreams are from a few kafka sources that we intend to join.
>> 2. We wrap the datastreams we want to join into a common container class
>> and key them on the join key.
>> 3. Union and process the datastreams with a KeyedProcessFunction which
>> holds the latest value seen for each source in ValueStates, and emits an
>> output that is the function of the stored ValueStates each time a new value
>> comes in.
>> 4. We have to support arbitrarily late arriving data, so we don't window,
>> and just keep everything in ValueState.
>> 5. The state we want to support is very large, on the order of several
>> TBs.
>>
>> On Wed, Oct 6, 2021 at 10:50 AM Fabian Paul <fa...@ververica.com>
>> wrote:
>>
>>> Hi Kevin,
>>>
>>> Since you are seeing the problem across multiple Flink versions and with
>>> the default RocksDb and custom configuration it might be related
>>>  to something else. A lot of different components can allocate direct
>>> memory i.e. some filesystem implementations, the connectors or some user
>>> grpc dependency.
>>>
>>>
>>> Can you tell us a bit more about the job you are seeing the problem
>>> with?
>>>
>>> Best,
>>> Fabian
>>>
>>>

Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Ammon Diether <ad...@gmail.com>.
I don't mean to derail or take away from this thread, only to second that I
am seeing the same behavior. We are using Flink Stateful Functions 3.0.0 on
Flink 1.12.1 in a Kubernetes environment.

In the graph, a little after 15:00 a few of the task managers (5 of 64) were
moved to a different node and restarted (a Kubernetes scale-down was the
reason). The new task managers spun up, but the long-running task managers'
memory unexpectedly goes up.
[image: image.png]

On Wed, Oct 6, 2021 at 9:52 AM Kevin Lam <ke...@shopify.com> wrote:

> Hi Fabian,
>
> Yes I can tell you a bit more about the job we are seeing the problem
> with. I'll simplify things a bit but this captures the essence:
>
> 1. Input datastreams are from a few kafka sources that we intend to join.
> 2. We wrap the datastreams we want to join into a common container class
> and key them on the join key.
> 3. Union and process the datastreams with a KeyedProcessFunction which
> holds the latest value seen for each source in ValueStates, and emits an
> output that is the function of the stored ValueStates each time a new value
> comes in.
> 4. We have to support arbitrarily late arriving data, so we don't window,
> and just keep everything in ValueState.
> 5. The state we want to support is very large, on the order of several
> TBs.
>
> On Wed, Oct 6, 2021 at 10:50 AM Fabian Paul <fa...@ververica.com>
> wrote:
>
>> Hi Kevin,
>>
>> Since you are seeing the problem across multiple Flink versions and with
>> the default RocksDb and custom configuration it might be related
>>  to something else. A lot of different components can allocate direct
>> memory i.e. some filesystem implementations, the connectors or some user
>> grpc dependency.
>>
>>
>> Can you tell us a bit more about the job you are seeing the problem with?
>>
>> Best,
>> Fabian
>>
>>

Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Kevin Lam <ke...@shopify.com>.
Hi Fabian,

Yes I can tell you a bit more about the job we are seeing the problem with.
I'll simplify things a bit but this captures the essence:

1. Input datastreams are from a few Kafka sources that we intend to join.
2. We wrap the datastreams we want to join into a common container class
and key them on the join key.
3. Union and process the datastreams with a KeyedProcessFunction which
holds the latest value seen for each source in ValueStates, and emits an
output that is a function of the stored ValueStates each time a new value
comes in (a rough sketch follows after this list).
4. We have to support arbitrarily late arriving data, so we don't window,
and just keep everything in ValueState.
5. The state we want to support is very large, on the order of several TBs.
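
In rough pseudo-Scala, the KeyedProcessFunction from step 3 looks something
like this (the container/output types, field names, and the two-source
simplification are made up for illustration; only the shape matters):

import org.apache.flink.api.common.state.{ValueState, ValueStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.util.Collector

// Hypothetical container and output types, for illustration only.
case class Wrapped(source: String, joinKey: String, payload: String)
case class Joined(joinKey: String, a: Option[String], b: Option[String])

class LatestValueJoin extends KeyedProcessFunction[String, Wrapped, Joined] {

  // One ValueState per source, holding the latest value seen for the key.
  @transient private var latestA: ValueState[String] = _
  @transient private var latestB: ValueState[String] = _

  override def open(parameters: Configuration): Unit = {
    latestA = getRuntimeContext.getState(new ValueStateDescriptor("latest-a", classOf[String]))
    latestB = getRuntimeContext.getState(new ValueStateDescriptor("latest-b", classOf[String]))
  }

  override def processElement(
      value: Wrapped,
      ctx: KeyedProcessFunction[String, Wrapped, Joined]#Context,
      out: Collector[Joined]): Unit = {
    // Store the latest value for whichever source this record came from ...
    if (value.source == "a") latestA.update(value.payload) else latestB.update(value.payload)
    // ... and emit an output computed from whatever is currently stored.
    out.collect(Joined(value.joinKey, Option(latestA.value()), Option(latestB.value())))
  }
}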

On Wed, Oct 6, 2021 at 10:50 AM Fabian Paul <fa...@ververica.com>
wrote:

> Hi Kevin,
>
> Since you are seeing the problem across multiple Flink versions and with
> the default RocksDb and custom configuration it might be related
>  to something else. A lot of different components can allocate direct
> memory i.e. some filesystem implementations, the connectors or some user
> grpc dependency.
>
>
> Can you tell us a bit more about the job you are seeing the problem with?
>
> Best,
> Fabian
>
>

Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Fabian Paul <fa...@ververica.com>.
Hi Kevin,

Since you are seeing the problem across multiple Flink versions, with both the default RocksDB configuration and a custom one, it might be related to something else. A lot of different components can allocate direct memory, e.g. some filesystem implementations, the connectors, or a user gRPC dependency.


Can you tell us a bit more about the job you are seeing the problem with?

Best,
Fabian


Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Kevin Lam <ke...@shopify.com>.
Hi Fabian,

Thanks for collecting feedback. Here are the answers to your questions:

1. Yes, we enabled incremental checkpoints for our job by setting
`state.backend.incremental` to true. As for whether the checkpoint we
recover from is incremental or not, I'm not sure how to determine that.
It's whatever Flink does by default with incremental checkpoints enabled.

2. Yes this was on purpose, we had tuned our job to work well on SSDs. We
have also run jobs with those parameters unset and using defaults, and
still have the same OOM issues.

Thanks for the pointer; yes, we've been looking at the RocksDB metrics, but
they haven't pointed us to the issue yet.

On Wed, Oct 6, 2021 at 3:21 AM Fabian Paul <fa...@ververica.com> wrote:

> Hi Kevin,
>
> Sorry for the late reply. I collected some feedback from other folks and
> have two more questions.
>
> 1. Did you enable incremental checkpoints for your job and is the
> checkpoint you recover from incremental?
>
> 2. I saw in your configuration that you set
> `state.backend.rocksdb.block.cache-size` and
> `state.backend.rocksdb.predefined.options` by doing
>  so you overwrite the values Flink automatically sets. Can you confirm
> that this is on purpose? The value for block.cache-size seems to be very
> small.
>
> You can also enable the native RocksDB metrics [1] to get a more detailed
> view of the RocksDB memory consumption, but be careful because it may
> degrade the performance of your job.
>
> Best,
> Fabian
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics
>
>
>

Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Fabian Paul <fa...@ververica.com>.
Hi Kevin,

Sorry for the late reply. I collected some feedback from other folks and have two more questions.

1. Did you enable incremental checkpoints for your job and is the checkpoint you recover from incremental? 

2. I saw in your configuration that you set `state.backend.rocksdb.block.cache-size` and `state.backend.rocksdb.predefined-options`; by doing so you overwrite the values Flink automatically sets. Can you confirm that this is on purpose? The value for block.cache-size seems very small.

You can also enable the native RocksDB metrics [1] to get a more detailed view of the RocksDB memory consumption, but be careful because it may degrade the performance of your job.
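
For comparison, relying on Flink's automatic sizing would look roughly like this (a sketch; the fraction value is only illustrative):

state.backend: rocksdb
state.backend.rocksdb.memory.managed: true
taskmanager.memory.managed.fraction: 0.4
# no state.backend.rocksdb.block.cache-size / state.backend.rocksdb.predefined-options overrides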

Best,
Fabian

[1] https://nightlies.apache.org/flink/flink-docs-master/docs/deployment/config/#rocksdb-native-metrics



Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Kevin Lam <ke...@shopify.com>.
I was reading a bit about RocksDB and it seems the Java version is somewhat
particular about how it should be closed to ensure all resources are
released:

https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#memory-management

   - "Many of the Java Objects used in the RocksJava API will be backed by
   C++ objects for which the Java Objects have ownership. As C++ has no notion
   of automatic garbage collection for its heap in the way that Java does, we
   must explicitly free the memory used by the C++ objects when we are
   finished with them."

Column families also have a specific close procedure

https://github.com/facebook/rocksdb/wiki/RocksJava-Basics#opening-a-database-with-column-families

   - "It is important to note that when working with Column Families in
   RocksJava, there is a very specific order of destruction that must be
   obeyed for the database to correctly free all resources and shutdown."
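
As far as I understand it, the documented order looks roughly like this in
plain RocksJava (a sketch, not Flink code; the path and column family name
are made up):

import java.util.{ArrayList => JArrayList}

import org.rocksdb.{ColumnFamilyDescriptor, ColumnFamilyHandle, ColumnFamilyOptions, DBOptions, RocksDB}

object ColumnFamilyCloseOrder {
  def main(args: Array[String]): Unit = {
    RocksDB.loadLibrary()
    val cfOptions = new ColumnFamilyOptions()
    val descriptors = new JArrayList[ColumnFamilyDescriptor]()
    descriptors.add(new ColumnFamilyDescriptor(RocksDB.DEFAULT_COLUMN_FAMILY, cfOptions))
    descriptors.add(new ColumnFamilyDescriptor("my-state".getBytes, cfOptions))

    val handles = new JArrayList[ColumnFamilyHandle]()
    val dbOptions = new DBOptions().setCreateIfMissing(true).setCreateMissingColumnFamilies(true)
    val db = RocksDB.open(dbOptions, "/tmp/rocksdb-example", descriptors, handles)
    try {
      // ... read and write state ...
    } finally {
      // Column family handles must be closed before the database itself,
      // and the options objects after that, otherwise native memory leaks.
      handles.forEach(_.close())
      db.close()
      dbOptions.close()
      cfOptions.close()
    }
  }
}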

When a running job fails and a surviving TaskManager restores from
checkpoint, is the old embedded RocksDB instance being cleaned up properly?
I wasn't really sure where to look in the Flink source code to verify this.

On Mon, Oct 4, 2021 at 4:56 PM Kevin Lam <ke...@shopify.com> wrote:

> We tried with 1.14.0, unfortunately we still run into the issue. Any
> thoughts or suggestions?
>
> On Mon, Oct 4, 2021 at 9:09 AM Kevin Lam <ke...@shopify.com> wrote:
>
>> Hi Fabian,
>>
>> We're using our own image built from the official Flink docker image, so
>> we should have the code to use jemalloc in the docker entrypoint.
>>
>> I'm going to give 1.14 a try and will let you know how it goes.
>>
>> On Mon, Oct 4, 2021 at 8:29 AM Fabian Paul <fa...@ververica.com>
>> wrote:
>>
>>> Hi Kevin,
>>>
>>> We bumped the RocksDB version in Flink 1.14, which we expected to improve
>>> memory control [1]. In the past we also saw problems with the memory
>>> allocator used by the OS; we switched to jemalloc within our Docker images,
>>> which has better memory fragmentation behavior [2]. Are you using the
>>> official Flink docker image or did you build your own?
>>>
>>> I am also pulling in Yun Tang, who is more familiar with Flink’s state
>>> backend. Maybe he has an immediate idea about your problem.
>>>
>>> Best,
>>> Fabian
>>>
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-14482
>>> [2]
>>> https://lists.apache.org/thread.html/r596a19f8cf7278bcf9e30c3060cf00562677d4be072050444a5caf99%40%3Cdev.flink.apache.org%3E
>>>
>>>
>>>

Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Kevin Lam <ke...@shopify.com>.
We tried with 1.14.0, unfortunately we still run into the issue. Any
thoughts or suggestions?

On Mon, Oct 4, 2021 at 9:09 AM Kevin Lam <ke...@shopify.com> wrote:

> Hi Fabian,
>
> We're using our own image built from the official Flink docker image, so
> we should have the code to use jemalloc in the docker entrypoint.
>
> I'm going to give 1.14 a try and will let you know how it goes.
>
> On Mon, Oct 4, 2021 at 8:29 AM Fabian Paul <fa...@ververica.com>
> wrote:
>
>> Hi Kevin,
>>
>> We bumped the RocksDB version in Flink 1.14, which we expected to improve
>> memory control [1]. In the past we also saw problems with the memory
>> allocator used by the OS; we switched to jemalloc within our Docker images,
>> which has better memory fragmentation behavior [2]. Are you using the
>> official Flink docker image or did you build your own?
>>
>> I am also pulling in Yun Tang, who is more familiar with Flink’s state
>> backend. Maybe he has an immediate idea about your problem.
>>
>> Best,
>> Fabian
>>
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-14482
>> [2]
>> https://lists.apache.org/thread.html/r596a19f8cf7278bcf9e30c3060cf00562677d4be072050444a5caf99%40%3Cdev.flink.apache.org%3E
>>
>>
>>

Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Kevin Lam <ke...@shopify.com>.
Hi Fabian,

We're using our own image built from the official Flink docker image, so we
should have the code to use jemalloc in the docker entrypoint.

I'm going to give 1.14 a try and will let you know how it goes.

On Mon, Oct 4, 2021 at 8:29 AM Fabian Paul <fa...@ververica.com> wrote:

> Hi Kevin,
>
> We bumped the RocksDB version in Flink 1.14, which we expected to improve
> memory control [1]. In the past we also saw problems with the memory
> allocator used by the OS; we switched to jemalloc within our Docker images,
> which has better memory fragmentation behavior [2]. Are you using the
> official Flink docker image or did you build your own?
>
> I am also pulling in Yun Tang, who is more familiar with Flink’s state
> backend. Maybe he has an immediate idea about your problem.
>
> Best,
> Fabian
>
>
> [1] https://issues.apache.org/jira/browse/FLINK-14482
> [2]
> https://lists.apache.org/thread.html/r596a19f8cf7278bcf9e30c3060cf00562677d4be072050444a5caf99%40%3Cdev.flink.apache.org%3E
>
>
>

Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Fabian Paul <fa...@ververica.com>.
Hi Kevin,

We bumped the RocksDB version in Flink 1.14, which we expected to improve memory control [1]. In the past we also saw problems with the memory allocator used by the OS; we switched to jemalloc within our Docker images, which has better memory fragmentation behavior [2]. Are you using the official Flink docker image or did you build your own?

I am also pulling in Yun Tang, who is more familiar with Flink’s state backend. Maybe he has an immediate idea about your problem.

Best,
Fabian


[1] https://issues.apache.org/jira/browse/FLINK-14482
[2] https://lists.apache.org/thread.html/r596a19f8cf7278bcf9e30c3060cf00562677d4be072050444a5caf99%40%3Cdev.flink.apache.org%3E



Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Kevin Lam <ke...@shopify.com>.
Hi Fabian,

Thanks for your response.

Sure, let me tell you a bit more about the job.

   - Flink version 1.13.1 (I also tried 1.13.2 because I saw FLINK-22886
   <https://issues.apache.org/jira/browse/FLINK-22886>, but this didn't
   help)
   - We're running on Kubernetes in an application cluster.
   taskmanager.memory.process.size = 16GB, but we give our task manager pods a
   memory limit of 20GB. Our full config is below [0]

We've followed the steps at
https://erikwramner.files.wordpress.com/2017/10/native-memory-leaks-in-java.pdf
,
https://docs.oracle.com/javase/8/docs/technotes/guides/troubleshoot/tooldescr007.html,
and
https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-get-to-the-bottom-of-a-memory-leak/
to try to diagnose the issue, but this didn't give us much to go on.

Notably, we baselined the jcmd memory profile (jcmd $(pgrep java)
VM.native_memory baseline) and then ran a diff before and after the
post-restart memory spike, and nothing in there reflects the few GB of
usage increase.

What was added to Flink 1.14? What other issues have you seen in the past?

Also I came across
https://medium.com/expedia-group-tech/solving-a-native-memory-leak-71fe4b6f9463
when researching RocksDB. It suggests that unclosed RocksDB iterators can
be a source of memory leaks. Is there any chance iterators are being left
open after a job restart?
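
For context, the leak pattern described there is an iterator backed by
native memory that is never closed; in plain RocksJava the safe pattern is
roughly this (a sketch, not Flink code):

import org.rocksdb.{RocksDB, RocksIterator}

// Hypothetical helper: count all keys, making sure the native iterator is closed.
def countKeys(db: RocksDB): Long = {
  val iter: RocksIterator = db.newIterator()
  try {
    var count = 0L
    iter.seekToFirst()
    while (iter.isValid) {
      count += 1
      iter.next()
    }
    count
  } finally {
    // A RocksIterator holds native resources; forgetting close() leaks them
    // even after the Java object is garbage collected.
    iter.close()
  }
}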

[0]
```
jobmanager.memory.process.size: 16Gb

taskmanager.rpc.port: 6122
taskmanager.memory.process.size: 16Gb
taskmanager.memory.managed.fraction: 0.4
taskmanager.numberOfTaskSlots: 4

high-availability.storageDir: <redacted>
kubernetes.cluster-id: <redacted>
kubernetes.namespace: <redacted>
high-availability.jobmanager.port: 50010
high-availability:
org.apache.flink.kubernetes.highavailability.KubernetesHaServicesFactory
restart-strategy: exponential-delay
resourcemanager.taskmanager-registration.timeout: 30 min

blob.server.port: 6124
queryable-state.proxy.ports: 6125

heartbeat.interval: 60000
heartbeat.timeout: 120000

web.timeout: 1800000
rest.flamegraph.enabled: true

state.backend: rocksdb
state.checkpoints.dir: <redacted>
state.savepoints.dir: <redacted>

state.backend.rocksdb.localdir: /rocksdb
state.backend.incremental: true
state.backend.fs.memory-threshold: 1m
state.backend.rocksdb.thread.num: 4
state.backend.rocksdb.checkpoint.transfer.thread.num: 4
state.backend.rocksdb.block.blocksize: 16KB
state.backend.rocksdb.block.cache-size: 64MB
state.backend.rocksdb.predefined-options: FLASH_SSD_OPTIMIZED

jobmanager.execution.failover-strategy: region

metrics.scope.jm: flink.jobmanager
metrics.scope.jm.job: flink.jobmanager.job
metrics.scope.tm: flink.taskmanager
metrics.scope.tm.job: flink.taskmanager.job
metrics.scope.task: flink.task
metrics.scope.operator: flink.operator

state.backend.rocksdb.metrics.actual-delayed-write-rate: true
state.backend.rocksdb.metrics.background-errors: true
state.backend.rocksdb.metrics.block-cache-capacity: true
state.backend.rocksdb.metrics.block-cache-pinned-usage: true
state.backend.rocksdb.metrics.block-cache-usage: true
state.backend.rocksdb.metrics.compaction-pending: true
state.backend.rocksdb.metrics.cur-size-active-mem-table: true
state.backend.rocksdb.metrics.cur-size-all-mem-tables: true
state.backend.rocksdb.metrics.estimate-live-data-size: true
state.backend.rocksdb.metrics.estimate-num-keys: true
state.backend.rocksdb.metrics.estimate-pending-compaction-bytes: true
state.backend.rocksdb.metrics.estimate-table-readers-mem: true
state.backend.rocksdb.metrics.is-write-stopped: true
state.backend.rocksdb.metrics.mem-table-flush-pending: true
state.backend.rocksdb.metrics.num-deletes-active-mem-table: true
state.backend.rocksdb.metrics.num-deletes-imm-mem-tables: true
state.backend.rocksdb.metrics.num-entries-active-mem-table: true
state.backend.rocksdb.metrics.num-entries-imm-mem-tables: true
state.backend.rocksdb.metrics.num-immutable-mem-table: true
state.backend.rocksdb.metrics.num-live-versions: true
state.backend.rocksdb.metrics.num-running-compactions: true
state.backend.rocksdb.metrics.num-running-flushes: true
state.backend.rocksdb.metrics.num-snapshots: true
state.backend.rocksdb.metrics.size-all-mem-tables: true

env.java.opts: -Djavax.net.ssl.keyStore=/app/kafka/certs/certificate.jks
-Djavax.net.ssl.keyStorePassword=changeit -Dcom.sun.management.jmxremote
-Dcom.sun.management.jmxremote.authenticate=false
-Dcom.sun.management.jmxremote.ssl=false
-Dcom.sun.management.jmxremote.local.only=false
-Dcom.sun.management.jmxremote.port=1099
-Dcom.sun.management.jmxremote.rmi.port=1099
-Djava.rmi.server.hostname=127.0.0.1 -XX:NativeMemoryTracking=detail
env.java.opts.taskmanager: -Dtaskmanager.host=10.12.72.181
-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/rocksdb/memdump.hprof
-Djava.rmi.server.hostname=127.0.0.1 -XX:NativeMemoryTracking=detail
jobmanager.rpc.address: flink-jobmanager
query.server.port: 6125
```

On Fri, Oct 1, 2021 at 9:38 AM Fabian Paul <fa...@ververica.com> wrote:

> Hi Kevin,
>
> You are right, RocksDB is probably responsible for the memory consumption
> you are noticing. We have definitely seen similar issues in the past, and
> with the latest Flink version 1.14 we tried to restrict the RocksDB memory
> consumption even more to make it more controllable.
>
> Can you tell us a bit more about the job you are using and the respective
> Flink version? I would also be interested in what memory configuration you
> used on the Flink cluster, e.g. taskmanager.memory.process.size. You can
> also have a look at the following docs page [1] to fine-tune the memory
> consumption of your job.
>
> Please let me know if that helps.
>
> Best,
> Fabian
>
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/memory/mem_setup/#configure-total-memory
>
>
>

Re: RocksDB: Spike in Memory Usage Post Restart

Posted by Fabian Paul <fa...@ververica.com>.
Hi Kevin,

You are right, RocksDB is probably responsible for the memory consumption you are noticing. We have definitely seen similar issues in the past, and
with the latest Flink version 1.14 we tried to restrict the RocksDB memory consumption even more to make it more controllable.

Can you tell us a bit more about the job you are using and the respective Flink version? I would also be interested in what memory configuration you
used on the Flink cluster, e.g. taskmanager.memory.process.size. You can also have a look at the following docs page [1] to
fine-tune the memory consumption of your job.

Please let me know if that helps.

Best,
Fabian


[1] https://ci.apache.org/projects/flink/flink-docs-master/docs/deployment/memory/mem_setup/#configure-total-memory