Posted to user@flink.apache.org by Trystan <en...@gmail.com> on 2022/04/20 15:49:51 UTC

RocksDB efficiency and keyby

Hello,

We have a job whose main purpose is to track whether or not we've
previously seen a particular event - that's it. If it's new, we save it to
an external database. If we've seen it, we block the write. There's a 3-day
TTL to manage the state size. The downstream db can tolerate duplicates
slipping through and will reject the write - we mainly use the state to
reduce writes.

We're starting to see some performance issues, even after adding 50%
capacity to the job. After some number of days/weeks, it eventually goes
into a constant backpressure situation. I'm wondering if there's something
we can do to improve efficiency.

1. According to the flamegraph, 60-70% of the time is spent in RocksDB.get
2. The state is just a ValueState<Boolean>. I assume this is the
smallest/most efficient state. The keyby is extremely high cardinality -
are we better off with a lower cardinality and a
MapState<String, Boolean>.contains() check? (A sketch of the current
ValueState approach follows this list.)
3. Current configs: taskmanager.memory.process.size:
4g, taskmanager.memory.managed.fraction: 0.8 (increased from 0.6, didn't
see much change)
4. Estimated num keys tops out somewhere around 9-10B. Estimated live data
size somewhere around 250 GB. Attempting to switch to heap state
immediately ran into OOM (parallelism: 120, 8gb memory each).
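
For concreteness, the setup in point 2 looks roughly like the following - a
KeyedProcessFunction holding a per-key Boolean flag with a 3-day TTL cleaned
up through the RocksDB compaction filter. The class and state names here are
illustrative placeholders, not the actual job code:

import org.apache.flink.api.common.state.StateTtlConfig;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.KeyedProcessFunction;
import org.apache.flink.util.Collector;

// Emits an element only the first time its key is seen within the TTL window;
// everything else is dropped so the external DB never receives the duplicate write.
public class SeenFilter<T> extends KeyedProcessFunction<String, T, T> {

    private transient ValueState<Boolean> seen;

    @Override
    public void open(Configuration parameters) {
        StateTtlConfig ttl = StateTtlConfig
                .newBuilder(Time.days(3))            // 3-day TTL, as in the job above
                .cleanupInRocksdbCompactFilter(1000) // purge expired entries during RocksDB compaction
                .build();
        ValueStateDescriptor<Boolean> desc =
                new ValueStateDescriptor<>("seen", Types.BOOLEAN);
        desc.enableTimeToLive(ttl);
        seen = getRuntimeContext().getState(desc);
    }

    @Override
    public void processElement(T event, Context ctx, Collector<T> out) throws Exception {
        if (seen.value() == null) {   // first occurrence of this key (or the old entry expired)
            seen.update(Boolean.TRUE);
            out.collect(event);       // forward to the external database sink
        }
        // already seen: drop the event, skipping the write
    }
}

A MapState<String, Boolean> variant would swap getState(...) for
getRuntimeContext().getMapState(...) and the value() check for contains(...).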

And perhaps the answer is just "scale out" :) but if there are signals
that tell us when we've reached the limit of the current scale, it'd be
great to know what to look for!

Thanks!
Trystan

Re: RocksDB efficiency and keyby

Posted by Yun Tang <my...@live.com>.
Hi Trystan,

You can use async-profiler [1] to capture the CPU call stack inside RocksDB and see what is happening. If the call stack is dominated by loading index or filter blocks, you can try enabling partitioned index & filters [2].
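
If you try it, a minimal flink-conf.yaml sketch would be the following (note it only takes effect while RocksDB memory is managed by Flink, which is the default):

# keep RocksDB memory under Flink's managed-memory budget (the default setting)
state.backend.rocksdb.memory.managed: true
# partition index & filter blocks so only the small top-level index has to stay
# resident; the partitions are loaded into the block cache on demand
state.backend.rocksdb.memory.partitioned-index-filters: true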

[1] https://github.com/jvm-profiling-tools/async-profiler
[2] https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#state-backend-rocksdb-memory-partitioned-index-filters

Best
Yun Tang


Re: RocksDB efficiency and keyby

Posted by Yaroslav Tkachenko <ya...@goldsky.io>.
Yep, I'd give local SSDs another try. EBS could be too slow in some use cases.


Re: RocksDB efficiency and keyby

Posted by Trystan <en...@gmail.com>.
Thanks for the info! We're running EBS gp2 volumes... a while back we tested
local SSDs with a different job and didn't notice any gains, but that was
likely due to an under-optimized job where the bottleneck was elsewhere.


Re: RocksDB efficiency and keyby

Posted by Yaroslav Tkachenko <ya...@goldsky.io>.
Hey Trystan,

Based on my personal experience, good disk IO for RocksDB matters a lot.
Are you using the fastest SSD storage you can get for the RocksDB folders?

For example, when running on GCP, we noticed a *10x* throughput improvement
by switching RocksDB storage to
https://cloud.google.com/compute/docs/disks/local-ssd
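
If you do retest, the RocksDB working directory can be pointed at the local SSD in flink-conf.yaml; a minimal sketch (the mount path is just a placeholder - by default RocksDB falls back to the TaskManager temp directories):

# directory on the TaskManager where RocksDB keeps its working files;
# /mnt/local-ssd is a placeholder for an instance-local SSD mount
state.backend.rocksdb.localdir: /mnt/local-ssd/flink-rocksdb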
