Posted to issues@flink.apache.org by "fanrui (Jira)" <ji...@apache.org> on 2020/07/03 02:28:00 UTC
[jira] [Commented] (FLINK-18473) Optimize RocksDB disk load balancing strategy

    [ https://issues.apache.org/jira/browse/FLINK-18473?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17150680#comment-17150680 ] 

fanrui commented on FLINK-18473:
--------------------------------

With a large number of samples, a random strategy also achieves load balancing. With a small number of samples, a global Round Robin strategy should work better.
Usually there are not many large-state jobs, so RocksDB disk selection is a small-sample case.
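
To put a number on the small-sample case, here is a minimal sketch that computes the chance of at least two RocksDB instances landing on the same disk. It assumes each instance picks one directory uniformly and independently, as new Random().nextInt(n) does. The class and method names are hypothetical and not part of Flink.

```java
/** Hypothetical helper for illustration only; not part of Flink. */
final class RandomPlacementCollision {
    private RandomPlacementCollision() {}

    /**
     * Probability that at least two of {@code instances} independent, uniform
     * picks land on the same one of {@code disks} disks.
     */
    static double collisionProbability(int disks, int instances) {
        if (disks <= 0 || instances < 0) {
            throw new IllegalArgumentException("disks must be > 0 and instances >= 0");
        }
        if (instances > disks) {
            return 1.0; // more instances than disks: a collision is certain
        }
        double allDistinct = 1.0;
        for (int i = 0; i < instances; i++) {
            // The i-th instance must avoid the i disks already taken.
            allDistinct *= (double) (disks - i) / disks;
        }
        return 1.0 - allDistinct;
    }
}
```

Under these assumptions, 2 instances on 10 disks collide with probability 0.1, and 4 instances on 10 disks collide with probability of about 0.496.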

In our production environment, a small number of disks were allocated to multiple large RocksDB instances, while the other disks were allocated to none. Disk IO became the bottleneck of the task. The global Round Robin strategy solved this problem in our production environment.
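
A minimal sketch of the global Round Robin idea, for discussion only. It is not Flink's actual implementation. The class name, method name, and the bound on the random start are assumptions made for illustration. The only elements taken from the issue are the JVM-wide static AtomicInteger, the random initial value, and the remainder by the number of directories.

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical helper for illustration only; not part of Flink. */
final class GlobalRoundRobinDirectorySelector {
    // One counter per JVM, shared by every state backend in the process.
    // The start is random rather than 0, as the issue proposes; the bound
    // (1 << 20) is an arbitrary choice for this sketch.
    private static final AtomicInteger NEXT_DIRECTORY =
            new AtomicInteger(ThreadLocalRandom.current().nextInt(1 << 20));

    private GlobalRoundRobinDirectorySelector() {}

    /** Returns the index of the directory the next RocksDB instance should use. */
    static int nextDirectoryIndex(int numDirectories) {
        if (numDirectories <= 0) {
            throw new IllegalArgumentException("numDirectories must be > 0");
        }
        // floorMod keeps the result non-negative even if the counter overflows.
        return Math.floorMod(NEXT_DIRECTORY.incrementAndGet(), numDirectories);
    }
}
```

With this sketch, any N consecutive requests over N directories receive N different indices, so instances created in the same process spread across all disks before any disk is reused.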

> Optimize RocksDB disk load balancing strategy
> ---------------------------------------------
>
>                 Key: FLINK-18473
>                 URL: https://issues.apache.org/jira/browse/FLINK-18473
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / State Backends
>    Affects Versions: 1.12.0
>            Reporter: fanrui
>            Priority: Major
>
> In general, big data servers have many disks. For large-state jobs, if multiple slots run on one TM, each slot creates its own RocksDB instance. We would like these RocksDB instances to use different disks to achieve load balancing.
> h3. The problem with the current load balancing strategy:
> Currently, when RocksDB is initialized, a random value nextDirectory is generated, bounded by the number of RocksDB dirs: [code link|https://github.com/apache/flink/blob/2d371eb5ac9a3e485d3665cb9a740c65e2ba2ac6/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/RocksDBStateBackend.java#L441]
> {code:java}
> nextDirectory = new Random().nextInt(initializedDbBasePaths.length);
> {code}
> Different slots create different RocksDBStateBackend objects, so each slot generates its own *nextDirectory*. Because the values are random, different slots may generate the same value. For example, if the RocksDB dir is configured with 10 disks and the *nextDirectory* generated by both slot0 and slot1 is 5, then slot0 and slot1 will use the same disk. That disk will be under heavy pressure while the other disks are under none.
> h3. Optimization ideas:
> *{{nextDirectory}}* should be shared across slots. Its initial value should not be fixed at 0; it is still generated randomly. But define *nextDirectory* as +_{{static AtomicInteger()}}_+ and execute +_{{nextDirectory.incrementAndGet()}}_+ every time a RocksDBKeyedStateBackend is requested.
> The remainder of {{nextDirectory}} divided by {{initializedDbBasePaths.length}} decides which disk to use.
> Is there any problem with the above ideas?
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)