Posted to issues@flink.apache.org by "Hangxiang Yu (Jira)" <ji...@apache.org> on 2024/04/02 02:10:00 UTC

[jira] [Commented] (FLINK-34975) FLIP-427: ForSt - Disaggregated State Store

    [ https://issues.apache.org/jira/browse/FLINK-34975?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17832991#comment-17832991 ] 

Hangxiang Yu commented on FLINK-34975:
--------------------------------------

Hi, [~kkrugler].

Thanks a lot for sharing the interesting topic. 

I just took a quick look and found some interesting techniques (e.g. leveraging unified C++ interfaces, io_uring for networking, an MTU resolver) that should be helpful when we optimize ForSt in the future.

I think we could consider it in the next milestone.


> FLIP-427: ForSt - Disaggregated State Store
> -------------------------------------------
>
>                 Key: FLINK-34975
>                 URL: https://issues.apache.org/jira/browse/FLINK-34975
>             Project: Flink
>          Issue Type: New Feature
>          Components: Runtime / State Backends
>            Reporter: Hangxiang Yu
>            Assignee: Hangxiang Yu
>            Priority: Major
>             Fix For: 2.0.0
>
>
> This is a sub-FLIP for disaggregated state management and its related work. Please read [FLIP-423|https://cwiki.apache.org/confluence/x/R4p3EQ] first for the whole story.
> As described in FLIP-423, embedded state backends on the local file system face some tough issues, especially when dealing with extremely large state:
>  # {*}Constraints of local disk space complicate the prediction of storage requirements, potentially leading to job failures{*}: Especially in cloud native deployment mode, pre-allocated local disks typically face strict capacity constraints, making it challenging to forecast the size requirements of job states. Over-provisioning disk space results in unnecessary resource overhead, while under-provisioning risks job failure due to insufficient space.
>  # *The tight coupling of compute and storage resources leads to underutilization and increased waste:* Jobs can generally be categorized as either CPU-intensive or IO-intensive. In a coupled architecture, CPU-intensive jobs leave a significant portion of storage resources underutilized, whereas IO-intensive jobs result in idle computing resources.
> By treating remote storage as the primary storage, all working state is maintained on the remote file system, which brings several advantages:
>  # *Remote storage such as S3 or HDFS typically offers elastic scalability, theoretically providing unlimited space.*
>  # *The allocation of remote storage resources can be optimized by reducing them for CPU-intensive jobs and augmenting them for IO-intensive jobs, thus enhancing overall resource utilization.*
>  # *This architecture facilitates a highly efficient and lightweight process for checkpointing, recovery, and rescaling through fast copy or simple move.*
> This FLIP aims to realize disaggregated state for our new key-value store named *ForSt*, which evolves from RocksDB and supports remote file systems. This lets Flink shed the disadvantages of the coupled state architecture and embrace scalable, flexible cloud-native storage.
> Please see [FLIP-427|https://cwiki.apache.org/confluence/x/T4p3EQ] for more details.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)