
Posted to issues@flink.apache.org by "Fan Hong (Jira)" <ji...@apache.org> on 2023/04/14 09:50:00 UTC

[jira] [Updated] (FLINK-31809) Improve efficiency of ListStateWithCache#snapshotState

     [ https://issues.apache.org/jira/browse/FLINK-31809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Fan Hong updated FLINK-31809:
-----------------------------
    Description: 
In the current implementation of {{ListStateWithCache}}, the {{snapshotState}} function writes the full data to the file system every time, even if the stored data has not changed since initialization. This can result in high IO costs, especially when working with large data sets. Additionally, this method is called in the same thread as operators, which can negatively impact job efficiency.

Furthermore, when using local file systems, the full data is also written to Flink state storage, which doubles the costs.

To address these issues, an incremental snapshot approach should be considered to reduce IO and network costs.
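
As a rough illustration of the incremental idea, the sketch below keeps data in fixed-size segments and, on each snapshot, writes only the segments that contain records added since the previous snapshot. It is not Flink ML code: the class IncrementalSnapshotSketch, the SegmentStore interface, and the assumption that the cache is append-only are all hypothetical and chosen for illustration. A real change would also have to handle updates, clears, and restoring from a checkpoint.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical append-only cache that snapshots only changed segments. */
class IncrementalSnapshotSketch {

    /** Hypothetical stand-in for the file system that receives segment writes. */
    interface SegmentStore {
        void write(int segmentIndex, List<String> records);
    }

    private final List<List<String>> segments = new ArrayList<>();
    private final int segmentSize;
    private int totalRecords = 0;
    // Number of records already covered by an earlier snapshot.
    private int persistedRecords = 0;

    IncrementalSnapshotSketch(int segmentSize) {
        if (segmentSize <= 0) {
            throw new IllegalArgumentException("segmentSize must be positive");
        }
        this.segmentSize = segmentSize;
    }

    /** Appends a record, opening a new segment when the last one is full. */
    void add(String record) {
        if (segments.isEmpty()
                || segments.get(segments.size() - 1).size() == segmentSize) {
            segments.add(new ArrayList<>());
        }
        segments.get(segments.size() - 1).add(record);
        totalRecords++;
    }

    /**
     * Writes only the segments holding records added since the last snapshot
     * and returns how many segments were written. Full segments that were
     * already persisted are never rewritten; a partially filled segment is
     * rewritten only if it has grown.
     */
    int snapshot(SegmentStore store) {
        if (totalRecords == persistedRecords) {
            return 0; // nothing changed, so no IO at all
        }
        // First segment that contains at least one unpersisted record.
        int first = persistedRecords / segmentSize;
        int written = 0;
        for (int i = first; i < segments.size(); i++) {
            store.write(i, new ArrayList<>(segments.get(i)));
            written++;
        }
        persistedRecords = totalRecords;
        return written;
    }
}
```

With a segment size of 2 and three records added, the first snapshot writes two segments, an immediate second snapshot writes nothing, and adding a fourth record causes only the last segment to be rewritten.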

  was:
Current `ListStateWithCache#snapshotState` supports distributed file systems and local file systems. However, in both cases, full data is written to the filesystem (`dataCacheWriter.writeSegmentsToFiles()`) when `snapshotState` is called.
 
Moreover, when a local file system is used, full data is also written to Flink state storage right now, which doubles the costs.


> Improve efficiency of ListStateWithCache#snapshotState
> ------------------------------------------------------
>
>                 Key: FLINK-31809
>                 URL: https://issues.apache.org/jira/browse/FLINK-31809
>             Project: Flink
>          Issue Type: Improvement
>          Components: Library / Machine Learning
>            Reporter: Fan Hong
>            Priority: Major
>
> In the current implementation of {{ListStateWithCache}}, the {{snapshotState}} function writes the full data to the file system every time, even if the stored data has not changed since initialization. This can result in high IO costs, especially when working with large data sets. Additionally, this method is called in the same thread as operators, which can negatively impact job efficiency.
> Furthermore, when using local file systems, the full data is also written to Flink state storage, which doubles the costs.
> To address these issues, an incremental snapshot approach should be considered to reduce IO and network costs.


