Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2019/09/05 08:30:00 UTC

[jira] [Assigned] (FLINK-13962) Task state handles leak if the task fails before deploying

     [ https://issues.apache.org/jira/browse/FLINK-13962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann reassigned FLINK-13962:
-------------------------------------

    Assignee: Zhu Zhu

> Task state handles leak if the task fails before deploying
> ----------------------------------------------------------
>
>                 Key: FLINK-13962
>                 URL: https://issues.apache.org/jira/browse/FLINK-13962
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0, 1.10.0
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Major
>
> Currently, the _taskRestore_ field of an _Execution_ is reset to null during the task deployment stage.
> The purpose, as stated in FLINK-9693, is that this "allows the JobManagerTaskRestore instance to be garbage collected. Furthermore, it won't be archived along with the Execution in the ExecutionVertex in case of a restart. This is especially important when setting state.backend.fs.memory-threshold to larger values because every state below this threshold will be stored in the meta state files and, thus, also the JobManagerTaskRestore instances."
>  
> However, if a task fails before it reaches the deployment stage (e.g. due to a slot allocation timeout), the _taskRestore_ field remains non-null and is archived with the prior executions.
> This can cause high JM heap usage in certain cases and lead to continuous JM full GCs.
>  
> I propose setting the _taskRestore_ field to null before moving an _Execution_ to the prior executions.
> We can keep the existing logic that sets the _taskRestore_ field to null after task deployment, since it allows the instance to be GC'ed earlier in the normal case.
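
The proposal in the description above can be sketched roughly as follows. This is a simplified, self-contained model and not Flink's actual code: the classes only borrow the names Execution and ExecutionVertex from the issue, TaskRestore is a stand-in for JobManagerTaskRestore, and the methods clearTaskRestore, holdsTaskRestore, resetForNewExecution and anyPriorExecutionHoldsRestore are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class TaskRestoreLeakSketch {

    /** Hypothetical stand-in for JobManagerTaskRestore; the byte array models state kept on the JM heap. */
    static final class TaskRestore {
        final byte[] inlinedState;

        TaskRestore(int sizeInBytes) {
            this.inlinedState = new byte[sizeInBytes];
        }
    }

    /** Simplified model of an Execution that only tracks its taskRestore reference. */
    static final class Execution {
        private TaskRestore taskRestore;

        Execution(TaskRestore taskRestore) {
            this.taskRestore = taskRestore;
        }

        /** Models the existing behavior: the field is cleared once deployment happens. */
        void deploy() {
            taskRestore = null;
        }

        /** Hypothetical helper for the proposed fix: drop the reference explicitly. */
        void clearTaskRestore() {
            taskRestore = null;
        }

        boolean holdsTaskRestore() {
            return taskRestore != null;
        }
    }

    /** Simplified model of an ExecutionVertex that archives prior executions on restart. */
    static final class ExecutionVertex {
        private final List<Execution> priorExecutions = new ArrayList<>();
        private Execution currentExecution;

        ExecutionVertex(TaskRestore initialRestore) {
            this.currentExecution = new Execution(initialRestore);
        }

        Execution getCurrentExecution() {
            return currentExecution;
        }

        /** Proposed ordering: clear taskRestore first, then archive the old execution. */
        void resetForNewExecution(TaskRestore restore) {
            currentExecution.clearTaskRestore();
            priorExecutions.add(currentExecution);
            currentExecution = new Execution(restore);
        }

        boolean anyPriorExecutionHoldsRestore() {
            for (Execution execution : priorExecutions) {
                if (execution.holdsTaskRestore()) {
                    return true;
                }
            }
            return false;
        }
    }

    public static void main(String[] args) {
        ExecutionVertex vertex = new ExecutionVertex(new TaskRestore(1024));

        // The first attempt fails before deploy() is ever called,
        // e.g. because of a slot allocation timeout.
        vertex.resetForNewExecution(new TaskRestore(1024));

        System.out.println(
                "prior executions retain restore state: "
                        + vertex.anyPriorExecutionHoldsRestore());
    }
}
```

In this model, clearing the field inside resetForNewExecution covers the failure-before-deployment path, while deploy() still releases the reference early in the normal case, as the description suggests keeping.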



--
This message was sent by Atlassian Jira
(v8.3.2#803003)