Posted to issues@flink.apache.org by "Biao Liu (Jira)" <ji...@apache.org> on 2020/04/02 17:32:00 UTC
[jira] [Comment Edited] (FLINK-16931) Large _metadata file lead to JobManager not responding when restart

    [ https://issues.apache.org/jira/browse/FLINK-16931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17073927#comment-17073927 ] 

Biao Liu edited comment on FLINK-16931 at 4/2/20, 5:31 PM:
-----------------------------------------------------------

[~trohrmann], my pleasure :)

I can share some context here. We discussed this scenario before refactoring the whole threading model of {{CheckpointCoordinator}}; see FLINK-13497 and FLINK-13698. Although this scenario is not the cause of FLINK-13497, we thought it carried a risk of heartbeat timeout. At that time, we decided to treat it as a follow-up issue. However, we haven't filed a ticket for it yet.

After FLINK-13698, most of the non-IO operations of {{CheckpointCoordinator}} are executed in the main thread executor, except for the initialization part, which causes this problem. One of the final goals is to move all IO operations of {{CheckpointCoordinator}} into the IO thread executor and execute everything else in the main thread executor. To achieve this, some synchronous operations must be refactored to be asynchronous. I think that's what we need to do here.
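
To illustrate the direction described above, here is a minimal sketch of the pattern using plain java.util.concurrent. It is not Flink code: the class name, {{readMetadataBlocking}}, {{restoreAsync}}, and the two executors are hypothetical stand-ins for the blocking DFS read, the restore step, the IO thread executor, and the main thread executor.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch only: all names here are hypothetical, not Flink's actual API.
class AsyncCheckpointRestoreSketch {

    // Stand-in for the blocking read of a large _metadata file from DFS.
    static String readMetadataBlocking(long checkpointId) {
        try {
            Thread.sleep(200); // simulate slow IO
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return "metadata-" + checkpointId;
    }

    // The blocking read runs on ioExecutor; only the cheap follow-up step
    // is handed back to mainThreadExecutor.
    static CompletableFuture<String> restoreAsync(
            long checkpointId, Executor ioExecutor, Executor mainThreadExecutor) {
        return CompletableFuture
                .supplyAsync(() -> readMetadataBlocking(checkpointId), ioExecutor)
                .thenApplyAsync(metadata -> "restored from " + metadata, mainThreadExecutor);
    }

    public static void main(String[] args) {
        ExecutorService io = Executors.newFixedThreadPool(2);
        // A single thread models the single-threaded main thread executor.
        ExecutorService mainThread = Executors.newSingleThreadExecutor();
        try {
            CompletableFuture<String> restored = restoreAsync(51L, io, mainThread);
            // While the read is in flight, the main thread can still serve
            // other work, such as answering a heartbeat.
            CompletableFuture<String> heartbeat =
                    CompletableFuture.supplyAsync(() -> "heartbeat answered", mainThread);
            System.out.println(heartbeat.join());
            System.out.println(restored.join());
        } finally {
            io.shutdown();
            mainThread.shutdown();
        }
    }
}
```

The point of the sketch is only that the slow read never occupies the single main thread; how the real restore path should be split is a separate design question.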


> Large _metadata file lead to JobManager not responding when restart
> -------------------------------------------------------------------
>
>                 Key: FLINK-16931
>                 URL: https://issues.apache.org/jira/browse/FLINK-16931
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing, Runtime / Coordination
>    Affects Versions: 1.9.2, 1.10.0, 1.11.0
>            Reporter: Lu Niu
>            Priority: Critical
>             Fix For: 1.11.0
>
>
> When the _metadata file is big, the JobManager can never recover from the checkpoint. It falls into a loop of fetch checkpoint -> JM timeout -> restart. Here is the related log: 
> {code:java}
>  2020-04-01 17:08:25,689 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Recovering checkpoints from ZooKeeper.
>  2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Found 3 checkpoints in ZooKeeper.
>  2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to fetch 3 checkpoints from storage.
>  2020-04-01 17:08:25,698 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 50.
>  2020-04-01 17:08:48,589 INFO org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore - Trying to retrieve checkpoint 51.
>  2020-04-01 17:09:12,775 INFO org.apache.flink.yarn.YarnResourceManager - The heartbeat of JobManager with id 02500708baf0bb976891c391afd3d7d5 timed out.
> {code}
> Digging into the code, it looks like ExecutionGraph::restart runs in the JobMaster main thread and eventually calls ZooKeeperCompletedCheckpointStore::retrieveCompletedCheckpoint, which downloads the file from DFS. The main thread is blocked while the download runs (about 23 seconds for checkpoint 50 in the log above). One possible solution is to make the downloading part async. More things might need to be considered, as the original change tried to make it single-threaded. [https://github.com/apache/flink/pull/7568]


