Posted to issues@flink.apache.org by "Flink Jira Bot (Jira)" <ji...@apache.org> on 2022/07/06 10:41:00 UTC

[jira] [Updated] (FLINK-26394) CheckpointCoordinator.isTriggering can not be reset if a checkpoint expires while the checkpointCoordinator task is queuing in the SourceCoordinator executor.

     [ https://issues.apache.org/jira/browse/FLINK-26394?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Flink Jira Bot updated FLINK-26394:
-----------------------------------
    Labels: pull-request-available stale-assigned  (was: pull-request-available)

I am the [Flink Jira Bot|https://github.com/apache/flink-jira-bot/] and I help the community manage its development. I see this issue is assigned but has not received an update in 30 days, so it has been labeled "stale-assigned".
If you are still working on the issue, please remove the label and add a comment updating the community on your progress.  If this issue is waiting on feedback, please consider this a reminder to the committer/reviewer. Flink is a very active project, and so we appreciate your patience.
If you are no longer working on the issue, please unassign yourself so someone else may work on it.


> CheckpointCoordinator.isTriggering can not be reset if a checkpoint expires while the checkpointCoordinator task is queuing in the SourceCoordinator executor.
> --------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-26394
>                 URL: https://issues.apache.org/jira/browse/FLINK-26394
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.16.0
>            Reporter: Gen Luo
>            Assignee: Gen Luo
>            Priority: Major
>              Labels: pull-request-available, stale-assigned
>
> We found that a job could no longer trigger checkpoints or savepoints after recovering from a checkpoint timeout failure. After investigation, we found that the `isTriggering` flag in CheckpointCoordinator was true while no checkpoint was actually in progress. The root cause is as follows:
>  
>  # The job uses a source whose coordinator needs to scan a table while requesting splits, which may take more than 10 minutes. The source coordinator executor thread is occupied by `handleSplitRequest`, and the `checkpointCoordinator` task of the first checkpoint is queued behind it.
>  # Ten minutes later the checkpoint expires, removing the pending checkpoint from the coordinator and triggering a global failover. But `isTriggering` is not reset here. It can only be reset once the checkpoint completable future completes, and that future is now held only by the `checkpointCoordinator` task in the queue, along with the PendingCheckpoint.
>  # The job then fails over. RecreateOnResetOperatorCoordinator creates a new SourceCoordinator and closes the previous one asynchronously, with the closing timeout fixed at 60s. SourceCoordinator will `shutdown` the coordinator executor and then `awaitTermination`. If the queued tasks finish within 60s, nothing goes wrong.
>  # But if the closing is stuck for more than 60s (in this case it is actually stuck in `handleSplitRequest`), the async closing thread is interrupted and SourceCoordinator calls `shutdownNow` on the executor. All queued tasks are discarded, including the `checkpointCoordinator` task.
>  # As a result, the checkpoint completable future never completes and the `isTriggering` flag is never reset (see the sketch after this list).
>  
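> Below is a minimal, self-contained Java sketch of this sequence. The names are illustrative, not the actual Flink classes: a single-threaded executor stands in for the SourceCoordinator executor, the first task stands in for the blocked `handleSplitRequest`, and the queued task is the only code path that would ever complete the checkpoint future (i.e. reset `isTriggering`).
>
>     import java.util.List;
>     import java.util.concurrent.CompletableFuture;
>     import java.util.concurrent.ExecutorService;
>     import java.util.concurrent.Executors;
>     import java.util.concurrent.TimeUnit;
>
>     public class LostQueuedTaskSketch {
>         public static void main(String[] args) throws Exception {
>             ExecutorService coordinatorExecutor = Executors.newSingleThreadExecutor();
>             CompletableFuture<Void> checkpointFuture = new CompletableFuture<>();
>
>             // Stand-in for handleSplitRequest: occupies the single executor thread.
>             coordinatorExecutor.execute(() -> {
>                 try {
>                     Thread.sleep(Long.MAX_VALUE);
>                 } catch (InterruptedException e) {
>                     Thread.currentThread().interrupt();
>                 }
>             });
>
>             // Stand-in for the queued checkpointCoordinator task: the only place
>             // the checkpoint future would ever be completed.
>             coordinatorExecutor.execute(() -> checkpointFuture.complete(null));
>
>             // Stand-in for the close path after the 60s timeout: shutdownNow()
>             // interrupts the running task and returns the queued one without running it.
>             List<Runnable> discarded = coordinatorExecutor.shutdownNow();
>             coordinatorExecutor.awaitTermination(10, TimeUnit.SECONDS);
>
>             System.out.println("discarded tasks: " + discarded.size());                 // 1
>             System.out.println("checkpoint future done: " + checkpointFuture.isDone()); // false
>         }
>     }
>
> In this sketch the discarded Runnable holds the only path to completing the future, so `checkpointFuture` stays incomplete forever, which mirrors `isTriggering` never being reset.
>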
> I see that the closing part of SourceCoordinator was recently refactored, but the new implementation has the same issue. Since it calls `shutdownNow` directly, the issue should be even easier to encounter.
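>
> To illustrate why calling `shutdownNow` directly makes the problem easier to hit, here is a simplified sketch of the two close strategies described above (assumed structure, not the actual SourceCoordinator code; same `java.util.concurrent` imports as the sketch above):
>
>     // Previous behaviour (simplified): give queued tasks up to 60s to drain first.
>     static void closeWithGracePeriod(ExecutorService executor) throws InterruptedException {
>         executor.shutdown();                                  // queued tasks may still run
>         if (!executor.awaitTermination(60, TimeUnit.SECONDS)) {
>             executor.shutdownNow();                           // only now are queued tasks dropped
>         }
>     }
>
>     // Refactored behaviour (simplified): no grace period, so any queued task,
>     // including one that would complete a checkpoint future, is dropped immediately.
>     static void closeImmediately(ExecutorService executor) {
>         executor.shutdownNow();
>     }
>
> In both variants, once `shutdownNow` runs while the `checkpointCoordinator` task is still queued, the task is returned to the caller instead of being executed, and nothing is left to complete the checkpoint future.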



--
This message was sent by Atlassian Jira
(v8.20.10#820010)