Posted to issues@flink.apache.org by "Piotr Nowojski (JIRA)" <ji...@apache.org> on 2019/06/14 07:09:01 UTC

[jira] [Comment Edited] (FLINK-8871) Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager

    [ https://issues.apache.org/jira/browse/FLINK-8871?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16863760#comment-16863760 ] 

Piotr Nowojski edited comment on FLINK-8871 at 6/14/19 7:08 AM:
----------------------------------------------------------------

Thanks for bringing this to my attention [~yunta]. Unfortunately, there are some more conflicts in the pipeline. Probably nothing that is completely incompatible, but we are working on the same classes. In order to finalize [~sunhaibotb]'s `TwoInputStreamSelectable`, I have to rework the `CheckpointBarrierHandler` implementation quite a bit. The two biggest sources of conflicts are:

# I'm splitting the existing code of {{BarrierBuffer}} into two separate classes - I'm moving the code that actually performs the alignment to a separate class, {{CheckpointBarrierAligner}}, leaving {{BarrierBuffer}} quite a thin class. This is in order to support two separate instances of {{BarrierBuffer}} for {{TwoInputStreamOperator}} - that means the code tracking alignment, blocking channels, etc. must be extracted to another class that will be shared by those two {{BarrierBuffer}} instances.
# I'm planning to deduplicate the code a bit by completely removing the {{BarrierTracker}} class and unifying it with {{BarrierBuffer}}. You can even see this problem in your PR, where you are implementing the handling of {{private final NavigableSet<Long> abortedCheckpointIds;}} twice, once in each of those two classes.
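
To illustrate the direction of the two points above, here is a rough sketch, not the actual Flink code. The class names {{SharedAlignerSketch}} and {{ThinBufferSketch}} and all of their methods are hypothetical stand-ins; only the {{abortedCheckpointIds}} field is taken from the PR. It assumes checkpoint ids only ever increase, so old ids can be pruned.

```java
import java.util.NavigableSet;
import java.util.TreeSet;

/** Hypothetical stand-in for the extracted alignment state (not Flink's real class). */
final class SharedAlignerSketch {
    // Single copy of the aborted-checkpoint bookkeeping, shared by all buffers.
    private final NavigableSet<Long> abortedCheckpointIds = new TreeSet<>();

    void markAborted(long checkpointId) {
        abortedCheckpointIds.add(checkpointId);
    }

    /** Returns true if a barrier of this checkpoint should still be processed. */
    boolean shouldProcessBarrier(long checkpointId) {
        return !abortedCheckpointIds.contains(checkpointId);
    }

    /** Drops ids strictly below the given one (assumes ids are increasing). */
    void pruneOlderThan(long checkpointId) {
        abortedCheckpointIds.headSet(checkpointId, false).clear();
    }

    int trackedAbortedCount() {
        return abortedCheckpointIds.size();
    }
}

/** Hypothetical thin per-input wrapper that only delegates to the shared state. */
final class ThinBufferSketch {
    private final SharedAlignerSketch aligner;

    ThinBufferSketch(SharedAlignerSketch aligner) {
        this.aligner = aligner;
    }

    void onCancellation(long checkpointId) {
        aligner.markAborted(checkpointId);
    }

    boolean onBarrier(long checkpointId) {
        return aligner.shouldProcessBarrier(checkpointId);
    }
}
```

With one shared instance, a cancellation seen on one input is visible to the other input as well, and the aborted-id handling exists only once.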


> Checkpoint cancellation is not propagated to stop checkpointing threads on the task manager
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-8871
>                 URL: https://issues.apache.org/jira/browse/FLINK-8871
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.3.2, 1.4.1, 1.5.0, 1.6.0, 1.7.0
>            Reporter: Stefan Richter
>            Assignee: Yun Tang
>            Priority: Critical
>              Labels: pull-request-available
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> Flink currently lacks any form of feedback mechanism from the job manager / checkpoint coordinator to the tasks when it comes to failing a checkpoint. This means that running snapshots on the tasks are not stopped even if their owning checkpoint has already been cancelled. Two examples of cases where this applies are checkpoint timeouts and local checkpoint failures on a task together with a configuration that does not fail tasks on checkpoint failure. Notice that those running snapshots no longer count toward the maximum number of parallel checkpoints, because their owning checkpoint is considered cancelled.
> Not stopping the task's snapshot thread can lead to a problematic situation where the next checkpoint has already started while the abandoned snapshot thread from a previous checkpoint is still running. This scenario can potentially cascade: many parallel checkpoints will slow down checkpointing and make timeouts even more likely.
>  
> A possible solution is introducing a {{cancelCheckpoint}} method as a counterpart to the {{triggerCheckpoint}} method in the task manager gateway, which is invoked by the checkpoint coordinator as part of cancelling the checkpoint.
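
The proposal quoted above could look roughly like the following sketch. This is only an illustration under stated assumptions: {{CheckpointGatewaySketch}} and {{InMemoryTaskSketch}} are hypothetical names, the signatures are not those of Flink's task manager gateway, and a never-completing {{CompletableFuture}} stands in for a real asynchronous snapshot.

```java
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical gateway with a cancel call mirroring the trigger call. */
interface CheckpointGatewaySketch {
    void triggerCheckpoint(long checkpointId);

    /** Returns true if a running snapshot for this checkpoint was stopped. */
    boolean cancelCheckpoint(long checkpointId);
}

/** Hypothetical task side that tracks running snapshots by checkpoint id. */
final class InMemoryTaskSketch implements CheckpointGatewaySketch {
    private final Map<Long, CompletableFuture<Void>> runningSnapshots =
            new ConcurrentHashMap<>();

    @Override
    public void triggerCheckpoint(long checkpointId) {
        // Stand-in for starting an asynchronous snapshot; it never completes on its own.
        runningSnapshots.put(checkpointId, new CompletableFuture<>());
    }

    @Override
    public boolean cancelCheckpoint(long checkpointId) {
        // Remove first so a repeated cancel for the same id is a no-op.
        CompletableFuture<Void> snapshot = runningSnapshots.remove(checkpointId);
        return snapshot != null && snapshot.cancel(true);
    }

    int runningSnapshotCount() {
        return runningSnapshots.size();
    }
}
```

In this sketch the coordinator would call {{cancelCheckpoint}} when a checkpoint times out or is declined, so that an abandoned snapshot does not keep running next to the next checkpoint.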



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)