Posted to issues@flink.apache.org by "Piotr Nowojski (Jira)" <ji...@apache.org> on 2020/05/27 15:16:00 UTC

[jira] [Closed] (FLINK-17351) CheckpointCoordinator and CheckpointFailureManager ignores checkpoint timeouts

     [ https://issues.apache.org/jira/browse/FLINK-17351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Piotr Nowojski closed FLINK-17351.
----------------------------------
    Release Note: Checkpoint timeouts will now be treated as normal checkpoint failures and checked against `setTolerableCheckpointFailureNumber(...)`.
      Resolution: Fixed

Merged commit 3ec768a into apache:master and f9641411ca into apache:release-1.11.

The fix will not be back-ported to previous releases, as it changes the semantics of the existing configuration options (setTolerableCheckpointFailureNumber, setFailTaskOnCheckpointError).
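For reference, a minimal sketch of how the affected settings are configured on the DataStream API (checkpoint interval, checkpoint timeout, and the tolerable failure count that timed-out checkpoints now count against); the concrete values are illustrative only:

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

// Trigger a checkpoint every 60 seconds.
env.enableCheckpointing(60_000L);

// A checkpoint that does not complete within 2 minutes is declared expired (timed out).
env.getCheckpointConfig().setCheckpointTimeout(120_000L);

// After this fix, expired checkpoints are also counted against this limit:
// the job fails once more than 3 consecutive checkpoints have failed.
env.getCheckpointConfig().setTolerableCheckpointFailureNumber(3);
{code}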

> CheckpointCoordinator and CheckpointFailureManager ignores checkpoint timeouts
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-17351
>                 URL: https://issues.apache.org/jira/browse/FLINK-17351
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.2, 1.10.0
>            Reporter: Piotr Nowojski
>            Assignee: Yuan Mei
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.11.0
>
>
> As described in point 2: https://issues.apache.org/jira/browse/FLINK-17327?focusedCommentId=17090576&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17090576
> (copy of description from above linked comment):
> The logic for how {{CheckpointCoordinator}} handles checkpoint timeouts is broken. In [~qinjunjerry]'s examples, the job should have failed after the first checkpoint failure, but checkpoints were timing out on the {{CheckpointCoordinator}} after 5 seconds, before {{FlinkKafkaProducer}} detected the Kafka failure after 2 minutes. Those timeouts were not checked against the {{setTolerableCheckpointFailureNumber(...)}} limit, so the job kept going despite many timed-out checkpoints. Eventually {{FlinkKafkaProducer}} detects the Kafka failure, and what happens next depends on where the failure is detected:
> a) while processing a record? No problem, the job fails over immediately once the failure is detected (in this example after 2 minutes).
> b) during a checkpoint? The failure is reported to the {{CheckpointCoordinator}} *and gets ignored, as the {{PendingCheckpoint}} was already discarded 2 minutes earlier*. So theoretically checkpoints can keep failing forever and the job will never restart automatically, unless something else fails.
> Even worse things can happen if we mix FLINK-17350 or b) with an intermittent external system failure: the sink reports an exception, the transaction is lost/aborted, and the sink is in a failed state, but if by a happy coincidence it still manages to accept further records, the exception can be lost and all of the records in those failed checkpoints will be lost forever as well. In the examples that [~qinjunjerry] posted this did not happen: {{FlinkKafkaProducer}} was not able to recover after the initial failure and kept throwing exceptions until the job finally failed (but much later than it should have). However, that behavior is not guaranteed anywhere.
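> For illustration, a minimal, hypothetical sketch (not the actual {{CheckpointFailureManager}} code) of the counting behaviour the fix introduces: timed-out checkpoints increment the same failure counter as any other checkpoint failure, and the job is failed once the tolerable limit is exceeded.
> {code:java}
> import java.util.concurrent.atomic.AtomicInteger;
>
> // Hypothetical, simplified stand-in for the failure-counting logic.
> class TolerableFailureCounter {
>     private final int tolerableFailures;          // value from setTolerableCheckpointFailureNumber(...)
>     private final AtomicInteger consecutiveFailures = new AtomicInteger();
>
>     TolerableFailureCounter(int tolerableFailures) {
>         this.tolerableFailures = tolerableFailures;
>     }
>
>     // Called for every failed checkpoint; after this fix, expired (timed-out)
>     // checkpoints are reported here as well instead of being silently dropped.
>     void onCheckpointFailure(Runnable failJob) {
>         if (consecutiveFailures.incrementAndGet() > tolerableFailures) {
>             failJob.run();
>         }
>     }
>
>     // A successfully completed checkpoint resets the counter.
>     void onCheckpointSuccess() {
>         consecutiveFailures.set(0);
>     }
> }
> {code}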



--
This message was sent by Atlassian Jira
(v8.3.4#803005)