Posted to issues@flink.apache.org by "Kostas Kloudas (JIRA)" <ji...@apache.org> on 2019/07/31 08:49:00 UTC

[jira] [Closed] (FLINK-12858) Potential distributed deadlock in case of synchronous savepoint failure

     [ https://issues.apache.org/jira/browse/FLINK-12858?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Kostas Kloudas closed FLINK-12858.
----------------------------------
    Resolution: Fixed

Merged on master with 7b4d4b9114e6deff5d7c41925936665db53ecd8a
and on release-1.9 with 9407e1b4a56c853688d7395da448d7d107c6fb76

> Potential distributed deadlock in case of synchronous savepoint failure
> -----------------------------------------------------------------------
>
>                 Key: FLINK-12858
>                 URL: https://issues.apache.org/jira/browse/FLINK-12858
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.9.0
>            Reporter: Alex
>            Assignee: Alex
>            Priority: Blocker
>             Fix For: 1.9.0
>
>
> The current implementation of stop-with-savepoint (FLINK-11458) blocks the thread (on {{syncSavepointLatch}}) that executes {{StreamTask.performCheckpoint()}}. For non-source tasks, this thread is the task's main thread (stop-with-savepoint deliberately stops all activity in the task's main thread).
> Unlocking happens either when the task is cancelled or when the corresponding checkpoint is acknowledged.
> It is possible that other downstream tasks of the same Flink job "soft" fail the checkpoint/savepoint for various reasons (for example, because the buffered-bytes limit is exceeded in {{BarrierBuffer.checkSizeLimit()}}). In that case, the checkpoint abort is reported to the JM. However, the checkpoint coordinator handles such an abort as usual and assumes that the Flink job continues running, so the blocked task is never unblocked.
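The deadlock described above can be sketched with a small, self-contained simulation. This is an editorial illustration, not Flink code: the latch stands in for {{syncSavepointLatch}}, and the empty decline handler mimics a checkpoint coordinator that aborts the pending savepoint without cancelling the blocked task.

```java
import java.util.concurrent.CountDownLatch;

public class SyncSavepointDeadlockSketch {
    // Stands in for syncSavepointLatch: released only on checkpoint
    // acknowledgement or task cancellation.
    static final CountDownLatch syncSavepointLatch = new CountDownLatch(1);

    // The task's main thread blocks after emitting the savepoint barrier,
    // as performCheckpoint() does for a synchronous savepoint.
    static Thread startTaskMainThread() {
        Thread t = new Thread(() -> {
            try {
                syncSavepointLatch.await(); // waits for ack or cancellation
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        t.start();
        return t;
    }

    // Coordinator-side handling of a declined checkpoint: the pending
    // savepoint is discarded, but nothing counts the latch down and
    // nothing cancels the task -- the main thread stays parked.
    static void onCheckpointDeclined() {
        // (intentionally empty: this is the bug being illustrated)
    }

    public static void main(String[] args) throws Exception {
        Thread taskMain = startTaskMainThread();
        onCheckpointDeclined();          // a downstream task "soft" fails
        taskMain.join(500);              // give the task half a second
        System.out.println(taskMain.isAlive() ? "DEADLOCKED" : "finished");
        syncSavepointLatch.countDown();  // clean up the demo thread
        taskMain.join();
    }
}
```

Running the sketch prints "DEADLOCKED": with neither an acknowledgement nor a cancellation path triggered, the simulated main thread never returns, which is the distributed-deadlock scenario the fix addresses.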



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)