Posted to issues@flink.apache.org by "Chesnay Schepler (Jira)" <ji...@apache.org> on 2022/06/09 10:48:00 UTC
[jira] [Updated] (FLINK-27972) Race condition between task/savepoint notification failure

     [ https://issues.apache.org/jira/browse/FLINK-27972?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Chesnay Schepler updated FLINK-27972:
-------------------------------------
    Description: 
When a task throws an exception in notifyCheckpointComplete we send 2 messages to the JobManager:
1) we inform the CheckpointCoordinator about the failed savepoint
2) we inform the scheduler about the failed task.

Depending on the order in which these messages arrive, the adaptive scheduler behaves differently. If 1) arrives first, it properly informs the user about the created savepoint, which might contain uncommitted transactions; if 2) arrives first, it just restarts the job.

I'm not sure how big of an issue the latter case is, but it does invalidate FLINK-26923.

In any case we might want to consider having the StopWithSavepoint state wait until the savepoint future has failed before doing anything else.

  was:
When a task throws an exception in notifyCheckpointComplete we send 2 messages to the JobManager:
1) we inform the CheckpointCoordinator about the failed savepoint
2) we inform the scheduler about the failed task.

Depending on the order in which these messages arrive, the adaptive scheduler behaves differently. If 1) arrives first, it properly informs the user about the created savepoint, which might contain uncommitted transactions; if 2) arrives first, it just restarts the job.

I'm not sure how big of an issue the latter case is.

In any case we might want to consider having the StopWithSavepoint state wait until the savepoint future has failed before doing anything else.


> Race condition between task/savepoint notification failure
> ----------------------------------------------------------
>
>                 Key: FLINK-27972
>                 URL: https://issues.apache.org/jira/browse/FLINK-27972
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.0
>            Reporter: Chesnay Schepler
>            Priority: Major
>
> When a task throws an exception in notifyCheckpointComplete we send 2 messages to the JobManager:
> 1) we inform the CheckpointCoordinator about the failed savepoint
> 2) we inform the scheduler about the failed task.
> Depending on the order in which these messages arrive, the adaptive scheduler behaves differently. If 1) arrives first, it properly informs the user about the created savepoint, which might contain uncommitted transactions; if 2) arrives first, it just restarts the job.
> I'm not sure how big of an issue the latter case is, but it does invalidate FLINK-26923.
> In any case we might want to consider having the StopWithSavepoint state wait until the savepoint future has failed before doing anything else.
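
The suggestion in the last paragraph (have the StopWithSavepoint state wait for the savepoint future before reacting) could look roughly like the sketch below. This is a simplified, single-threaded model and not Flink's actual implementation: the class StopWithSavepointSketch, the Outcome enum, and the methods onTaskFailure, onSavepointDone and simulate are hypothetical names invented for illustration. The fallback to a restart when the savepoint succeeds but a task has failed is also an assumption. The only real API used is java.util.concurrent.CompletableFuture.

```java
import java.util.concurrent.CompletableFuture;

/** Hypothetical, simplified model of a stop-with-savepoint state; not Flink code. */
class StopWithSavepointSketch {

    enum Outcome { PENDING, REPORTED_SAVEPOINT_FAILURE, RESTARTED, FINISHED }

    // A task failure that arrived before the savepoint future completed.
    private Throwable bufferedTaskFailure;
    private Outcome outcome = Outcome.PENDING;

    StopWithSavepointSketch(CompletableFuture<String> savepointFuture) {
        // Message 1): the savepoint result is the only trigger for a decision.
        // Assumption: callbacks run on a single thread (as in a main-thread executor).
        savepointFuture.whenComplete((path, error) -> onSavepointDone(error));
    }

    /** Message 2): the scheduler is told about the failed task. */
    void onTaskFailure(Throwable cause) {
        if (outcome != Outcome.PENDING) {
            return; // the savepoint result already decided what happens
        }
        // Do not restart yet; remember the failure and wait for the savepoint future.
        bufferedTaskFailure = cause;
    }

    private void onSavepointDone(Throwable error) {
        if (error != null) {
            // Same reaction regardless of whether the task failure arrived first.
            outcome = Outcome.REPORTED_SAVEPOINT_FAILURE;
        } else if (bufferedTaskFailure != null) {
            // Assumption: a task failure with a successful savepoint leads to a restart.
            outcome = Outcome.RESTARTED;
        } else {
            outcome = Outcome.FINISHED;
        }
    }

    Outcome outcome() {
        return outcome;
    }

    /** Delivers the two messages in either order and returns the resulting outcome. */
    static Outcome simulate(boolean savepointFailureFirst) {
        CompletableFuture<String> savepoint = new CompletableFuture<>();
        StopWithSavepointSketch state = new StopWithSavepointSketch(savepoint);
        RuntimeException cause = new RuntimeException("notifyCheckpointComplete failed");
        if (savepointFailureFirst) {
            savepoint.completeExceptionally(cause);
            state.onTaskFailure(cause);
        } else {
            state.onTaskFailure(cause);
            savepoint.completeExceptionally(cause);
        }
        return state.outcome();
    }

    public static void main(String[] args) {
        System.out.println(simulate(true));
        System.out.println(simulate(false));
    }
}
```

In this model both arrival orders end in the same outcome, because the task failure is only buffered and the decision is made once the savepoint future completes.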



--
This message was sent by Atlassian Jira
(v8.20.7#820007)