You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@flink.apache.org by "Zhu Zhu (Jira)" <ji...@apache.org> on 2019/10/11 10:13:00 UTC

[jira] [Created] (FLINK-14375) Avoid to trigger failover on a non-effective task failure notification

Zhu Zhu created FLINK-14375:
-------------------------------

             Summary: Avoid to trigger failover on a non-effective task failure notification
                 Key: FLINK-14375
                 URL: https://issues.apache.org/jira/browse/FLINK-14375
             Project: Flink
          Issue Type: Sub-task
          Components: Runtime / Coordination
    Affects Versions: 1.10.0
            Reporter: Zhu Zhu
             Fix For: 1.10.0


The DefaultScheduler triggers failover if a task is notified to be FAILED. However, in the case the multiple tasks in the same region fail together, it will trigger multiple failovers. The later triggered failovers are useless, lead to concurrent failovers and will increase the restart attempts count.

To avoid that, I'd propose to check that the effectiveness of a task failure to decide whether to trigger a failover, namely checking in {{DefaultScheduler#maybeHandleTaskFailure()}} to see whether the vertex state is really FAILED rather than CANCELED.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)