Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2016/06/13 09:44:21 UTC

[jira] [Commented] (FLINK-4046) Failing a restarting job can get stuck in JobStatus.FAILING

    [ https://issues.apache.org/jira/browse/FLINK-4046?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15327080#comment-15327080 ] 

ASF GitHub Bot commented on FLINK-4046:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/2095

    [FLINK-4046] [runtime] Add direct state transition from RESTARTING to FAILED

    A job can get stuck in `FAILING` if `fail` is called on a restarting job that has
    not yet reset its ExecutionJobVertices, because these vertices will not call
    `jobVertexInFinalState`. This method, however, must be called in order to transition
    from `FAILING` to `FAILED`. To solve the problem, this PR introduces a direct
    state transition from `RESTARTING` to `FAILED` if `fail` is called while the job is
    in state `RESTARTING`.
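
The direct transition described above could look roughly like the following. This is a minimal sketch, not the code in the pull request: the class `RestartingFailSketch`, its reduced `JobStatus` enum, and the compare-and-set loop are assumptions made for illustration only.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the job state machine (not Flink's ExecutionGraph).
class RestartingFailSketch {
    enum JobStatus { RUNNING, RESTARTING, FAILING, FAILED }

    private final AtomicReference<JobStatus> state;

    RestartingFailSketch(JobStatus initial) {
        this.state = new AtomicReference<>(initial);
    }

    JobStatus getState() {
        return state.get();
    }

    void fail() {
        while (true) {
            JobStatus current = state.get();
            if (current == JobStatus.FAILING || current == JobStatus.FAILED) {
                return; // nothing to do, the job is already failing or failed
            }
            if (current == JobStatus.RESTARTING) {
                // The vertices may already be in a final state, so no further
                // notification would arrive: go to FAILED directly.
                if (state.compareAndSet(current, JobStatus.FAILED)) {
                    return;
                }
            } else if (state.compareAndSet(current, JobStatus.FAILING)) {
                // A real implementation would now cancel the vertices and wait
                // for their final-state notifications before moving to FAILED.
                return;
            }
            // The state changed concurrently; re-read it and try again.
        }
    }

    public static void main(String[] args) {
        RestartingFailSketch job = new RestartingFailSketch(JobStatus.RESTARTING);
        job.fail();
        System.out.println(job.getState()); // prints FAILED
    }
}
```

In this model a job that is `RUNNING` still goes through `FAILING`, so only the `RESTARTING` case bypasses the wait for vertex notifications.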

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixFailWhileRestarting

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/2095.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #2095
    
----
commit 094c6b59eb92cb5a0f3bf41aa92aab399ba4127c
Author: Till Rohrmann <tr...@apache.org>
Date:   2016-06-09T13:54:01Z

    [FLINK-4046] [runtime] Add direct state transition from RESTARTING to FAILED
    
    A job can get stuck in FAILING if fail is called on a restarting job which has
    not yet reset its ExecutionJobVertices, because these vertices would not call
    jobVertexInFinalState. This method, however, must be called in order to transition
    from FAILING to FAILED.

----


> Failing a restarting job can get stuck in JobStatus.FAILING
> -----------------------------------------------------------
>
>                 Key: FLINK-4046
>                 URL: https://issues.apache.org/jira/browse/FLINK-4046
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Runtime
>    Affects Versions: 1.1.0
>            Reporter: Till Rohrmann
>             Fix For: 1.1.0
>
>
> When a job is in state {{RESTARTING}}, all of its {{ExecutionJobVertices}} may already be in a final state (if they have not yet been reset). Calling {{fail}} on this {{ExecutionGraph}} transitions the state to {{FAILING}} and calls cancel on all {{ExecutionJobVertices}}. The job state {{FAILING}} can only be left once all {{ExecutionJobVertices}} have reached a final state. The notification of this final state is only sent to the {{ExecutionGraph}} when all subtasks of an {{ExecutionJobVertex}} have transitioned to a final state. However, this won't happen because the {{ExecutionJobVertices}} are already in a final state. The result is that a job can get stuck in the state {{FAILING}} if {{fail}} is called on a {{RESTARTING}} job.
> I propose to add a direct transition from {{RESTARTING}} to {{FAILED}}, as is already the case for the {{cancel}} call (transition from {{RESTARTING}} to {{CANCELED}}).
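
The stuck state described in the issue can be reproduced with a small toy model. This is only a sketch under simplifying assumptions: `Graph`, `Vertex`, and the notification counter are hypothetical stand-ins, and each vertex is treated as a single unit rather than a set of subtasks. Only the method name `jobVertexInFinalState` is taken from the pull request text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical toy model of the reported behaviour (not Flink code).
class StuckInFailingSketch {
    enum JobStatus { RUNNING, RESTARTING, FAILING, FAILED }

    static class Graph {
        JobStatus state;
        final List<Vertex> vertices = new ArrayList<>();
        private int finalNotifications;

        Graph(JobStatus initial, int numVertices, boolean verticesAlreadyFinal) {
            this.state = initial;
            for (int i = 0; i < numVertices; i++) {
                vertices.add(new Vertex(this, verticesAlreadyFinal));
            }
        }

        // Behaviour before the fix: always go through FAILING and wait for notifications.
        void fail() {
            state = JobStatus.FAILING;
            finalNotifications = 0;
            for (Vertex v : vertices) {
                v.cancel();
            }
        }

        // FAILING is only left once every vertex has reported a final state.
        void jobVertexInFinalState() {
            finalNotifications++;
            if (state == JobStatus.FAILING && finalNotifications == vertices.size()) {
                state = JobStatus.FAILED;
            }
        }
    }

    static class Vertex {
        private final Graph graph;
        private boolean inFinalState;

        Vertex(Graph graph, boolean inFinalState) {
            this.graph = graph;
            this.inFinalState = inFinalState;
        }

        void cancel() {
            if (inFinalState) {
                return; // already final: no transition, hence no notification
            }
            inFinalState = true;
            graph.jobVertexInFinalState();
        }
    }

    public static void main(String[] args) {
        Graph restarting = new Graph(JobStatus.RESTARTING, 3, true);
        restarting.fail();
        System.out.println(restarting.state); // prints FAILING (stuck)

        Graph running = new Graph(JobStatus.RUNNING, 3, false);
        running.fail();
        System.out.println(running.state); // prints FAILED
    }
}
```

In the first case no vertex changes state during `cancel`, so no notification is ever sent and the model never leaves `FAILING`.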


