Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/11/17 15:22:11 UTC

[jira] [Commented] (FLINK-3011) Cannot cancel failing/restarting streaming job from the command line

    [ https://issues.apache.org/jira/browse/FLINK-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15008719#comment-15008719 ] 

ASF GitHub Bot commented on FLINK-3011:
---------------------------------------

GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/1369

    [FLINK-3011, 3019, 3028] Cancel jobs in RESTARTING state

    This addresses issues with cancelling jobs that are in the `RESTARTING` state. A job enters this state after a failure, as soon as all job vertices are in their final state. It then stays in this state until it is redeployed (by default, currently after 100s). While in this state, the job cannot be cancelled. If the failure is permanent (for example, missing slots), the job can never be cancelled.
    
    This PR includes changes to the ExecutionGraph and to the clients:
    
    **ExecutionGraph** (FLINK-3011)
    - Remove the state transition from `FAILED` to `RESTARTING` in `restart()`. This was breaking the semantics of `FAILED` being a terminal state. It was only relevant for a test as far as I can tell.
    - When cancelling during restarts, two job states are relevant:
      - `RESTARTING`: try to set the state directly to `CANCELED`, because all vertices have already failed by the time the job enters the `RESTARTING` state. If the state transition to `CANCELED` succeeds, the pending restart is ignored with a log message.
      - `FAILING`: try to set the state to `CANCELLING` and wait for the vertices to finish failing. The cancellation then completes as usual in `jobVertexInFinalState()`.
    
    When reviewing, the `cancel()`, `jobVertexInFinalState()`, and `restart()` methods are the relevant ones.
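    
    The cancellation rules above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the `JobState` enum, the `RestartableJob` class, and the assumption that `restart()` moves the job from `RESTARTING` to `CREATED` are all hypothetical simplifications, and only the method names `cancel()`, `jobVertexInFinalState()`, and `restart()` come from the description.
    
    ```java
    import java.util.concurrent.atomic.AtomicReference;
    
    // Hypothetical, simplified model for illustration; not Flink's actual ExecutionGraph.
    enum JobState { CREATED, FAILING, RESTARTING, CANCELLING, CANCELED, FAILED }
    
    class RestartableJob {
        private final AtomicReference<JobState> state;
    
        RestartableJob(JobState initial) {
            this.state = new AtomicReference<>(initial);
        }
    
        JobState getState() {
            return state.get();
        }
    
        /** Returns true if the cancel request changed the job state. */
        boolean cancel() {
            while (true) {
                JobState current = state.get();
                if (current == JobState.RESTARTING) {
                    // All vertices have already failed, so go straight to the terminal state.
                    if (state.compareAndSet(current, JobState.CANCELED)) {
                        return true;
                    }
                } else if (current == JobState.FAILING) {
                    // Vertices are still failing; finish in jobVertexInFinalState().
                    if (state.compareAndSet(current, JobState.CANCELLING)) {
                        return true;
                    }
                } else {
                    // Other states are outside the scope of this sketch.
                    return false;
                }
                // The compare-and-set lost a race; re-read the state and retry.
            }
        }
    
        /** Assumed to be called once all vertices have reached a final state. */
        void jobVertexInFinalState() {
            if (!state.compareAndSet(JobState.CANCELLING, JobState.CANCELED)) {
                state.compareAndSet(JobState.FAILING, JobState.RESTARTING);
            }
        }
    
        /** Returns false (restart ignored) unless the job is still RESTARTING. */
        boolean restart() {
            // FAILED is terminal, so there is no FAILED -> RESTARTING transition here.
            return state.compareAndSet(JobState.RESTARTING, JobState.CREATED);
        }
    }
    ```
    
    In this sketch, a `cancel()` that wins the race against a scheduled `restart()` leaves the job in `CANCELED`, and the later `restart()` call returns `false`, which is where a log message about the ignored restart would go.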
    
    **CLIFrontend** (FLINK-3019)
    - List restarting jobs with scheduled jobs
    
    ```
    $ bin/flink list
    No running jobs.
    ---------------- Scheduled/Restarting Jobs -------------------
    17.11.2015 15:14:01 : 4b3fa06c88e5a2a4963241e7afca7b7d : Streaming WordCount (RESTARTING)
    --------------------------------------------------------------
    ```
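    
    The grouping shown above could be sketched as follows. This is only an illustration under stated assumptions: `ListedStatus`, `JobEntry`, and `JobListing` are hypothetical names, and treating `CREATED` as the "scheduled" status is an assumption rather than something this description states.
    
    ```java
    import java.util.ArrayList;
    import java.util.List;
    
    // Hypothetical types for illustration; not the actual CLI frontend classes.
    enum ListedStatus { CREATED, RUNNING, RESTARTING }
    
    class JobEntry {
        final String id;
        final String name;
        final ListedStatus status;
    
        JobEntry(String id, String name, ListedStatus status) {
            this.id = id;
            this.name = name;
            this.status = status;
        }
    }
    
    class JobListing {
        /** Jobs reported in the "running" section. */
        static List<JobEntry> running(List<JobEntry> jobs) {
            List<JobEntry> result = new ArrayList<>();
            for (JobEntry job : jobs) {
                if (job.status == ListedStatus.RUNNING) {
                    result.add(job);
                }
            }
            return result;
        }
    
        /** Jobs reported in the "Scheduled/Restarting" section. */
        static List<JobEntry> scheduledOrRestarting(List<JobEntry> jobs) {
            List<JobEntry> result = new ArrayList<>();
            for (JobEntry job : jobs) {
                // Assumption: CREATED stands in for "scheduled" in this sketch.
                if (job.status == ListedStatus.CREATED || job.status == ListedStatus.RESTARTING) {
                    result.add(job);
                }
            }
            return result;
        }
    }
    ```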
    
    **WebFrontend** (FLINK-3028)
    - Show the cancel button if the job is restarting. It was only displayed for running or created jobs before.
    
    ---
    
    I want to merge this for 0.10.1 and 1.0.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink 3011-restart

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1369.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1369
    
----
commit 0c5a3306808bec5b9a833703adbcd9f45bbe6de5
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-11-16T15:18:20Z

    [FLINK-3011] [runtime] Disallow ExecutionGraph state transition from FAILED to RESTARTING
    
    Removes the possibility to go from FAILED state back to RESTARTING. This was only used in a test
    case. It was breaking the terminal state semantics of the FAILED state.

commit 19c602b2ce7686237d8611645a4662aa2b2a0cef
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-11-17T10:40:54Z

    [FLINK-3011] [runtime, tests] Translate ExecutionGraphRestartTest to Java

commit e13dd1bac7029af6ae4157af226131a10f5d02d0
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-11-17T10:56:42Z

    [FLINK-3011] [runtime] Fix cancel during restart

commit 657e34f31fe9c6325900f42c36257b5c5d2019be
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-11-17T13:11:44Z

    [FLINK-3019] [client] List restarting jobs with scheduled jobs

commit 8b2850610aff1197d204bdb7d790df8fb6b5df4c
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-11-17T13:51:15Z

    [FLINK-3028] [runtime-web] Show cancel button for restarting jobs

----


> Cannot cancel failing/restarting streaming job from the command line
> --------------------------------------------------------------------
>
>                 Key: FLINK-3011
>                 URL: https://issues.apache.org/jira/browse/FLINK-3011
>             Project: Flink
>          Issue Type: Bug
>          Components: Command-line client
>    Affects Versions: 0.10.0, 1.0.0
>            Reporter: Gyula Fora
>            Assignee: Ufuk Celebi
>            Priority: Critical
>
> I cannot seem to cancel a failing/restarting job from the command-line client. The job cannot be rescheduled, so it keeps failing:
> The exception I get:
> 13:58:11,240 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Status of job 0c895d22c632de5dfe16c42a9ba818d5 (player-id) changed to RESTARTING.
> 13:58:25,234 INFO  org.apache.flink.runtime.jobmanager.JobManager                - Trying to cancel job with ID 0c895d22c632de5dfe16c42a9ba818d5.
> 13:58:25,561 WARN  akka.remote.ReliableDeliverySupervisor                        - Association with remote system [akka.tcp://flink@127.0.0.1:42012] has failed, address is now gated for [5000] ms. Reason is: [Disassociated].


