Posted to issues@flink.apache.org by uce <gi...@git.apache.org> on 2015/11/17 15:21:31 UTC

[GitHub] flink pull request: [FLINK-3011, 3019, 3028] Cancel jobs in RESTAR...

GitHub user uce opened a pull request:

    https://github.com/apache/flink/pull/1369

    [FLINK-3011, 3019, 3028] Cancel jobs in RESTARTING state

    This addresses issues with cancelling jobs that are in the `RESTARTING` state. A job enters this state after a failure, as soon as all job vertices are in their final state. It then stays in this state until it is redeployed (currently 100s by default). While in this state, the job cannot be cancelled. If the failure is permanent (for example, missing slots), the job can never be cancelled.
    
    This PR includes changes to the ExecutionGraph and to the clients:
    
    **ExecutionGraph** (FLINK-3011)
    - Remove the state transition from `FAILED` to `RESTARTING` in `restart()`. This was breaking the semantics of `FAILED` being a terminal state. It was only relevant for a test as far as I can tell.
    - When cancelling during restarts, two job states are relevant:
      - `RESTARTING`: try to set the state directly to `CANCELED`, as all vertices have already failed by the time the job enters the `RESTARTING` state. If the state transition to `CANCELED` succeeds, the restart will be ignored with a log message.
      - `FAILING`: try to set the state to `CANCELLING` and wait for the vertices to finish failing. The cancellation then finishes as usual in `jobVertexInFinalState()`.
    
    When reviewing, the `cancel()`, `jobVertexInFinalState()`, and `restart()` methods are the relevant ones.
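
The two cancellation paths described above can be illustrated with a minimal sketch. This is not the code from the PR: the class `CancelSketch`, the method `allVerticesInFinalState()`, and the reduced `JobStatus` enum are hypothetical, and the sketch assumes that state changes are made with compare-and-set and that a successful restart moves the job to `CREATED`.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical, simplified model of the job state handling described above.
// It is NOT Flink's ExecutionGraph; names and structure are assumptions.
final class CancelSketch {

    enum JobStatus { CREATED, FAILING, RESTARTING, CANCELLING, CANCELED, FAILED }

    private final AtomicReference<JobStatus> state;

    CancelSketch(JobStatus initial) {
        this.state = new AtomicReference<>(initial);
    }

    JobStatus getState() {
        return state.get();
    }

    /** Requests cancellation. Returns true if a state transition was made. */
    boolean cancel() {
        while (true) {
            JobStatus current = state.get();
            switch (current) {
                case RESTARTING:
                    // All vertices are already in a final state, so go straight to CANCELED.
                    if (state.compareAndSet(current, JobStatus.CANCELED)) {
                        return true;
                    }
                    break; // lost a race with another transition, re-read the state
                case FAILING:
                    // Vertices are still failing: mark as CANCELLING and let
                    // allVerticesInFinalState() complete the cancellation.
                    if (state.compareAndSet(current, JobStatus.CANCELLING)) {
                        return true;
                    }
                    break;
                default:
                    // Other states are left out of this sketch.
                    return false;
            }
        }
    }

    /** Stand-in for the final-state callback: finishes a cancel or prepares a restart. */
    void allVerticesInFinalState() {
        if (!state.compareAndSet(JobStatus.CANCELLING, JobStatus.CANCELED)) {
            state.compareAndSet(JobStatus.FAILING, JobStatus.RESTARTING);
        }
    }

    /** Restarts only if the job is still RESTARTING; FAILED and CANCELED stay terminal. */
    boolean restart() {
        // Assumption: a restarted job goes back to CREATED before redeployment.
        return state.compareAndSet(JobStatus.RESTARTING, JobStatus.CREATED);
    }
}
```

In this sketch, a `cancel()` that wins the race against `restart()` leaves the job in `CANCELED`, and the later `restart()` call returns `false`, which corresponds to the restart being ignored.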
    
    **CLIFrontend** (FLINK-3019)
    - List restarting jobs with scheduled jobs
    
    ```
    $ bin/flink list
    No running jobs.
    ---------------- Scheduled/Restarting Jobs -------------------
    17.11.2015 15:14:01 : 4b3fa06c88e5a2a4963241e7afca7b7d : Streaming WordCount (RESTARTING)
    --------------------------------------------------------------
    ```
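
As a rough illustration of the listing above, the grouping could be sketched as follows. The class `JobListSketch`, its `Job` type, and both helper methods are hypothetical and do not reflect the actual CLIFrontend code; the sketch also assumes that only restarting jobs get a status suffix, which the sample output alone does not confirm.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of grouping restarting jobs with scheduled ones.
// Not the actual CLIFrontend implementation.
final class JobListSketch {

    enum JobStatus { CREATED, RUNNING, RESTARTING }

    static final class Job {
        final String startTime; // already formatted, e.g. "17.11.2015 15:14:01"
        final String id;
        final String name;
        final JobStatus status;

        Job(String startTime, String id, String name, JobStatus status) {
            this.startTime = startTime;
            this.id = id;
            this.name = name;
            this.status = status;
        }
    }

    /** Returns the jobs that belong under the "Scheduled/Restarting Jobs" header. */
    static List<Job> scheduledOrRestarting(List<Job> jobs) {
        List<Job> result = new ArrayList<>();
        for (Job job : jobs) {
            if (job.status == JobStatus.CREATED || job.status == JobStatus.RESTARTING) {
                result.add(job);
            }
        }
        return result;
    }

    /** Formats one list entry; the suffix rule is an assumption of this sketch. */
    static String describe(Job job) {
        String line = job.startTime + " : " + job.id + " : " + job.name;
        return job.status == JobStatus.RESTARTING ? line + " (RESTARTING)" : line;
    }
}
```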
    
    **WebFrontend** (FLINK-3028)
    - Show the cancel button if the job is restarting. It was only displayed for running or created jobs before.
    
    ---
    
    I want to merge this for 0.10.1 and 1.0.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/uce/flink 3011-restart

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/1369.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #1369
    
----
commit 0c5a3306808bec5b9a833703adbcd9f45bbe6de5
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-11-16T15:18:20Z

    [FLINK-3011] [runtime] Disallow ExecutionGraph state transition from FAILED to RESTARTING
    
    Removes the possibility to go from FAILED state back to RESTARTING. This was only used in a test
    case. It was breaking the terminal state semantics of the FAILED state.

commit 19c602b2ce7686237d8611645a4662aa2b2a0cef
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-11-17T10:40:54Z

    [FLINK-3011] [runtime, tests] Translate ExecutionGraphRestartTest to Java

commit e13dd1bac7029af6ae4157af226131a10f5d02d0
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-11-17T10:56:42Z

    [FLINK-3011] [runtime] Fix cancel during restart

commit 657e34f31fe9c6325900f42c36257b5c5d2019be
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-11-17T13:11:44Z

    [FLINK-3019] [client] List restarting jobs with scheduled jobs

commit 8b2850610aff1197d204bdb7d790df8fb6b5df4c
Author: Ufuk Celebi <uc...@apache.org>
Date:   2015-11-17T13:51:15Z

    [FLINK-3028] [runtime-web] Show cancel button for restarting jobs

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

[GitHub] flink pull request: [FLINK-3011, 3019, 3028] Cancel jobs in RESTAR...

Posted by uce <gi...@git.apache.org>.
Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/1369#issuecomment-158126406
  
    Failed test is unrelated. Merging...



[GitHub] flink pull request: [FLINK-3011, 3019, 3028] Cancel jobs in RESTAR...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/1369



[GitHub] flink pull request: [FLINK-3011, 3019, 3028] Cancel jobs in RESTAR...

Posted by uce <gi...@git.apache.org>.
Github user uce commented on the pull request:

    https://github.com/apache/flink/pull/1369#issuecomment-158047558
  
    I've updated the PR as you suggested and rebased on the current master. Waiting for Travis. After that, I think it is ready to be merged to `release-0.10` and `master`.



[GitHub] flink pull request: [FLINK-3011, 3019, 3028] Cancel jobs in RESTAR...

Posted by gyfora <gi...@git.apache.org>.
Github user gyfora commented on the pull request:

    https://github.com/apache/flink/pull/1369#issuecomment-157383381
  
    I verified the intended behaviour with an application on a cluster. Cancelling works both from the command line and from the web interface.
    
    +1 from my side, this is a critical fix for production environments.



[GitHub] flink pull request: [FLINK-3011, 3019, 3028] Cancel jobs in RESTAR...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1369#issuecomment-157721702
  
    Pretty nice otherwise, +1



[GitHub] flink pull request: [FLINK-3011, 3019, 3028] Cancel jobs in RESTAR...

Posted by StephanEwen <gi...@git.apache.org>.
Github user StephanEwen commented on the pull request:

    https://github.com/apache/flink/pull/1369#issuecomment-157719028
  
    Looks good. I am wondering, though, whether RESTARTING jobs should rather be listed among RUNNING jobs?

