Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/03/27 10:17:00 UTC

[jira] [Commented] (FLINK-9097) Jobs can be dropped in HA when job submission fails

    [ https://issues.apache.org/jira/browse/FLINK-9097?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16415378#comment-16415378 ] 

ASF GitHub Bot commented on FLINK-9097:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/5774

    [FLINK-9097] Fail fatally if job submission fails when recovering jobs

    ## What is the purpose of the change
    
    In order to not drop jobs, we have to fail fatally if a job submission fails when
    recovering jobs. In HA mode, this will restart the Dispatcher and let it retry
    to recover all jobs.
    
    This PR is based on #5746.
    
    cc @GJL 
    
    ## Brief change log
    
    - Restructured `Dispatcher#submitJob` method
    - Registered callback to listen to job submission result
    - Fail the `Dispatcher` if the job submission fails while recovering a job
    
    ## Verifying this change
    
    - Added `DispatcherTest#testJobSubmissionErrorAfterJobRecovery`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink reintroduceFatalErrorHandler

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5774.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5774
    
----
commit 6c69755c077e78c46a40a0d9f35e435d7ef1618b
Author: Till Rohrmann <tr...@...>
Date:   2018-03-22T09:46:04Z

    [hotfix] Extend TestingFatalErrorHandler to return an error future

commit 080132d8ec938eddb545ff3a80d0039402c48e94
Author: Till Rohrmann <tr...@...>
Date:   2018-03-22T09:46:28Z

    [hotfix] Add BiFunctionWithException

commit ff19155c1bcca8610ca78ae41fa607ede94ddffc
Author: Till Rohrmann <tr...@...>
Date:   2018-03-21T21:36:33Z

    [FLINK-8943] [ha] Fail Dispatcher if jobs cannot be recovered from HA store
    
    In HA mode, the Dispatcher should fail if it cannot recover the persisted jobs. The idea
    is that another Dispatcher will be brought up and try again. This is better than
    simply dropping the jobs that could not be recovered.

commit 4656c2adeb93500c02d63adbfb90b8eecabb474b
Author: Till Rohrmann <tr...@...>
Date:   2018-03-27T07:45:13Z

    [hotfix] Re-introduce FatalErrorHandler to JobManagerRunner

commit 93bb2799c08b23398b6927fb599770260fad2c8f
Author: Till Rohrmann <tr...@...>
Date:   2018-03-27T08:00:56Z

    [hotfix] Correct JavaDocs in SubmittedJobGraphStore and add Nullable annotation

commit 2ba75d09e38d23c242e147adf613af91328b219a
Author: Till Rohrmann <tr...@...>
Date:   2018-03-27T08:59:54Z

    [FLINK-9097] Fail fatally if job submission fails when recovering jobs
    
    In order to not drop jobs, we have to fail fatally if a job submission fails when
    recovering jobs. In HA mode, this will restart the Dispatcher and let it retry
    to recover all jobs.

----


> Jobs can be dropped in HA when job submission fails
> ---------------------------------------------------
>
>                 Key: FLINK-9097
>                 URL: https://issues.apache.org/jira/browse/FLINK-9097
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> Jobs can be dropped in HA mode if the job submission step fails. In such a case, we should fail fatally to let the {{Dispatcher}} restart and retry to recover all jobs.
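
The behavior described in this issue can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from the pull request: `RecoverySketch`, `MiniDispatcher`, `recoverJob`, `startJob`, and the single-method `FatalErrorHandler` interface are hypothetical names chosen for the example, and job ids starting with "bad" are simply assumed to fail submission.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

public class RecoverySketch {

    /** Hypothetical stand-in for a fatal error handler; not Flink's actual interface. */
    interface FatalErrorHandler {
        void onFatalError(Throwable error);
    }

    /** Hypothetical, heavily simplified dispatcher used only for this illustration. */
    static class MiniDispatcher {
        private final Function<String, CompletableFuture<Void>> jobStarter;
        private final FatalErrorHandler fatalErrorHandler;

        MiniDispatcher(
                Function<String, CompletableFuture<Void>> jobStarter,
                FatalErrorHandler fatalErrorHandler) {
            this.jobStarter = jobStarter;
            this.fatalErrorHandler = fatalErrorHandler;
        }

        /** Regular submission: a failure is reported to the caller through the future. */
        CompletableFuture<Void> submitJob(String jobId) {
            return jobStarter.apply(jobId);
        }

        /** Recovery: no client waits for the result, so a failure is escalated. */
        void recoverJob(String jobId) {
            submitJob(jobId).whenComplete((ignored, failure) -> {
                if (failure != null) {
                    // Escalate instead of silently dropping the recovered job.
                    fatalErrorHandler.onFatalError(
                            new RuntimeException("Could not recover job " + jobId, failure));
                }
            });
        }
    }

    /** Assumed job starter: ids starting with "bad" fail, everything else succeeds. */
    static CompletableFuture<Void> startJob(String jobId) {
        CompletableFuture<Void> result = new CompletableFuture<>();
        if (jobId.startsWith("bad")) {
            result.completeExceptionally(new IllegalStateException("submission failed"));
        } else {
            result.complete(null);
        }
        return result;
    }

    public static void main(String[] args) {
        // Collect fatal errors in a list instead of terminating the process.
        List<Throwable> fatalErrors = new ArrayList<>();
        MiniDispatcher dispatcher =
                new MiniDispatcher(RecoverySketch::startJob, fatalErrors::add);

        dispatcher.recoverJob("good-job");
        dispatcher.recoverJob("bad-job");

        System.out.println("fatal errors: " + fatalErrors.size());
    }
}
```

Running `main` prints `fatal errors: 1`, because only the recovery of `bad-job` reaches the handler. In this sketch the handler just records the error; in an HA setup one would expect it to bring the process down so that a restarted Dispatcher can retry the recovery, as the issue describes.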



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)