Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/01/25 14:20:00 UTC

[jira] [Commented] (FLINK-8488) Dispatcher does not recover Jobs

    [ https://issues.apache.org/jira/browse/FLINK-8488?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16339268#comment-16339268 ] 

ASF GitHub Bot commented on FLINK-8488:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/5363

    [FLINK-8488] [flip6] Fix Dispatcher job recovery bug

    ## What is the purpose of the change
    
    Previously, the Dispatcher accepted a job submission only if the
    RunningJobsRegistry reported the job's JobSchedulingStatus as PENDING. It now
    also accepts submissions whose JobSchedulingStatus is RUNNING, so that jobs
    can be recovered after a failover. A job is rejected only if it is marked as
    DONE.
    
    ## Verifying this change
    
    - Added `DispatcherTest#testJobRecovery`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixFlip6JobRecovery

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/5363.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #5363
    
----
commit f4b56840a52091984655cd20a8939e6cace92ddf
Author: Till Rohrmann <tr...@...>
Date:   2018-01-25T14:15:57Z

    [FLINK-8488] [flip6] Fix Dispatcher job recovery bug
    
    Instead of only accepting job submissions if the RunningJobRegistry signals
    that the job's JobSchedulingStatus is PENDING, the Dispatcher now also accepts
    if the job's JobSchedulingStatus is RUNNING. Only if the job is marked as
    DONE, it will be rejected.

----


> Dispatcher does not recover Jobs
> --------------------------------
>
>                 Key: FLINK-8488
>                 URL: https://issues.apache.org/jira/browse/FLINK-8488
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>         Environment: 776af4a882c85926fc0764b702fec717c675e34c
>            Reporter: Gary Yao
>            Assignee: Till Rohrmann
>            Priority: Blocker
>              Labels: flip-6
>             Fix For: 1.5.0
>
>
> Dispatcher does not recover jobs on failover (FLIP-6 mode).
> *Steps to reproduce*:
>  # {{bin/start-cluster.sh flip6}}
>  # {{bin/flink run -p1 -flip6 examples/batch/WordCount.jar --input /path/to/largefile.txt}}
>  # Wait until the job is running, then run {{bin/jobmanager.sh stop flip6 && bin/jobmanager.sh start flip6}} to restart the master.
>  # Wait until a leader is elected and verify that no jobs are running.
> *Analysis*
>  * Dispatcher checks on {{submitJob}} whether the job scheduling status is {{PENDING}} and only then allows resubmission of the job. However, the job is marked as {{RUNNING}} in ZooKeeper.
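
The status check described in the analysis, and the change proposed in the pull request, can be illustrated with a minimal, self-contained sketch. This is not the actual Dispatcher code: the class name `SubmissionCheckSketch` and both method names are hypothetical, and the enum only mirrors the three JobSchedulingStatus values mentioned in this thread.

```java
// Hypothetical sketch for illustration only; not taken from the Flink code base.
public class SubmissionCheckSketch {

    /** Mirrors the three scheduling states named in this thread. */
    enum JobSchedulingStatus { PENDING, RUNNING, DONE }

    /** Behaviour described in the analysis: only PENDING jobs are accepted. */
    static boolean acceptsBeforeFix(JobSchedulingStatus status) {
        return status == JobSchedulingStatus.PENDING;
    }

    /** Behaviour described in the pull request: everything except DONE is accepted. */
    static boolean acceptsAfterFix(JobSchedulingStatus status) {
        return status != JobSchedulingStatus.DONE;
    }

    public static void main(String[] args) {
        // Print the decision for each status under both rules.
        for (JobSchedulingStatus status : JobSchedulingStatus.values()) {
            System.out.println(status
                    + ": before=" + acceptsBeforeFix(status)
                    + ", after=" + acceptsAfterFix(status));
        }
    }
}
```

Under these assumptions, a job that is marked as RUNNING in ZooKeeper after a master restart is rejected by the first check and accepted by the second, which matches the failure reported above.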


