You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2018/06/12 14:24:00 UTC
[jira] [Commented] (FLINK-9494) Race condition in Dispatcher with concurrent granting and revoking of leaderhship

    [ https://issues.apache.org/jira/browse/FLINK-9494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16509684#comment-16509684 ] 

ASF GitHub Bot commented on FLINK-9494:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

    https://github.com/apache/flink/pull/6155

    [FLINK-9494] Fix race condition in Dispatcher with granting and revoking leadership

    ## What is the purpose of the change
    
    The race condition was caused by the fact that the job recovery is executed outside
    of the main thread. Only after the recovery finishes, the Dispatcher will set the new
    fencing token and start the recovered jobs. The problem arose if in between these two
    operations the Dispatcher gets its leadership revoked. Then it could happen that the
    Dispatcher tries to run the recovered jobs even though it no longer holds the leadership.
    
    The race condition is solved by checking first whether we still hold the leadership which
    is identified by the given leader session id.
    
    This PR is based on #6154.
    
    cc @StefanRRichter 
    
    ## Brief change log
    
    - After recovering the jobs, check whether the leader session is still valid before assigning the fencing token and the starting the recovered jobs
    
    ## Verifying this change
    
    - Added `DispatcherHATest`
    
    ## Does this pull request potentially affect one of the following parts:
    
      - Dependencies (does it add or upgrade a dependency): (no)
      - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
      - The serializers: (no)
      - The runtime per-record code paths (performance sensitive): (no)
      - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
      - The S3 file system connector: (no)
    
    ## Documentation
    
      - Does this pull request introduce a new feature? (no)
      - If yes, how is the feature documented? (not applicable)


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink fixDispatcherRaceCondition

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6155.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6155
    
----
commit 54831c6a07dfcd5691fd732148a7c559514362ec
Author: Till Rohrmann <tr...@...>
Date:   2018-06-12T13:22:50Z

    [hotfix] Remove Scala dependency from ZooKeeperLeaderElectionTest

commit e50034088f4d85fe457e7015f162a6f86b1de9e7
Author: Till Rohrmann <tr...@...>
Date:   2018-06-12T12:24:59Z

    [FLINK-9573] Extend LeaderElectionService#hasLeadership to take leader session id
    
    The new LeaderElectionService#hasLeadership also takes the leader session id and verifies whether
    this is the correct leader session id associated with the leadership.

commit 104b46bd7848cd431afc564f6d3bb364a5257cf9
Author: Till Rohrmann <tr...@...>
Date:   2018-06-12T12:40:13Z

    [hotfix] Fix checkstyle violations in SingleLeaderElectionService

commit b5f9ee50dc208b767d56bc74a1705b853b10aa7c
Author: Till Rohrmann <tr...@...>
Date:   2018-06-12T12:26:15Z

    [hotfix] Make TestingDispatcher and TestingJobManagerRunnerFactory standalone testing classes
    
    Refactors the DispatcherTests and moves the TestingDispatcher and the TestingJobManagerRunnerFactory
    to be top level classes. This makes it easier to reuse them.

commit 190de76b21c3710d0fcbe5d66018c5853707cc33
Author: Till Rohrmann <tr...@...>
Date:   2018-06-12T12:27:30Z

    [FLINK-9494] Fix race condition in Dispatcher with granting and revoking leadership
    
    The race condition was caused by the fact that the job recovery is executed outside
    of the main thread. Only after the recovery finishes, the Dispatcher will set the new
    fencing token and start the recovered jobs. The problem arose if in between these two
    operations the Dispatcher gets its leadership revoked. Then it could happen that the
    Dispatcher tries to run the recovered jobs even though it no longer holds the leadership.
    
    The race condition is solved by checking first whether we still hold the leadership which
    is identified by the given leader session id.

----


> Race condition in Dispatcher with concurrent granting and revoking of leaderhship
> ---------------------------------------------------------------------------------
>
>                 Key: FLINK-9494
>                 URL: https://issues.apache.org/jira/browse/FLINK-9494
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.5.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.6.0, 1.5.1
>
>
> The {{Dispatcher}} contains a race condition when an instance is granted leadership and then quickly afterwards gets the leadership revoked. The problem is that we don't check in the recovered jobs future callback that we still have the leadership. This can lead to a corrupted state of the {{Dispatcher}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)