
Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/07/01 14:44:00 UTC

[jira] [Commented] (FLINK-5893) Race condition in removing previous JobManagerRegistration in ResourceManager

    [ https://issues.apache.org/jira/browse/FLINK-5893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16071280#comment-16071280 ] 

ASF GitHub Bot commented on FLINK-5893:
---------------------------------------

Github user tillrohrmann commented on the issue:

    https://github.com/apache/flink/pull/3399
  
    Really good fix @zhijiangW. Merging this PR.


> Race condition in removing previous JobManagerRegistration in ResourceManager
> -----------------------------------------------------------------------------
>
>                 Key: FLINK-5893
>                 URL: https://issues.apache.org/jira/browse/FLINK-5893
>             Project: Flink
>          Issue Type: Bug
>          Components: ResourceManager
>            Reporter: zhijiang
>            Assignee: zhijiang
>
> The map of {{JobManagerRegistration}} entries in ResourceManager is not thread-safe, and two threads can currently operate on it concurrently, which can produce unexpected results.
> The scenario is as follows:
>  - {{registerJobManager}}: When the job leader changes and the new JobManager leader registers with the ResourceManager, the new {{JobManagerRegistration}} replaces the old one in the map under the same {{JobID}} key. This is triggered by the rpc thread.
>  - Meanwhile, the {{JobLeaderIdService}} in ResourceManager may detect the job leader change and trigger {{jobLeaderLostLeadership}} in another thread. This action removes the previous {{JobManagerRegistration}} from the map by {{JobID}}, but the old {{JobManagerRegistration}} may already have been replaced by the new one from {{registerJobManager}}.
> In summary, this race condition can cause the new {{JobManagerRegistration}} to be removed from ResourceManager, resulting in an exception when a slot is requested from ResourceManager. It occurs with low probability when running the JobManager failure ITCase.
> As a solution, {{jobLeaderLostLeadership}} can be scheduled via {{runAsync}} on the rpc thread, so no extra lock is needed for the map.
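
The approach described in the issue can be illustrated with a minimal, self-contained sketch. This is not Flink's code: the class name, the use of String job IDs, the single-threaded executor standing in for the rpc thread, and the helper methods currentLeader and shutdown are all assumptions made for illustration. The guard that removes an entry only if it still holds the old leader ID is also an assumption; the issue text itself only states that the callback should be scheduled via runAsync.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Illustrative sketch only; names and types are hypothetical, not Flink's API.
class JobManagerRegistrationSketch {
    // Assumption: a single-threaded executor stands in for the rpc thread.
    private final ExecutorService rpcThread = Executors.newSingleThreadExecutor();

    // A plain HashMap is sufficient because only the rpc thread touches it.
    private final Map<String, UUID> registrations = new HashMap<>();

    // Hypothetical analogue of runAsync: run the action on the rpc thread.
    void runAsync(Runnable action) {
        rpcThread.execute(action);
    }

    // The new leader replaces any previous registration for the same job.
    void registerJobManager(String jobId, UUID leaderId) {
        runAsync(() -> registrations.put(jobId, leaderId));
    }

    // Called from another thread; the map access is deferred to the rpc thread.
    // Assumption: remove only if the entry still belongs to the old leader.
    void jobLeaderLostLeadership(String jobId, UUID oldLeaderId) {
        runAsync(() -> registrations.remove(jobId, oldLeaderId));
    }

    // Reads the current registration on the rpc thread (helper for the demo).
    UUID currentLeader(String jobId) {
        try {
            return rpcThread.submit(() -> registrations.get(jobId)).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    void shutdown() {
        rpcThread.shutdown();
    }

    public static void main(String[] args) {
        JobManagerRegistrationSketch rm = new JobManagerRegistrationSketch();
        UUID oldLeader = UUID.randomUUID();
        UUID newLeader = UUID.randomUUID();

        rm.registerJobManager("job-1", oldLeader);
        rm.registerJobManager("job-1", newLeader);      // new leader registers first
        rm.jobLeaderLostLeadership("job-1", oldLeader); // stale notification arrives late

        // The new registration survives the late notification.
        System.out.println(newLeader.equals(rm.currentLeader("job-1"))); // prints "true"
        rm.shutdown();
    }
}
```

Because every map operation runs on one thread, no lock is required. Serialization alone does not decide the order in which the two actions run, which is why this sketch additionally compares the leader ID before removing the entry.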


