Posted to issues@flink.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2017/02/23 09:40:44 UTC
[jira] [Commented] (FLINK-5893) Race condition in removing previous
JobManagerRegistration in ResourceManager
[ https://issues.apache.org/jira/browse/FLINK-5893?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15880183#comment-15880183 ]
ASF GitHub Bot commented on FLINK-5893:
---------------------------------------
GitHub user zhijiangW opened a pull request:
https://github.com/apache/flink/pull/3399
[FLINK-5893][ResourceManager]Fix the bug of race condition for removing previous JobManagerRegistration in ResourceManager
The map of **JobManagerRegistration** entries in the ResourceManager is not thread-safe, and two threads can currently modify it concurrently, which can produce unexpected results.
The scenario is as follows:
- **registerJobManager**: When the job leader changes and the new JobManager leader registers at the ResourceManager, the new **JobManagerRegistration** replaces the old one in the map under the same key, the **JobID**. This step is triggered by the RPC thread.
- Meanwhile, the **JobLeaderIdService** in the ResourceManager may notice the job leader change and trigger **jobLeaderLostLeadership** in another thread. This action removes the previous **JobManagerRegistration** from the map by **JobID**, but by then the old **JobManagerRegistration** may already have been replaced by the new one from **registerJobManager**.
In summary, this race condition may cause the new **JobManagerRegistration** to be removed from the ResourceManager, which results in an exception when a slot is later requested from the ResourceManager. It occurs with low probability when running the JobManager failure ITCase.
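The interleaving described above can be replayed deterministically in a single thread. The following is only an illustrative sketch: the class name, the use of plain strings for the **JobID** and the leader, and the `HashMap` are assumptions made for the example and do not reproduce the ResourceManager code.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the registration map; not Flink code.
class RegistrationRaceReplay {

    /** Replays the problematic order of operations and returns what is left in the map. */
    static String replay() {
        // Key: job id, value: id of the registered JobManager leader (both hypothetical).
        Map<String, String> leaderByJob = new HashMap<>();
        String jobId = "job-1";

        // The old leader registered earlier.
        leaderByJob.put(jobId, "old-leader");

        // Step 1 (RPC thread): the new leader registers and replaces the old entry.
        leaderByJob.put(jobId, "new-leader");

        // Step 2 (other thread, arriving late): the lost-leadership action removes
        // by job id only, so it deletes the NEW registration instead of the old one.
        leaderByJob.remove(jobId);

        // No registration is left for the job.
        return leaderByJob.get(jobId);
    }

    public static void main(String[] args) {
        System.out.println("registration after replay: " + replay());
    }
}
```

After the replay the job has no registration at all, which corresponds to the situation in which a later slot request fails.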
As a solution, **jobLeaderLostLeadership** can be scheduled via **runAsync** so that it runs in the RPC thread.
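A minimal sketch of this idea, assuming a single-threaded executor as a stand-in for the RPC thread. The class, the `runAsync` helper shown here, and the string identifiers are hypothetical and are not Flink's actual API. The conditional `Map.remove(key, value)` is an additional assumption of this sketch, not something stated above: running every map access on one thread removes the need for a lock, but a late notification could still be processed after the new registration, so the sketch removes the entry only if it still belongs to the old leader.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch of thread confinement; not the ResourceManager implementation.
class MainThreadConfinementSketch {
    // Stand-in for the RPC thread: tasks run one at a time, in submission order.
    private final ExecutorService mainThread = Executors.newSingleThreadExecutor();
    // Plain HashMap: safe only because every access happens on mainThread.
    private final Map<String, String> leaderByJob = new HashMap<>();

    // Hypothetical analogue of runAsync: hand the action over to the main thread.
    void runAsync(Runnable action) {
        mainThread.execute(action);
    }

    void registerJobManager(String jobId, String leaderId) {
        runAsync(() -> leaderByJob.put(jobId, leaderId));
    }

    // May be called from any thread, for example a leader-election callback.
    void jobLeaderLostLeadership(String jobId, String oldLeaderId) {
        // Removes the entry only if it is still mapped to the old leader.
        runAsync(() -> leaderByJob.remove(jobId, oldLeaderId));
    }

    // Reads the map on the main thread as well and waits for the result.
    String currentLeader(String jobId) {
        Callable<String> read = () -> leaderByJob.get(jobId);
        try {
            return mainThread.submit(read).get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    void shutdown() {
        mainThread.shutdown();
    }

    static String demo() {
        MainThreadConfinementSketch rm = new MainThreadConfinementSketch();
        try {
            rm.registerJobManager("job-1", "leader-A");
            rm.registerJobManager("job-1", "leader-B");      // new leader replaces A
            rm.jobLeaderLostLeadership("job-1", "leader-A"); // stale notification for A
            return rm.currentLeader("job-1");
        } finally {
            rm.shutdown();
        }
    }

    public static void main(String[] args) {
        System.out.println("current leader: " + demo());
    }
}
```

In `demo()` the stale notification for `leader-A` is processed after `leader-B` has registered, and the registration of `leader-B` stays in the map.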
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/zhijiangW/flink FLINK-5893
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/flink/pull/3399.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3399
----
commit 28e783821063509123c33db873e00755ebcbb561
Author: 淘江 <ta...@alibaba-inc.com>
Date: 2017-02-23T09:30:24Z
[FLINK-5893][ResourceManager]Fix the bug of race condition for removing previous JobManagerRegistration in ResourceManager
----
> Race condition in removing previous JobManagerRegistration in ResourceManager
> -----------------------------------------------------------------------------
>
> Key: FLINK-5893
> URL: https://issues.apache.org/jira/browse/FLINK-5893
> Project: Flink
> Issue Type: Bug
> Components: ResourceManager
> Reporter: zhijiang
> Assignee: zhijiang
>
> The map of {{JobManagerRegistration}} entries in the ResourceManager is not thread-safe, and two threads can currently modify it concurrently, which can produce unexpected results.
> The scenario is as follows:
> - {{registerJobManager}}: When the job leader changes and the new JobManager leader registers at the ResourceManager, the new {{JobManagerRegistration}} replaces the old one in the map under the same key, the {{JobID}}. This step is triggered by the RPC thread.
> - Meanwhile, the {{JobLeaderIdService}} in the ResourceManager may notice the job leader change and trigger {{jobLeaderLostLeadership}} in another thread. This action removes the previous {{JobManagerRegistration}} from the map by {{JobID}}, but by then the old {{JobManagerRegistration}} may already have been replaced by the new one from {{registerJobManager}}.
> In summary, this race condition may cause the new {{JobManagerRegistration}} to be removed from the ResourceManager, which results in an exception when a slot is later requested from the ResourceManager. It occurs with low probability when running the JobManager failure ITCase.
> As a solution, {{jobLeaderLostLeadership}} can be scheduled via {{runAsync}} so that it runs in the RPC thread, which avoids the need for an extra lock on the map.