Posted to issues@flink.apache.org by "sunjincheng (JIRA)" <ji...@apache.org> on 2019/06/17 08:59:00 UTC

[jira] [Comment Edited] (FLINK-12865) State inconsistency between RM and TM on the slot status

    [ https://issues.apache.org/jira/browse/FLINK-12865?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865447#comment-16865447 ] 

sunjincheng edited comment on FLINK-12865 at 6/17/19 8:58 AM:
--------------------------------------------------------------

Hi [~gaoyunhaii], thanks for reporting this issue and helping to fix it! :)

Is there any abnormal information (for example in the logs)? If I understand correctly, this should not happen frequently, right?

I ask because I want to evaluate whether this should be a blocker for the 1.8.1 release. If so, we need to fix it as soon as possible and mark it as Critical.


was (Author: sunjincheng121):
Is there any abnormal information? If I understand correctly, this should not happen frequently, right?

I ask because I want to evaluate whether this should be a blocker for the 1.8.1 release. If so, we need to fix it as soon as possible and mark it as Critical.

> State inconsistency between RM and TM on the slot status
> --------------------------------------------------------
>
>                 Key: FLINK-12865
>                 URL: https://issues.apache.org/jira/browse/FLINK-12865
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>            Reporter: Yun Gao
>            Assignee: Yun Gao
>            Priority: Major
>
> There may be state inconsistency between the TM and the RM due to race conditions and message loss:
>  # When the TM sends a heartbeat, it retrieves the SlotReport in the main thread but sends the heartbeat in another thread. The slot on the TM may be FREE initially, so the SlotReport records the FREE state; the RM then requests the slot and marks it as allocated; finally the stale SlotReport wrongly overrides the allocated status on the RM side.
>  # When the RM requests a slot, the TM receives the request but the acknowledgement message gets lost. The RM will then think this slot is free.
>  Both problems may cause the RM to mark an ALLOCATED slot as FREE. Currently this may cause additional retries until the state is synchronized by the next heartbeat, and in the future it may lead to inaccurate resource statistics for fine-grained resource management.
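
The two scenarios in the description can be replayed with a small, single-threaded Java sketch. This is not Flink code: the class name, the maps standing in for TM and RM slot tables, and the slot id "slot-0" are all hypothetical, and the thread interleaving is written out as a fixed sequence of steps so that the outcome is deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified model of the two inconsistencies described above.
// All names here are illustrative assumptions, not Flink's actual classes.
public class SlotStatusRaceSketch {

    public enum SlotState { FREE, ALLOCATED }

    /** Scenario 1: a SlotReport snapshotted before allocation arrives after it.
     *  Returns {state on TM, state in RM's view}. */
    public static SlotState[] staleSlotReport() {
        Map<String, SlotState> tmSlots = new HashMap<>();
        Map<String, SlotState> rmView = new HashMap<>();
        tmSlots.put("slot-0", SlotState.FREE);
        rmView.put("slot-0", SlotState.FREE);

        // Step 1: TM main thread snapshots the slot table for the heartbeat.
        Map<String, SlotState> slotReport = new HashMap<>(tmSlots);

        // Step 2: before the heartbeat is sent, RM requests the slot;
        // both sides now consider it ALLOCATED.
        tmSlots.put("slot-0", SlotState.ALLOCATED);
        rmView.put("slot-0", SlotState.ALLOCATED);

        // Step 3: the heartbeat thread delivers the stale snapshot,
        // and RM overwrites its view with it.
        rmView.putAll(slotReport);

        return new SlotState[] { tmSlots.get("slot-0"), rmView.get("slot-0") };
    }

    /** Scenario 2: TM allocates the slot but the acknowledgement never reaches RM.
     *  Returns {state on TM, state in RM's view}. */
    public static SlotState[] lostAcknowledgement() {
        Map<String, SlotState> tmSlots = new HashMap<>();
        Map<String, SlotState> rmView = new HashMap<>();
        tmSlots.put("slot-0", SlotState.FREE);
        rmView.put("slot-0", SlotState.FREE);

        // TM receives the request and allocates the slot.
        tmSlots.put("slot-0", SlotState.ALLOCATED);

        // The acknowledgement is lost, so RM never updates its view.
        boolean ackDelivered = false;
        if (ackDelivered) {
            rmView.put("slot-0", SlotState.ALLOCATED);
        }

        return new SlotState[] { tmSlots.get("slot-0"), rmView.get("slot-0") };
    }

    public static void main(String[] args) {
        SlotState[] a = staleSlotReport();
        System.out.println("stale report: TM=" + a[0] + ", RM=" + a[1]);
        SlotState[] b = lostAcknowledgement();
        System.out.println("lost ack:     TM=" + b[0] + ", RM=" + b[1]);
    }
}
```

In both scenarios the sketch ends with the slot ALLOCATED on the TM side and FREE in the RM's view, which is the inconsistency the description reports.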


