Posted to issues@flink.apache.org by "Till Rohrmann (JIRA)" <ji...@apache.org> on 2019/06/16 14:06:00 UTC

[jira] [Comment Edited] (FLINK-12863) Race condition between slot offerings and AllocatedSlotReport

    [ https://issues.apache.org/jira/browse/FLINK-12863?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16865016#comment-16865016 ] 

Till Rohrmann edited comment on FLINK-12863 at 6/16/19 2:05 PM:
----------------------------------------------------------------

This race condition causes {{YARNSessionCapacitySchedulerITCase.perJobYarnClusterWithParallelism}} to fail. The log of one such test failure is available at https://api.travis-ci.org/v3/job/546108501/log.txt.


> Race condition between slot offerings and AllocatedSlotReport
> -------------------------------------------------------------
>
>                 Key: FLINK-12863
>                 URL: https://issues.apache.org/jira/browse/FLINK-12863
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.0
>            Reporter: Till Rohrmann
>            Assignee: Till Rohrmann
>            Priority: Critical
>             Fix For: 1.7.3, 1.9.0, 1.8.1
>
>
> With FLINK-11059 we introduced the {{AllocatedSlotReport}}, which the {{TaskExecutor}} uses to synchronize its internal view of slot allocations with the view of the {{JobMaster}}. There appears to be a race condition between offering slots and receiving the report, because the {{AllocatedSlotReport}} is sent by the {{HeartbeatManagerSenderImpl}} from a separate thread.
> As a result, we can generate an {{AllocatedSlotReport}} just before new slots are offered. Since the report is sent from a different thread, the response to the slot offering can be sent earlier than the {{AllocatedSlotReport}}. Consequently, the {{TaskExecutor}} might receive an outdated slot report, causing active slots to be released.
> To solve the problem, I propose adding a fencing token to the {{AllocatedSlotReport}} that is updated whenever we offer new slots to the {{JobMaster}}. When the {{TaskExecutor}} receives an {{AllocatedSlotReport}}, it compares its current slot report fencing token with the received one and processes the report only if they are equal. Otherwise, we wait for the next heartbeat to send us an up-to-date {{AllocatedSlotReport}}.
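
The proposed fencing check can be illustrated with a minimal, single-threaded Java sketch. It is not the actual Flink implementation: the class names ({{SlotReportSketch}}, {{JobMasterSketch}}, {{TaskExecutorSketch}}), the method names, and the choice of a {{long}} counter as the token are all hypothetical and chosen only for illustration. The sketch also assumes that the token travels with each slot offer and is echoed back in the report, which the description above does not spell out.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical stand-in for the report sent along with a heartbeat. */
final class SlotReportSketch {
    final long fencingToken;
    final Set<String> allocatedSlotIds;

    SlotReportSketch(long fencingToken, Set<String> allocatedSlotIds) {
        this.fencingToken = fencingToken;
        this.allocatedSlotIds = allocatedSlotIds;
    }
}

/** Hypothetical JobMaster side: remembers the token of the last offer it saw. */
final class JobMasterSketch {
    private long lastSeenToken;
    private final Set<String> acceptedSlots = new HashSet<>();

    void acceptOffer(long token, Set<String> offeredSlotIds) {
        lastSeenToken = token;
        acceptedSlots.addAll(offeredSlotIds);
    }

    /** Snapshot of the current view; it may be delivered after later offers. */
    SlotReportSketch createReport() {
        return new SlotReportSketch(lastSeenToken, new HashSet<>(acceptedSlots));
    }
}

/** Hypothetical TaskExecutor side: bumps the token on every slot offer. */
final class TaskExecutorSketch {
    private long currentToken;
    private final Set<String> activeSlots = new HashSet<>();

    void offerSlots(Set<String> slotIds, JobMasterSketch jobMaster) {
        currentToken++; // assumption: a simple counter serves as the fencing token
        activeSlots.addAll(slotIds);
        jobMaster.acceptOffer(currentToken, slotIds);
    }

    /** Returns true if the report was processed, false if it was ignored as stale. */
    boolean onSlotReport(SlotReportSketch report) {
        if (report.fencingToken != currentToken) {
            return false; // outdated report: wait for the next heartbeat
        }
        // Release every slot the JobMaster no longer considers allocated.
        activeSlots.retainAll(report.allocatedSlotIds);
        return true;
    }

    Set<String> activeSlots() {
        return new HashSet<>(activeSlots);
    }
}
```

In this sketch, a report created before a second call to {{offerSlots}} carries the older token, so {{onSlotReport}} ignores it and the newly offered slot stays active. A report created afterwards carries the current token and is processed normally.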


