Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2020/01/24 10:40:00 UTC

[jira] [Resolved] (FLINK-14701) Slot leaks if SharedSlotOversubscribedException happens

     [ https://issues.apache.org/jira/browse/FLINK-14701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann resolved FLINK-14701.
-----------------------------------
    Fix Version/s:     (was: 1.11.0)
                   1.10.0
       Resolution: Fixed

Fixed via

master: bd5901f3351d1f70a8b0edcfee8015f91b099f3a
1.10.0: 27d4cc4b66bf7485d5d26660f9640a1e8862a9c2
1.9.2: 0bdd21a36136538b554d7e4bd7ce80bc3c461596

> Slot leaks if SharedSlotOversubscribedException happens
> -------------------------------------------------------
>
>                 Key: FLINK-14701
>                 URL: https://issues.apache.org/jira/browse/FLINK-14701
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.9.2
>            Reporter: Zhu Zhu
>            Assignee: Zhu Zhu
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.9.2, 1.10.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If a {{SharedSlotOversubscribedException}} occurs, the {{MultiTaskSlot}} releases some of its child {{SingleTaskSlot}}s. The release triggers a re-allocation of the task slot right inside {{SingleTaskSlot#release(...)}}. As a result, the previous allocation in {{SlotSharingManager#allTaskSlots}} is replaced by the new allocation, because both share the same {{slotRequestId}}.
> However, {{SingleTaskSlot#release(...)}} then invokes {{MultiTaskSlot#releaseChild}} to release the previous allocation by its {{slotRequestId}}, which unexpectedly removes the new allocation from the {{SlotSharingManager}}.
> This leaks a slot: the pending slot request is no longer tracked by the {{SlotSharingManager}} and cannot be released when its payload terminates.
> A test case {{testNoSlotLeakOnSharedSlotOversubscribedException}} which exhibits this issue can be found in this [commit|https://github.com/zhuzhurk/flink/commit/9024e2e9eb4bd17f371896d6dbc745bc9e585e14].
> The slot leak blocks the TPC-DS queries on Flink 1.10; see FLINK-14674.
> To fix this, I propose tightening {{MultiTaskSlot#releaseChild}} so that it only removes its actual child task slot from the {{SlotSharingManager}}, i.e. add the check {{if (child == allTaskSlots.get(child.getSlotRequestId()))}} before invoking {{allTaskSlots.remove(child.getSlotRequestId())}}.
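
The failure sequence and the proposed identity check can be sketched with a toy model. This is a minimal sketch, not Flink's code: SlotLeakSketch, TaskSlot, and the two releaseChild helpers are hypothetical stand-ins, and a plain HashMap stands in for allTaskSlots. It only assumes what the description states, namely that the re-allocation reuses the same slotRequestId and overwrites the map entry before the release of the previous allocation has finished.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy model of the bookkeeping described in the issue. These classes are
 * illustrative stand-ins, not Flink's SlotSharingManager or its task slots.
 */
final class SlotLeakSketch {

    /** Hypothetical stand-in for a task slot, identified by its slot request id. */
    static final class TaskSlot {
        final String slotRequestId;

        TaskSlot(String slotRequestId) {
            this.slotRequestId = slotRequestId;
        }
    }

    /** Unguarded removal: drops whatever is registered under the child's id. */
    static void releaseChildUnguarded(Map<String, TaskSlot> allTaskSlots, TaskSlot child) {
        allTaskSlots.remove(child.slotRequestId);
    }

    /** Guarded removal: drops the entry only if it is this exact child instance. */
    static void releaseChildGuarded(Map<String, TaskSlot> allTaskSlots, TaskSlot child) {
        if (child == allTaskSlots.get(child.slotRequestId)) {
            allTaskSlots.remove(child.slotRequestId);
        }
    }

    /**
     * Replays the sequence from the issue and reports whether the new
     * allocation is still tracked afterwards.
     */
    static boolean newAllocationStillTracked(boolean guarded) {
        Map<String, TaskSlot> allTaskSlots = new HashMap<>();
        TaskSlot previous = new TaskSlot("request-1");
        allTaskSlots.put(previous.slotRequestId, previous);

        // Releasing `previous` triggers a re-allocation that reuses the same id
        // and overwrites the map entry before the release has finished.
        TaskSlot replacement = new TaskSlot("request-1");
        allTaskSlots.put(replacement.slotRequestId, replacement);

        // The release of `previous` now reaches the parent's child cleanup.
        if (guarded) {
            releaseChildGuarded(allTaskSlots, previous);
        } else {
            releaseChildUnguarded(allTaskSlots, previous);
        }

        return allTaskSlots.get("request-1") == replacement;
    }

    public static void main(String[] args) {
        System.out.println("unguarded keeps new allocation: " + newAllocationStillTracked(false));
        System.out.println("guarded keeps new allocation:   " + newAllocationStillTracked(true));
    }
}
```

In this model the unguarded variant loses the replacement entry, which corresponds to the untracked pending request described above, while the guarded variant leaves it in place. The sketch uses a reference comparison (==) as in the proposed check; Map.remove(key, value) would behave the same way only when equals is identity-based.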


