Posted to issues@flink.apache.org by "Till Rohrmann (Jira)" <ji...@apache.org> on 2020/12/15 10:10:00 UTC

[jira] [Closed] (FLINK-19832) Improve handling of immediately failed physical slot in SlotSharingExecutionSlotAllocator

     [ https://issues.apache.org/jira/browse/FLINK-19832?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Rohrmann closed FLINK-19832.
---------------------------------
    Resolution: Fixed

Fixed via

1.13.0:
58ed205390078929c0691196cf692029c837ea9c
f097b9387e99876f7b2a02a049a1b2783554390d

1.12.1:
b4155ecd2185b82cff713d4382f5245d661ec353
3e8448c65a50bc94676e6a916d9a6b18b12b7210

> Improve handling of immediately failed physical slot in SlotSharingExecutionSlotAllocator
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-19832
>                 URL: https://issues.apache.org/jira/browse/FLINK-19832
>             Project: Flink
>          Issue Type: Improvement
>          Components: Runtime / Coordination
>    Affects Versions: 1.12.0
>            Reporter: Andrey Zagrebin
>            Assignee: Andrey Zagrebin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.13.0, 1.12.1
>
>
> Improve handling of immediately failed physical slot in SlotSharingExecutionSlotAllocator
> If a physical slot future immediately fails for a new SharedSlot in SlotSharingExecutionSlotAllocator#getOrAllocateSharedSlot but we continue to add logical slots to this SharedSlot, the logical slot eventually also fails and gets removed from the {{SharedSlot}}, which then gets released (state RELEASED). The subsequent additions of logical slots in the loop of {{allocateLogicalSlotsFromSharedSlots}} will fail the scheduling on the ALLOCATED state check because the SharedSlot will already be RELEASED.
> The subsequent bulk timeout check will also not find the SharedSlot and will fail with an NPE.
> Hence, a SharedSlot whose physical slot future failed immediately should not be kept in the SlotSharingExecutionSlotAllocator, and the logical slot requests depending on it can immediately be returned as failed. The bulk timeout check does not need to be started because, if some physical (and the dependent logical) slot requests fail, the whole bulk will be canceled by the scheduler.
> If the last assumption does not hold for future scheduling, this bulk failure might need additional explicit cancelation of pending requests. We expect to refactor this for declarative scheduling anyway.
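
The behavior proposed in the description can be illustrated with a minimal Java sketch. This is not Flink's actual code: the class SlotAllocatorSketch, its methods, and the use of plain strings for slots and groups are all hypothetical stand-ins, chosen only to show the idea that a shared slot whose physical slot future has already failed is never registered, and that the dependent logical slot request is returned as failed right away.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CompletableFuture;
import java.util.function.Function;

// Hypothetical, simplified stand-in for the allocator; not the Flink class.
final class SlotAllocatorSketch {
    // Assumed bookkeeping: slot sharing group name -> physical slot future.
    private final Map<String, CompletableFuture<String>> sharedSlots = new HashMap<>();
    // Assumed provider that requests a physical slot for a group.
    private final Function<String, CompletableFuture<String>> physicalSlotProvider;

    SlotAllocatorSketch(Function<String, CompletableFuture<String>> physicalSlotProvider) {
        this.physicalSlotProvider = physicalSlotProvider;
    }

    // Returns a logical slot future for the given execution in the given group.
    CompletableFuture<String> allocateLogicalSlot(String group, String executionId) {
        CompletableFuture<String> physical = sharedSlots.get(group);
        if (physical == null) {
            physical = physicalSlotProvider.apply(group);
            // Only track the shared slot if its physical slot request
            // has not already failed at this point.
            if (!physical.isCompletedExceptionally()) {
                sharedSlots.put(group, physical);
            }
        }
        // If the physical future has failed, the derived logical slot future
        // is completed exceptionally as well, so the caller sees the failure
        // immediately and nothing stale remains in the map.
        return physical.thenApply(slot -> slot + "/" + executionId);
    }

    // Whether a shared slot is currently tracked for the group.
    boolean isTracked(String group) {
        return sharedSlots.containsKey(group);
    }
}
```

In this sketch, a later lookup such as a timeout check cannot encounter a released shared slot for the failed group, because that shared slot was never stored in the first place. Handling of physical slot futures that fail later, and the bulk cancelation by the scheduler, are deliberately left out.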



--
This message was sent by Atlassian Jira
(v8.3.4#803005)