Posted to issues@yunikorn.apache.org by "Wilfred Spiegelenburg (Jira)" <ji...@apache.org> on 2021/03/02 04:20:00 UTC

[jira] [Commented] (YUNIKORN-551) node removal races for lock during scheduling

    [ https://issues.apache.org/jira/browse/YUNIKORN-551?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17293334#comment-17293334 ] 

Wilfred Spiegelenburg commented on YUNIKORN-551:
------------------------------------------------

The application placement holds the partition lock for the whole of the placement processing. This is not needed, as the application object is not added to the partition until the last step. All manipulation is done on the queues and/or the app itself, not on the partition.

Moving the write lock to the point where it is really needed should decrease the time the write lock is held. It does require the placement manager to use a read-locked version of the getQueue call.
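
As a rough illustration of the change described above, here is a minimal Go sketch. It is not the actual YuniKorn code: the Partition type, its fields, and the placeApplication and AddApplication functions are hypothetical stand-ins. The sketch assumes the partition is guarded by a sync.RWMutex, runs placement through a read-locked GetQueue, and takes the write lock only for the final step that registers the application.

```go
package main

import (
	"fmt"
	"sync"
)

// Partition is a hypothetical stand-in for the scheduler partition.
type Partition struct {
	sync.RWMutex
	queues map[string]bool   // known queue names
	apps   map[string]string // application ID -> queue name
}

// GetQueue is the read-locked lookup that placement relies on.
func (p *Partition) GetQueue(name string) bool {
	p.RLock()
	defer p.RUnlock()
	return p.queues[name]
}

// placeApplication stands in for the placement manager. It only reads
// queues through the read-locked GetQueue and never touches p.apps.
func placeApplication(p *Partition, requested string) (string, error) {
	if p.GetQueue(requested) {
		return requested, nil
	}
	return "", fmt.Errorf("queue %q not found", requested)
}

// AddApplication runs placement without the partition write lock and
// takes the write lock only to add the application to the partition.
func (p *Partition) AddApplication(appID, requested string) error {
	queue, err := placeApplication(p, requested)
	if err != nil {
		return err
	}
	p.Lock()
	defer p.Unlock()
	if _, exists := p.apps[appID]; exists {
		return fmt.Errorf("application %s already exists", appID)
	}
	p.apps[appID] = queue
	return nil
}

func main() {
	p := &Partition{
		queues: map[string]bool{"root.default": true},
		apps:   map[string]string{},
	}
	fmt.Println(p.AddApplication("app-1", "root.default"))
	fmt.Println(p.AddApplication("app-2", "root.missing"))
}
```

Note that sync.RWMutex is not reentrant, so in this sketch placement has to finish before the write lock is taken: calling the read-locked GetQueue while already holding the write lock would block forever.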

> node removal races for lock during scheduling
> ---------------------------------------------
>
>                 Key: YUNIKORN-551
>                 URL: https://issues.apache.org/jira/browse/YUNIKORN-551
>             Project: Apache YuniKorn
>          Issue Type: Bug
>          Components: core - scheduler
>    Affects Versions: 0.10
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Wilfred Spiegelenburg
>            Priority: Blocker
>
> A more complicated version of the deadlock mentioned in YUNIKORN-481.
> In this case the scheduler is racing with the node removal, which in turn removes allocations from the application. The locks taken are all short-term locks, but it could happen that the application being scheduled also has an allocation on a node being removed.
> Scheduling requires the write-locked app to request a read lock on the partition to get all known nodes. The partition takes its write lock while removing the node from its internal list and keeps hold of that write lock while removing the allocations, which tries to lock the app.
> The partition should have released the lock immediately after the node was removed from the list, as the rest of the updates do not modify the partition object.
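
The lock ordering proposed in the description above can be sketched in Go as follows. This is an illustration under stated assumptions, not the YuniKorn implementation: the Application and Partition types, their fields, and the detachNode, RemoveNode and RemoveAllocationsOnNode functions are all hypothetical. The point it shows is that the partition write lock covers only the removal of the node from the internal list, and is released before any application lock is requested.

```go
package main

import (
	"fmt"
	"sync"
)

// Application is a hypothetical stand-in for a scheduled application.
type Application struct {
	sync.RWMutex
	allocations map[string]string // allocation ID -> node ID
}

// RemoveAllocationsOnNode drops every allocation placed on nodeID and
// returns how many were removed. It takes the application write lock.
func (a *Application) RemoveAllocationsOnNode(nodeID string) int {
	a.Lock()
	defer a.Unlock()
	removed := 0
	for id, node := range a.allocations {
		if node == nodeID {
			delete(a.allocations, id)
			removed++
		}
	}
	return removed
}

// Partition is a hypothetical stand-in for the scheduler partition.
type Partition struct {
	sync.RWMutex
	nodes map[string]bool
	apps  []*Application
}

// detachNode removes the node from the internal list under the write
// lock and returns a snapshot of the applications to clean up.
func (p *Partition) detachNode(nodeID string) ([]*Application, bool) {
	p.Lock()
	defer p.Unlock()
	if !p.nodes[nodeID] {
		return nil, false
	}
	delete(p.nodes, nodeID)
	apps := make([]*Application, len(p.apps))
	copy(apps, p.apps)
	return apps, true
}

// RemoveNode releases the partition lock before it locks any
// application, so it never holds both locks at the same time.
func (p *Partition) RemoveNode(nodeID string) int {
	apps, ok := p.detachNode(nodeID)
	if !ok {
		return 0
	}
	released := 0
	for _, app := range apps {
		released += app.RemoveAllocationsOnNode(nodeID)
	}
	return released
}

func main() {
	app := &Application{allocations: map[string]string{
		"alloc-1": "node-1",
		"alloc-2": "node-2",
	}}
	p := &Partition{
		nodes: map[string]bool{"node-1": true, "node-2": true},
		apps:  []*Application{app},
	}
	fmt.Println("allocations released:", p.RemoveNode("node-1"))
}
```

With this ordering, a scheduling cycle that holds the application write lock and asks for a partition read lock only has to wait for the short list update, not for the allocation cleanup that needs the same application lock.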



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
