Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2014/03/05 20:14:44 UTC

[jira] [Updated] (TEZ-915) TaskScheduler can get hung when all headroom is used and it cannot utilize existing new containers

     [ https://issues.apache.org/jira/browse/TEZ-915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bikas Saha updated TEZ-915:
---------------------------

    Attachment: TEZ-915.1.patch

Attaching a patch that fixes the preemption loop to handle this case.

It also changes the logic to first try to free unused containers before preempting running tasks.
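
For context, a minimal sketch of that reordering (all class and method names here are illustrative, not the actual TaskScheduler code): release idle containers first, and only preempt a running task when nothing idle could be freed.

{code:java}
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only -- not the actual Tez TaskScheduler code.
class PreemptionOrderSketch {
    static class HeldContainer {
        final int priority;      // priority the container was allocated at
        final boolean running;   // true if it currently hosts a running task
        HeldContainer(int priority, boolean running) {
            this.priority = priority;
            this.running = running;
        }
    }

    // Pick containers to release: free unused containers first, and only
    // fall back to preempting a running task when nothing idle was found.
    static List<HeldContainer> selectForRelease(List<HeldContainer> held) {
        List<HeldContainer> toRelease = new ArrayList<>();
        for (HeldContainer c : held) {
            if (!c.running) {
                toRelease.add(c); // idle/new container: release it first
            }
        }
        if (toRelease.isEmpty() && !held.isEmpty()) {
            // No idle container to free: preempt the lowest-priority
            // running task (larger number = lower priority).
            HeldContainer victim = held.get(0);
            for (HeldContainer c : held) {
                if (c.priority > victim.priority) {
                    victim = c;
                }
            }
            toRelease.add(victim);
        }
        return toRelease;
    }
}
{code}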

Added a unit test and verified manually with help from [~tassapola].

[~hitesh], please review.

> TaskScheduler can get hung when all headroom is used and it cannot utilize existing new containers
> --------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-915
>                 URL: https://issues.apache.org/jira/browse/TEZ-915
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-915.1.patch
>
>
> If there are pending unmatched requests and reused containers, then those containers are released to create space for new allocations that may match.
> However, if there are pending unmatched requests and new containers, then those don't end up being released and the scheduler may hang.
> One scenario where this could happen is when we get a pri4 container for a pri4 request. Before we match it, we also get a pri1 request (let's say for a failed task's re-execution). Now the pri1 task is the highest priority and we always schedule it first. However, it may not match the container. If there is no headroom, the RM will not give us a new pri1 container and we will hang.
> The above case needs to be handled in the preemption logic. When we release the pri4 container we need to make a new request for that resource, because the RM currently thinks it has satisfied our initial pri4 request; the new request ensures the RM will give the resource back to us after it has allocated the pri1 container (see the sketch below).
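>
> A hedged sketch of that release-and-re-request step, assuming the standard YARN AMRMClient API (the class and helper names here are illustrative, not the actual TEZ-915 patch):
>
> {code:java}
> import org.apache.hadoop.yarn.api.records.Container;
> import org.apache.hadoop.yarn.client.api.AMRMClient;
> import org.apache.hadoop.yarn.client.api.AMRMClient.ContainerRequest;
>
> // Illustrative sketch, not the actual TEZ-915 patch.
> class ReRequestOnReleaseSketch {
>     static void releaseAndReRequest(AMRMClient<ContainerRequest> amRmClient,
>                                     Container container) {
>         // Re-ask for an equivalent container BEFORE releasing it: the RM
>         // considers the original (e.g. pri4) request satisfied once the
>         // container was allocated, so without a fresh request the
>         // resource would never come back and the scheduler would hang.
>         ContainerRequest again = new ContainerRequest(
>             container.getResource(),   // same capability (memory/vcores)
>             null,                      // any node
>             null,                      // any rack
>             container.getPriority());  // same priority (e.g. pri4)
>         amRmClient.addContainerRequest(again);
>         amRmClient.releaseAssignedContainer(container.getId());
>     }
> }
> {code}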



--
This message was sent by Atlassian JIRA
(v6.2#6252)