Posted to issues@tez.apache.org by "Hitesh Shah (JIRA)" <ji...@apache.org> on 2014/03/05 20:46:48 UTC

[jira] [Commented] (TEZ-915) TaskScheduler can get hung when all headroom is used and it cannot utilize existing new containers

    [ https://issues.apache.org/jira/browse/TEZ-915?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13921299#comment-13921299 ] 

Hitesh Shah commented on TEZ-915:
---------------------------------

+1. Looks good. I think we should move to task-based scheduling soon to avoid these kinds of problems.

> TaskScheduler can get hung when all headroom is used and it cannot utilize existing new containers
> --------------------------------------------------------------------------------------------------
>
>                 Key: TEZ-915
>                 URL: https://issues.apache.org/jira/browse/TEZ-915
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Bikas Saha
>            Assignee: Bikas Saha
>         Attachments: TEZ-915.1.patch
>
>
> If there are pending unmatched requests and reused containers, those containers are released to create space for new allocations that may match.
> However, if there are pending unmatched requests and new containers, those do not end up being released and the scheduler may hang.
> One scenario where this could happen: we get a pri4 container for a pri4 request. Before we match it, we also get a pri1 request (say, for a failed re-execution). The pri1 task is now the highest priority, and we always schedule the highest priority first. However, it may not match the container. If there is no headroom, the RM will not give us a new pri1 container and we will hang.
> The above case needs to be handled in the preemption logic. When we release the pri4 container, we must also make a new request for that resource, to ensure the RM gives it back to us after it has allocated the pri1 container; currently the RM thinks it has already satisfied our initial pri4 request.
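The preemption logic described above can be sketched roughly as follows. This is a minimal illustrative model, not Tez's actual TaskScheduler API: all class and field names here are hypothetical, and priorities are reduced to plain integers (lower number = higher priority). The key point it demonstrates is the fix from the last paragraph: a released container's resource is re-requested so the RM does not treat the original request as already satisfied.

```java
import java.util.*;

// Hypothetical sketch (names are illustrative, not Tez's real API) of the fix:
// when a held container cannot serve the highest-priority pending request and
// there is no headroom, release it AND re-request its resource at the original
// priority, so the RM will grant it again after satisfying the higher priority.
public class PreemptionSketch {

    static class Scheduler {
        // Pending task-request priorities; TreeSet keeps the highest
        // priority (smallest number) first.
        final TreeSet<Integer> pendingRequestPriorities = new TreeSet<>();
        // Containers granted by the RM but not yet assigned to a task.
        final List<Integer> heldContainerPriorities = new ArrayList<>();
        // Resource requests re-issued for released containers (the fix).
        final List<Integer> reRequestedPriorities = new ArrayList<>();

        // Release held containers that do not match the highest-priority
        // pending request, re-requesting each released resource.
        void preemptForHighestPriority() {
            if (pendingRequestPriorities.isEmpty()) {
                return;
            }
            int highest = pendingRequestPriorities.first();
            Iterator<Integer> it = heldContainerPriorities.iterator();
            while (it.hasNext()) {
                int containerPri = it.next();
                if (containerPri != highest) {
                    it.remove();                             // release to the RM
                    reRequestedPriorities.add(containerPri); // ask again later
                }
            }
        }
    }

    public static void main(String[] args) {
        Scheduler s = new Scheduler();
        s.pendingRequestPriorities.add(4); // original pri4 request
        s.heldContainerPriorities.add(4);  // RM granted a pri4 container
        s.pendingRequestPriorities.add(1); // failed re-execution arrives at pri1
        s.preemptForHighestPriority();
        // The pri4 container is released and its resource re-requested, so the
        // pri4 work is not lost once the RM has satisfied the pri1 request.
        System.out.println(s.heldContainerPriorities.isEmpty());
        System.out.println(s.reRequestedPriorities);
    }
}
```

Without the `reRequestedPriorities` step, releasing the pri4 container would free headroom for the pri1 allocation but leave no outstanding pri4 request, reproducing the hang the issue describes.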



--
This message was sent by Atlassian JIRA
(v6.2#6252)