Posted to issues@tez.apache.org by "Bikas Saha (JIRA)" <ji...@apache.org> on 2013/08/14 00:42:48 UTC

[jira] [Commented] (TEZ-344) Support delayed scheduling for re-used containers

    [ https://issues.apache.org/jira/browse/TEZ-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13738956#comment-13738956 ] 

Bikas Saha commented on TEZ-344:
--------------------------------

Looks good overall.

Could ContainerManager.start() and stop() move into the start() and stop() of TaskScheduler? Probably before asyncClient.start() and asyncClient.stop(), respectively.

DelayedContainerManager.stop() should probably release all pending containers.

My understanding was that the delay was mainly to prevent premature non-node-local assignments. Unless I am reading the patch wrong, it looks like node/rack/* assignments are all tried first, and only if the container is still unassigned after that is it put into the delayed container list. What am I missing? I was expecting some interaction with the rackLocal/non-local flags.

What happens to a delayed container that gets polled but then fails to get assigned? It doesn't look like it is being added back to the queue.

Will the following suggestion help simplify things a bit?
When a container is deallocated, add it directly to the delayedContainers queue, then call delayedContainers.allocateAll(). Also call delayedContainers.allocateAll() when a new task shows up (as in the patch) and on every heartbeat from the RM (the getProgress() callback from asyncClient). Each call to delayedContainers.allocateAll() cycles through all containers and tries to allocate those that are within the grace period, or else releases them (this can be done in the last assignAllocatedContainers() call by passing a true flag for release). This assumes that the RM heartbeat interval is shorter than the grace period, which should be the case; otherwise we are not giving the RM a chance to give us new containers that are a better match than the containers we want to reuse.
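
To make the suggestion concrete, here is a rough sketch of what such a queue could look like. It is not taken from the patch: HeldContainer, DelayedContainers, the tryAssign callback and the released/assigned lists are all hypothetical stand-ins, and time is passed in explicitly so the behaviour is easy to follow. It takes one reading of the idea, where every queued container is offered to the pending tasks on each pass and is released only if it is still unassigned once its grace period has expired.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical stand-in for a reusable container; not a Tez or YARN class. */
final class HeldContainer {
    final String id;
    final long deallocatedAtMs;

    HeldContainer(String id, long deallocatedAtMs) {
        this.id = id;
        this.deallocatedAtMs = deallocatedAtMs;
    }
}

/** Hypothetical sketch of the delayed-container queue described above. */
final class DelayedContainers {
    private final long gracePeriodMs;
    private final LinkedList<HeldContainer> queue = new LinkedList<>();
    // Recorded only so the outcome can be inspected; a real scheduler
    // would hand released containers back to the RM instead.
    final List<String> assigned = new ArrayList<>();
    final List<String> released = new ArrayList<>();

    DelayedContainers(long gracePeriodMs) {
        this.gracePeriodMs = gracePeriodMs;
    }

    /** Called directly when a container is deallocated. */
    void add(HeldContainer c) {
        queue.add(c);
    }

    /**
     * Called on deallocation, on new task arrival and on every RM heartbeat.
     * tryAssign (an assumption) returns true if a pending task took the container.
     */
    void allocateAll(long nowMs, Predicate<HeldContainer> tryAssign) {
        Iterator<HeldContainer> it = queue.iterator();
        while (it.hasNext()) {
            HeldContainer c = it.next();
            if (tryAssign.test(c)) {
                assigned.add(c.id);
                it.remove();
            } else if (nowMs - c.deallocatedAtMs >= gracePeriodMs) {
                released.add(c.id);   // grace period over: give it back
                it.remove();
            }
            // Otherwise the container simply stays queued for the next pass.
        }
    }

    /** On shutdown, release everything that is still pending. */
    void stop() {
        for (HeldContainer c : queue) {
            released.add(c.id);
        }
        queue.clear();
    }

    int pending() {
        return queue.size();
    }
}
```

Because a container only leaves the queue when it is assigned or released, a container that fails to get assigned on one pass cannot be dropped; it is retried on the next call.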

We have two booleans, releaseUnassigned and queueUnassigned. Is it possible for an unassigned container to get neither released nor queued and thus get lost? The suggestion in the above comment probably removes the need for queueUnassigned, since queuing will be done outside of assignAllocatedContainers().
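
To illustrate the concern, here is a small sketch of how two independent flags leave a gap. It is a guess at the shape of the logic, not the patch's code: the Outcome enum and the handle() helper are hypothetical, and only the flag names come from the patch.

```java
/** Hypothetical outcomes for a container passed through assignment. */
enum Outcome { ASSIGNED, RELEASED, QUEUED, LOST }

/** Hypothetical helper; only the two flag names come from the patch. */
final class UnassignedHandling {
    static Outcome handle(boolean assigned,
                          boolean releaseUnassigned,
                          boolean queueUnassigned) {
        if (assigned) {
            return Outcome.ASSIGNED;
        }
        if (releaseUnassigned) {
            return Outcome.RELEASED;
        }
        if (queueUnassigned) {
            return Outcome.QUEUED;
        }
        // Both flags false: nothing holds on to the container any more.
        return Outcome.LOST;
    }
}
```

With two booleans there are four combinations for an unassigned container, and the false/false one has no owner. If queuing happens outside assignAllocatedContainers(), a single release flag would be enough.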
                
> Support delayed scheduling for re-used containers
> -------------------------------------------------
>
>                 Key: TEZ-344
>                 URL: https://issues.apache.org/jira/browse/TEZ-344
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>              Labels: TEZ-0.2.0
>         Attachments: TEZ-344.wip.txt
>
>
> This, for now, is primarily to help with testing of Tez on clusters.
> Would have to go in with a warning since this could cause jobs to hang / run for a long time.
> Longer term, this can be enhanced to set limits on how long to wait before assigning non-local tasks.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira