Posted to issues@tez.apache.org by "Siddharth Seth (JIRA)" <ji...@apache.org> on 2016/11/10 01:21:58 UTC

[jira] [Commented] (TEZ-3491) Tez job can hang due to container priority inversion

    [ https://issues.apache.org/jira/browse/TEZ-3491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15652613#comment-15652613 ] 

Siddharth Seth commented on TEZ-3491:
-------------------------------------

[~jlowe] - the patch looks good to me. +1. As you pointed out, it's not ideal to wait for timeouts and then make the container requests again.
What scenario led to this situation? Did YARN end up handing out a lower priority ask before a higher priority ask, did the AM decide to release containers that it had obtained at a higher priority, or were there multiple DAGs in the same AM?

You had mentioned an alternate approach / potential improvement to solve the same problem in an offline discussion. I have a little more context after looking at the code again. It would be useful if you could add some notes about that on the jira.

For sessions, where containers can be reused across multiple DAGs, I think the logic of not giving a lower priority container to a higher priority task, because of the risk of 'polluting' the container, no longer holds. The held containers could be at any priority level. Maybe we should have a mode where the container priority check is not enforced. This would work for most submissions. It'll be problematic if the same file, the same jar, or conflicting jars are specified as LRs (local resources) for different Vertices / DAGs.
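
To make the suggested mode concrete, here is a rough sketch of the check with and without enforcement. It is not Tez code: the function name, the flag name, and the convention that a smaller number means a higher priority are all assumptions made for illustration.

```python
def can_use_container(container_priority, task_priority, enforce_priority_check=True):
    """Decide whether a held container may run a task (hypothetical helper).

    Assumption: a smaller number means a higher priority.
    """
    if not enforce_priority_check:
        # Relaxed mode: any held container may run any task.
        return True
    # Strict mode: never give a lower priority container to a higher priority task.
    return container_priority <= task_priority


# A priority-5 container and a priority-1 task:
print(can_use_container(5, 1))                                # strict mode
print(can_use_container(5, 1, enforce_priority_check=False))  # relaxed mode
```

The relaxed mode deliberately ignores the local resource conflict mentioned above, so it would only be safe when the Vertices / DAGs sharing a container do not specify conflicting LRs.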

> Tez job can hang due to container priority inversion
> ----------------------------------------------------
>
>                 Key: TEZ-3491
>                 URL: https://issues.apache.org/jira/browse/TEZ-3491
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.7.1
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Critical
>         Attachments: TEZ-3491.001.patch
>
>
> If the Tez AM receives containers at a lower priority than the highest priority task being requested, then it fails to assign the container to any task.  In addition, if the container is new, then it refuses to release it while there are any pending tasks.  If it takes too long for the higher priority requests to be fulfilled (e.g., the lower priority containers are filling the queue), then eventually YARN will expire the unused lower priority containers since they were never launched.  The Tez AM then never re-requests these lower priority containers, and the job hangs because the AM is waiting for containers from the RM that the RM already sent and expired.
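
The sequence described in the issue can be walked through with a small toy model. This is a sketch under stated assumptions, not the Tez scheduler: the class `ToyAM`, its method names, and the `rerequest_on_expiry` flag are hypothetical, a smaller number is assumed to mean a higher priority, and the flag only stands in for the general idea of asking again after expiry rather than for the attached patch.

```python
class ToyAM:
    """Toy model of the AM-side bookkeeping described in the issue (not Tez code)."""

    def __init__(self, rerequest_on_expiry):
        self.rerequest_on_expiry = rerequest_on_expiry
        self.waiting_tasks = {}  # priority -> tasks still waiting for a container
        self.open_asks = {}      # priority -> asks the RM has not answered yet
        self.held = []           # priorities of containers received but never launched

    def request(self, priority, count=1):
        self.waiting_tasks[priority] = self.waiting_tasks.get(priority, 0) + count
        self.open_asks[priority] = self.open_asks.get(priority, 0) + count

    def _top_waiting(self):
        # Assumption: a smaller number means a higher priority.
        live = [p for p, n in self.waiting_tasks.items() if n > 0]
        return min(live) if live else None

    def on_allocated(self, priority):
        """The RM delivers a container; return True if a task was launched in it."""
        self.open_asks[priority] -= 1  # from the RM's point of view the ask is satisfied
        top = self._top_waiting()
        if top is not None and priority > top:
            # Lower priority than the highest waiting task: held, not assigned, not released.
            self.held.append(priority)
            return False
        self.waiting_tasks[priority] -= 1
        return True

    def on_expired(self, priority):
        """YARN expires a container that was never launched."""
        self.held.remove(priority)
        if self.rerequest_on_expiry:
            self.open_asks[priority] += 1  # ask the RM again

    def is_stuck(self):
        # Stuck: a task is waiting, but no ask is open and no container is held for it.
        return any(
            n > 0 and self.open_asks.get(p, 0) == 0 and p not in self.held
            for p, n in self.waiting_tasks.items()
        )


def run_scenario(rerequest_on_expiry):
    am = ToyAM(rerequest_on_expiry)
    am.request(priority=1)  # higher priority task
    am.request(priority=2)  # lower priority task
    am.on_allocated(2)      # the lower priority container arrives first and is held
    am.on_expired(2)        # it is never launched, so it expires
    return am.is_stuck()


print(run_scenario(rerequest_on_expiry=False))
print(run_scenario(rerequest_on_expiry=True))
```

In this model the hang is the state where a task is still waiting at some priority while the AM has neither an open ask nor a held container for it, which matches the description of waiting for containers that the RM already sent and expired.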


