Posted to issues@tez.apache.org by "Jason Lowe (JIRA)" <ji...@apache.org> on 2017/12/14 23:23:00 UTC

[jira] [Commented] (TEZ-3770) DAG-aware YARN task scheduler

    [ https://issues.apache.org/jira/browse/TEZ-3770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16291781#comment-16291781 ] 

Jason Lowe commented on TEZ-3770:
---------------------------------

Thanks for the review, Sidd!  Apologies for the long delay, got sidetracked with other work.

bq. If I'm reading the code right, new containers which cannot be assigned immediately are released?

It's a problem with delayed containers.  This comment in the old scheduler sorta sums it up:
{code}
              // this container is of lower priority and given to us by the RM for
              // a task that will be matched after the current top priority. Keep 
              // this container for those pending tasks since the RM is not going
              // to give this container to us again
{code}
The old scheduler's holding onto lower priority containers that could not be assigned led to TEZ-3535.  This scheduler tries to avoid that issue.

bq. Not sure if priority is being considered while doing this. i.e. is it possible there's a pending higher priority request which has not yet been allocated to an idle container (primarily races in timing)? Think this is handled since an attempt is made to allocate a container the moment the task assigned to it is de-allocated.

Yes, we cover that case by going through the list of pending tasks immediately after releasing a container.
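
For illustration, a rough sketch of that flow -- class and method names here are made up for the example, not taken from the patch:
{code}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch: when a container is freed, immediately re-scan pending requests in
// priority order so a higher priority request that raced in is considered
// before the container is given back to the RM.
class PendingTaskScanSketch {
  // pending task ids keyed by priority (lower number = higher priority)
  private final NavigableMap<Integer, Deque<String>> pending = new TreeMap<>();
  private final Deque<String> idleContainers = new ArrayDeque<>();

  void addRequest(int priority, String taskId) {
    pending.computeIfAbsent(priority, p -> new ArrayDeque<>()).add(taskId);
  }

  // Called when a task completes and its container becomes free.
  void containerReleasedByTask(String containerId) {
    idleContainers.add(containerId);
    tryAssignPendingTasks();   // cover the race before anything else happens
  }

  private void tryAssignPendingTasks() {
    for (Map.Entry<Integer, Deque<String>> e : pending.entrySet()) {
      Deque<String> tasks = e.getValue();
      while (!tasks.isEmpty() && !idleContainers.isEmpty()) {
        System.out.println("assign " + tasks.poll() + " -> " + idleContainers.poll());
      }
    }
  }
}
{code}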

bq. This is broken for newly assigned containers?

It's not broken, IMHO, since again we don't want to hold onto lower priority containers.  This problem happens in practice when a lower priority vertex requests resources just before a higher priority vertex does.  Because requests are sent to the RM asynchronously, we can end up sending the requests for lower priority containers before the request for the higher priority ones.  It's a natural race in the DAG.  I chose to treat the race as, "if the RM thinks this is the highest priority request to allocate, then that priority won the race."  That's why the code ignores pending request priorities when new containers arrive.  The same is not true when deciding whether to reuse a container.
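
A minimal sketch of the two decision paths described above (illustrative names only, not the patch code):
{code}
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch: a *new* container is matched against the priority the RM allocated
// it for, without scanning for higher priority pending requests; a container
// being *reused* is handed out strictly in priority order.
class AssignPolicySketch {
  private final NavigableMap<Integer, Deque<String>> pending = new TreeMap<>();

  void addRequest(int priority, String taskId) {
    pending.computeIfAbsent(priority, p -> new ArrayDeque<>()).add(taskId);
  }

  String assignNewContainer(int allocatedPriority) {
    // "If the RM thinks this is the highest priority request to allocate
    // then that priority won the race."
    Deque<String> tasks = pending.get(allocatedPriority);
    return (tasks == null || tasks.isEmpty()) ? null : tasks.poll();
  }

  String assignReusedContainer() {
    // For reuse, honor priority: walk pending requests highest priority first.
    for (Deque<String> tasks : pending.values()) {
      if (!tasks.isEmpty()) {
        return tasks.poll();
      }
    }
    return null;
  }
}
{code}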

bq. TaskRequest oldRequest = requests.put -> Is it possible for old to not be null? A single request to allocate a single attempt.

This hardens the scheduler in case someone re-requests a task.  If we didn't account for it and it somehow happened in practice, the preemption logic and other metadata tracking the DAG could get out of sync.  I'm not sure it can happen in practice today; it's just some defensive programming.
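
Roughly, the defensive check looks like this -- a simplified sketch, not the actual patch code:
{code}
import java.util.HashMap;
import java.util.Map;

// Sketch: if a task is somehow re-requested, unwind the stale request's
// bookkeeping before recording the new one so DAG/priority metadata stays
// consistent.
class RequestTableSketch {
  static final class TaskRequest {
    final int priority;
    TaskRequest(int priority) { this.priority = priority; }
  }

  private final Map<String, TaskRequest> requests = new HashMap<>();
  private final Map<Integer, Integer> outstandingPerPriority = new HashMap<>();

  void addRequest(String taskId, TaskRequest request) {
    TaskRequest oldRequest = requests.put(taskId, request);
    if (oldRequest != null) {
      // Shouldn't happen (one request per attempt), but if it does, remove
      // the old request's footprint so the counts don't drift.
      outstandingPerPriority.merge(oldRequest.priority, -1, Integer::sum);
    }
    outstandingPerPriority.merge(request.priority, 1, Integer::sum);
  }
}
{code}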

bq. Would be nice to have some more documentation or an example of how this ends up working.

I'll update the patch to add more documentation on how the bitsets are used and how we end up using them to prevent scheduling of lower priority descendants when reusing containers.

bq. Does it rely on the way priorities are assigned?, the kind of topological sort? When reading this, it seems to block off a large chunk of requests at a lower priority.

Yes, this is intended to block off lower priority requests that are descendants of this vertex in the DAG.
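
As a rough sketch of the bitset idea (simplified, with illustrative names only):
{code}
import java.util.BitSet;

// Sketch: each vertex carries a BitSet of the vertices below it in the DAG.
// When reusing a container, a pending request for vertex v is skipped if any
// ancestor of v is still requesting containers.
class DescendantBlockSketch {
  private final BitSet[] descendants;      // descendants[v] = vertices below v
  private final boolean[] stillRequesting; // vertex still has pending requests

  DescendantBlockSketch(BitSet[] descendants, boolean[] stillRequesting) {
    this.descendants = descendants;
    this.stillRequesting = stillRequesting;
  }

  boolean isBlockedForReuse(int v) {
    for (int anc = 0; anc < descendants.length; anc++) {
      if (stillRequesting[anc] && descendants[anc].get(v)) {
        return true;   // an ancestor still wants containers, so block v
      }
    }
    return false;
  }
}
{code}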

bq. Different code paths for the allocation of a delayed container and when a new task request comes in. Assuming this is a result of attempting to not place a YARN request if a container can be assigned immediately? Not sure if more re-use is possible across the various assign methods.

Yes, we try to assign a requested task to an existing container immediately, before asking the AMRM layer for a new one, to reduce churn on the AMRM protocol.  I'll look into reuse opportunities when I put up the revised patch.
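
A simplified sketch of that ordering, with made-up names standing in for the real scheduler and AMRM client classes:
{code}
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch: try to satisfy a new task request from an already-held container
// first; only ask the AMRM layer for a new container if nothing is available.
class AssignBeforeRequestSketch {
  interface AmRmClient {
    void addContainerRequest(int priority);
  }

  private final Deque<String> idleContainers = new ArrayDeque<>();
  private final AmRmClient amRmClient;

  AssignBeforeRequestSketch(AmRmClient amRmClient) {
    this.amRmClient = amRmClient;
  }

  void containerIdle(String containerId) {
    idleContainers.add(containerId);
  }

  void taskRequested(String taskId, int priority) {
    String container = idleContainers.poll();
    if (container != null) {
      System.out.println("assign held container " + container + " to " + taskId);
    } else {
      amRmClient.addContainerRequest(priority);   // only now go to the RM
    }
  }
}
{code}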

bq. The default out of box behaviour will always generate different vertices at different priority levels at the moment. The old behaviour was to generate the same priority if distance from root was the same. Is moving back to the old behaviour an option, given descendant information is now known?

No, this is not possible because of the bug/limitation in YARN.  Before YARN-4789 there is no way to correlate a request to an allocation, so all allocations are lumped by priority.  When they are lumped, YARN only supports one resource request per priority.  So we can't go back to the old multiple-vertices-at-the-same-priority behavior in Tez until we can make different resource requests at the same priority in YARN.

bq. Didn't go into enough details to figure out if an attempt is made to run through an entire tree before moving over to an unrelated tree

Not sure what you mean by tree here -- disconnected parts of the DAG?  If so, the scheduler doesn't directly concern itself with whether there is just one tree or multiple trees.  It only cares which request priorities are "unblocked" (i.e.: they have no ancestors in the DAG at higher priority that are making requests) and tries to satisfy those, preempting active tasks that are descendants of the requesting tasks if necessary.  TEZ-394 is more along the lines of dealing with separate trees in the DAG, but that's more an effort of assigning priorities properly so the scheduler does the right thing with those priorities.
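
A rough sketch of the "unblocked"/preemption idea, again with illustrative names only:
{code}
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Sketch: a request is "unblocked" when no higher priority ancestor is still
// requesting; to satisfy it, running tasks in its descendant vertices are
// candidates for preemption.
class PreemptionSketch {
  static List<Integer> preemptionCandidates(BitSet descendantsOfRequester,
                                            List<Integer> runningTaskVertices) {
    List<Integer> victims = new ArrayList<>();
    for (int vertex : runningTaskVertices) {
      if (descendantsOfRequester.get(vertex)) {
        victims.add(vertex);   // running task belongs to a descendant vertex
      }
    }
    return victims;
  }
}
{code}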

bq. In tryAssignReuseContainer - if a container cannot be assigned immediately, will it be released?

If it exhausts the levels of retention (i.e.: trying to assign for node, then rack, then any) then yes, assuming the container doesn't then get held for idle session reuse.  The idea here is to be a good citizen on the cluster.  We don't want to hold onto resources we have no idea what to do with since there are often other jobs trying to do things.  I suppose we could make this decision based on headroom, but as you mentioned headroom is notoriously incorrect at times.  This might be best tackled in a followup JIRA.
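
A simplified sketch of the retention fallthrough (illustrative names, not the patch):
{code}
// Sketch: walk the retention levels in order; if no pending request matches at
// any level, release the container rather than hoarding it.
class ReuseFallthroughSketch {
  enum Level { NODE, RACK, ANY }

  interface Matcher {
    String match(String containerId, Level level);   // returns a task id or null
  }

  static String assignOrRelease(String containerId, Matcher matcher, Runnable release) {
    for (Level level : Level.values()) {
      String taskId = matcher.match(containerId, level);
      if (taskId != null) {
        return taskId;     // matched a pending request at this locality level
      }
    }
    release.run();         // be a good citizen: give the container back
    return null;
  }
}
{code}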

Per our offline discussion, it might be better to get this in and then consider reworking the scheduler API.

I'll work on improving the documentation and looking for better code reuse in the container reuse assignment paths.


> DAG-aware YARN task scheduler
> -----------------------------
>
>                 Key: TEZ-3770
>                 URL: https://issues.apache.org/jira/browse/TEZ-3770
>             Project: Apache Tez
>          Issue Type: New Feature
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>         Attachments: TEZ-3770.001.patch
>
>
> There are cases where priority alone does not convey the relationship between tasks, and this can cause problems when scheduling or preempting tasks.  If the YARN task scheduler was aware of the relationship between tasks then it could make smarter decisions when trying to assign tasks to containers or preempt running tasks to schedule pending tasks.


