Posted to yarn-issues@hadoop.apache.org by "Sandy Ryza (JIRA)" <ji...@apache.org> on 2013/08/01 01:33:50 UTC

[jira] [Commented] (YARN-972) Allow requests and scheduling for fractional virtual cores

    [ https://issues.apache.org/jira/browse/YARN-972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13725837#comment-13725837 ] 

Sandy Ryza commented on YARN-972:
---------------------------------

I probably should have tried to be clearer about what I think the goals of virtual cores are, from a more zoomed-out perspective, before arguing about their specifics.

The problems we have been considering solving with virtual cores are:
1. "Many of the jobs on my cluster are computational simulations that use many threads per task.  Many of the other jobs on my cluster are distcp's that are primarily I/O bound.  Many of the jobs on my cluster are MapReduce that do something like apply a transformation to text, which are single-threaded, but can saturate a core.  How can we schedule these to maximize utilization and minimize harmful interference?" 

2. "I recently added machines with more or beefier CPUs to my cluster.  I would like to run more concurrent tasks on these machines than on other machines."

3. "I recently added machines with more or beefier CPUs to my cluster.  I would like my jobs to run at predictable speeds."

4. "CPUs vary widely in the world, but I would like to be able to take my job to another cluster and have it run at a similar speed."

I think (1) is the main problem we should be trying to solve.  (2) is also important, and much easier to think about when the new machines have a higher number of cores, but not substantially more powerful cores.  Luckily, the trend is towards more cores per machine, not more powerful cores.  I think we should not be trying to solve (3) and (4). There are too many variables, the real-world utility is too small, and the goals are unrealistic. The features proposed in YARN-796 are better approaches to handling this.

To these ends, here is how I think resource configurations should be used:

A task should request virtual cores equal to the number of cores it thinks it can saturate.  A task that runs in a single thread, no matter how CPU-intensive it is, should request a single virtual core.  A task that is inherently I/O-bound, like a distcp or simple grep, should request less than a single virtual core.  A task that can take advantage of multiple threads should request a number of cores equal to the number of threads it intends to take advantage of.
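
To make the request side concrete, here is a minimal sketch against the YARN records API (the 1024 MB memory size and the helper method are my own illustrative assumptions, not part of any patch here).  Note that the vcores field is an integer today, which is exactly why this issue proposes fractional requests:

{code:java}
import org.apache.hadoop.yarn.api.records.Priority;
import org.apache.hadoop.yarn.api.records.Resource;
import org.apache.hadoop.yarn.api.records.ResourceRequest;

public class VcoreRequestExample {
  /**
   * Build a request for a task that can saturate the given number of
   * threads: 1 for a single-threaded CPU-bound task, N for an N-threaded
   * simulation.  (An I/O-bound distcp would want less than 1, which the
   * integer vcores field cannot express today.)
   */
  public static ResourceRequest requestFor(int threadsTaskCanSaturate) {
    Resource capability =
        Resource.newInstance(1024 /* MB, illustrative */, threadsTaskCanSaturate);
    return ResourceRequest.newInstance(
        Priority.newInstance(0),  // application-level priority
        ResourceRequest.ANY,      // no locality constraint
        capability,
        1);                       // number of containers
  }
}
{code}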

NodeManagers should be configured with virtual cores equal to the number of physical cores on the node.  If the speed of a single core varies widely within a cluster (maybe by a factor of two or more), an administrator can consider configuring more virtual cores than physical cores on the faster nodes, with the acknowledgement that task performance will still not be predictable.
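
As a sketch of the NodeManager side (assuming the yarn.nodemanager.resource.cpu-vcores key from the 2.x line; the node sizes and the doubling heuristic are hypothetical):

{code:java}
import org.apache.hadoop.yarn.conf.YarnConfiguration;

public class NodeVcoreConfig {
  /**
   * Default guideline: advertise one virtual core per physical core.  On a
   * node whose cores are roughly 2x the cluster norm, an admin might
   * advertise extra vcores, accepting less predictable task performance.
   */
  public static YarnConfiguration forNode(int physicalCores, boolean fastCores) {
    YarnConfiguration conf = new YarnConfiguration();
    int vcores = fastCores ? physicalCores * 2 : physicalCores;
    conf.setInt(YarnConfiguration.NM_VCORES, vcores);
    return conf;
  }
}
{code}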

Virtual cores should not be used as a proxy for other resources, such as disk I/O or network I/O.  We should ultimately add disk I/O, and possibly network I/O, as first-class resources, but in the meantime a config to limit the number of containers per node doesn't seem unreasonable.

As Arun points out, we can realize this vision equivalently by saying that one physical core is always equal to 1000 virtual cores.  However, to me this seems like an unnecessary layer of indirection for the user, and it obscures the fact that virtual cores are meant to model parallelism before processing power.  If our only reason for considering this is performance, we should and can handle this internally.  I am not obstinately opposed to going this route, but if we do, I think a name like "core thousandths" would be clearer.
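
For what the indirection would look like, here is a minimal sketch of integer "core thousandths" arithmetic (all names below are invented for illustration; nothing here exists in the scheduler):

{code:java}
public class CoreThousandths {
  static final int THOUSANDTHS_PER_CORE = 1000;

  /** Convert a fractional core request, e.g. 0.25, into integer thousandths. */
  static int toThousandths(double cores) {
    return (int) Math.round(cores * THOUSANDTHS_PER_CORE);
  }

  /** The scheduler's hot path stays integer-only: no floating point needed. */
  static boolean fits(int requestedThousandths, int availableThousandths) {
    return requestedThousandths <= availableThousandths;
  }

  public static void main(String[] args) {
    int node = 8 * THOUSANDTHS_PER_CORE;        // an 8-core node
    int distcpTask = toThousandths(0.25);       // I/O-bound task: quarter core
    System.out.println(fits(distcpTask, node)); // true
  }
}
{code}

This also illustrates the point in the description below about sidestepping floating point by scaling CPU quantities by 1000.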

Thoughts?
                
> Allow requests and scheduling for fractional virtual cores
> ----------------------------------------------------------
>
>                 Key: YARN-972
>                 URL: https://issues.apache.org/jira/browse/YARN-972
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: api, scheduler
>    Affects Versions: 2.0.5-alpha
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>
> As this idea sparked a fair amount of discussion on YARN-2, I'd like to go deeper into the reasoning.
> Currently the virtual core abstraction hides two orthogonal goals.  The first is that a cluster might have heterogeneous hardware and that the processing power of different makes of cores can vary wildly.  The second is that different (combinations of) workloads can require different levels of granularity.  E.g. one admin might want every task on their cluster to use at least a core, while another might want applications to be able to request quarters of cores.  The former would configure a single vcore per core.  The latter would configure four vcores per core.
> I don't think that the abstraction is a good way of handling the second goal.  Having virtual cores refer to different magnitudes of processing power on different clusters will make the difficult problem of deciding how many cores to request for a job even more confusing.
> Can we not handle this with dynamic oversubscription?
> Dynamic oversubscription, i.e. adjusting the number of cores offered by a machine based on measured CPU-consumption, should work as a complement to fine-granularity scheduling.  Dynamic oversubscription is never going to be perfect, as the amount of CPU a process consumes can vary widely over its lifetime.  A task that first loads a bunch of data over the network and then performs complex computations on it will suffer if additional CPU-heavy tasks are scheduled on the same node because its initial CPU-utilization was low.  To guard against this, we will need to be conservative with how we dynamically oversubscribe.  If a user wants to explicitly hint to the scheduler that their task will not use much CPU, the scheduler should be able to take this into account.  (A rough sketch of such a conservative policy follows this description.)
> On YARN-2, there are concerns that including floating point arithmetic in the scheduler will slow it down.  I question this assumption, and it is perhaps worth debating, but I think we can sidestep the issue by multiplying CPU-quantities inside the scheduler by a decently sized number like 1000 and keeping the computations on integers.
> The relevant APIs are marked as evolving, so there's no need for the change to delay 2.1.0-beta.
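
To illustrate the conservative dynamic-oversubscription policy argued for in the quoted description, here is a hypothetical sketch (every name and threshold below is invented for illustration; YARN has no such component):

{code:java}
public class ConservativeOversubscriber {
  private final int configuredVcores;
  private int consecutiveIdleSamples = 0;

  ConservativeOversubscriber(int configuredVcores) {
    this.configuredVcores = configuredVcores;
  }

  /**
   * Called periodically with measured node CPU utilization in [0.0, 1.0].
   * Only advertises extra vcores after a sustained idle window, and never
   * more than 25% extra, because a task's CPU use can spike later in its
   * lifetime (e.g. after a network-bound loading phase).
   */
  int advertisedVcores(double measuredUtilization) {
    if (measuredUtilization < 0.5) {
      consecutiveIdleSamples++;
    } else {
      consecutiveIdleSamples = 0; // any busy sample resets the window
    }
    return consecutiveIdleSamples >= 30
        ? configuredVcores + configuredVcores / 4
        : configuredVcores;
  }
}
{code}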

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira