Posted to commits@aurora.apache.org by se...@apache.org on 2016/09/03 22:03:44 UTC
aurora git commit: Extend the resource isolation and oversubscription
documentation
Repository: aurora
Updated Branches:
refs/heads/master e32f4fbd1 -> 18533141c
Extend the resource isolation and oversubscription documentation
I had to answer a couple of questions regarding these over the recent weeks and thought it might make sense to update the docs accordingly.
Reviewed at https://reviews.apache.org/r/51602/
Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/18533141
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/18533141
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/18533141
Branch: refs/heads/master
Commit: 18533141cdf7d4a1e0ce016073274f169548f354
Parents: e32f4fb
Author: Stephan Erb <se...@apache.org>
Authored: Sun Sep 4 00:02:44 2016 +0200
Committer: Stephan Erb <se...@apache.org>
Committed: Sun Sep 4 00:02:44 2016 +0200
----------------------------------------------------------------------
docs/features/multitenancy.md | 1 +
docs/features/resource-isolation.md | 54 +++++++++++++++++---------------
docs/operations/configuration.md | 45 +++++++++++++++++++++-----
3 files changed, 67 insertions(+), 33 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/aurora/blob/18533141/docs/features/multitenancy.md
----------------------------------------------------------------------
diff --git a/docs/features/multitenancy.md b/docs/features/multitenancy.md
index cb45beb..301170d 100644
--- a/docs/features/multitenancy.md
+++ b/docs/features/multitenancy.md
@@ -40,6 +40,7 @@ Configuration Tiers
Tier is a predefined bundle of task configuration options. Aurora schedules tasks and assigns them
resources based on their tier assignment. The default scheduler tier configuration allows for
3 tiers:
+
- `revocable`: The `revocable` tier requires the task to run with [revocable](resource-isolation.md#oversubscription)
resources.
- `preemptible`: Setting the task's tier to `preemptible` allows for the possibility of that task
http://git-wip-us.apache.org/repos/asf/aurora/blob/18533141/docs/features/resource-isolation.md
----------------------------------------------------------------------
diff --git a/docs/features/resource-isolation.md b/docs/features/resource-isolation.md
index 59da823..01c5b40 100644
--- a/docs/features/resource-isolation.md
+++ b/docs/features/resource-isolation.md
@@ -1,6 +1,9 @@
Resources Isolation and Sizing
==============================
+This document assumes Aurora and Mesos have been configured
+using our [recommended resource isolation settings](../operations/configuration.md#resource-isolation).
+
- [Isolation](#isolation)
- [Sizing](#sizing)
- [Oversubscription](#oversubscription)
@@ -11,11 +14,13 @@ Isolation
Aurora is a multi-tenant system; a single software instance runs on a
server, serving multiple clients/tenants. To share resources among
-tenants, it implements isolation of:
+tenants, it leverages Mesos for isolation of:
* CPU
+* GPU
* memory
* disk space
+* ports
CPU is a soft limit, and handled differently from memory and disk space.
Too low a CPU value results in throttling your application and
@@ -24,10 +29,10 @@ application goes over these values, it's killed.
### CPU Isolation
-Mesos uses a quota based CPU scheduler (the *Completely Fair Scheduler*)
-to provide consistent and predictable performance. This is effectively
-a guarantee of resources -- you receive at least what you requested, but
-also no more than you've requested.
+Mesos can be configured to use a quota based CPU scheduler (the *Completely*
+*Fair Scheduler*) to provide consistent and predictable performance.
+This is effectively a guarantee of resources -- you receive at least what
+you requested, but also no more than you've requested.
The scheduler gives applications a CPU quota for every 100 ms interval.
When an application uses its quota for an interval, it is throttled for
@@ -103,11 +108,11 @@ will be killed shortly after. This is subject to change.
### GPU Isolation
-GPU isolation will be supported for Nvidia devices starting from Mesos 0.29.0.
+GPU isolation will be supported for Nvidia devices starting from Mesos 1.0.
Access to the allocated units will be exclusive with no sharing between tasks
-allowed (e.g. no fractional GPU allocation). Until official documentation is released,
-see [Mesos design document](https://docs.google.com/document/d/10GJ1A80x4nIEo8kfdeo9B11PIbS1xJrrB4Z373Ifkpo/edit#heading=h.w84lz7p4eexl)
-for more details.
+allowed (e.g. no fractional GPU allocation). For more details, see the
+[Mesos design document](https://docs.google.com/document/d/10GJ1A80x4nIEo8kfdeo9B11PIbS1xJrrB4Z373Ifkpo/edit#heading=h.w84lz7p4eexl)
+and the [Mesos agent configuration](http://mesos.apache.org/documentation/latest/configuration/).
### Other Resources
@@ -154,26 +159,23 @@ into the application's sandbox space.
GPU is highly dependent on your application requirements and is only limited
by the number of physical GPU units available on a target box.
+
Oversubscription
----------------
-**WARNING**: This feature is currently in alpha status. Do not use it in production clusters!
-
-Mesos [supports a concept of revocable tasks](http://mesos.apache.org/documentation/latest/oversubscription/)
-by oversubscribing machine resources by the amount deemed safe to not affect the existing
-non-revocable tasks. Aurora now supports revocable jobs via a `tier` setting set to `revocable`
-value.
-
-The Aurora scheduler must be configured to receive revocable offers from Mesos and accept revocable
-jobs. If not configured properly revocable tasks will never get assigned to hosts and will stay in
-`PENDING`. Set these scheduler flag to allow receiving revocable Mesos offers:
-
- -receive_revocable_resources=true
-
-Specify a tier configuration file path (unless you want to use the [default](../../src/main/resources/org/apache/aurora/scheduler/tiers.json)):
+Mesos supports [oversubscription of machine resources](http://mesos.apache.org/documentation/latest/oversubscription/)
+via the concept of revocable tasks. In contrast to non-revocable tasks, revocable tasks are best-effort.
+Mesos reserves the right to throttle or even kill them if they might affect existing high-priority
+user-facing services.
- -tier_config=path/to/tiers/config.json
+As of today, the only revocable resources supported by Aurora are CPU resources. A job can opt in to
+use those by specifying the `revocable` [Configuration Tier](../features/multitenancy.md#configuration-tiers).
+A revocable job will only be scheduled using revocable CPU resources, even if there are plenty of
+non-revocable resources available.
+The Aurora scheduler must be [configured to receive revocable offers](../operations/configuration.md#resource-isolation)
+from Mesos and accept revocable jobs. If not configured properly revocable tasks will never get
+assigned to hosts and will stay in `PENDING`.
-See the [Configuration Reference](../reference/configuration.md) for details on how to mark a job
-as being revocable.
+For details on how to mark a job as being revocable, see the
+[Configuration Reference](../reference/configuration.md).
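
As an illustration of the CPU isolation text in the diff above (a CPU quota for every 100 ms interval, with throttling once the quota is used up), here is a minimal Python sketch. It assumes the quota scales linearly with the requested CPU value, so a request of `cpus` yields `cpus * 100` ms of CPU time per interval. The function names are hypothetical and belong to neither Aurora nor Mesos.

```python
# Illustrative model of CFS bandwidth limiting as described in the docs above.
# Assumption: quota per period = requested CPUs * period length.

CFS_PERIOD_MS = 100.0  # interval length stated in the documentation


def cfs_quota_ms(cpus: float, period_ms: float = CFS_PERIOD_MS) -> float:
    """CPU time (ms) a task may consume within one period."""
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return cpus * period_ms


def throttled_ms(cpus: float, demand_ms: float,
                 period_ms: float = CFS_PERIOD_MS) -> float:
    """Part of the demanded CPU time (ms) in one period that exceeds the quota."""
    return max(0.0, demand_ms - cfs_quota_ms(cpus, period_ms))


if __name__ == "__main__":
    # A task requesting 0.5 CPU that wants 80 ms of CPU time in one interval.
    print(cfs_quota_ms(0.5))        # quota available in the interval
    print(throttled_ms(0.5, 80.0))  # demand that has to wait for later intervals
```

Under this model, a task that regularly shows a non-zero throttled amount is a candidate for a larger CPU request, which matches the sizing advice in the documentation.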
http://git-wip-us.apache.org/repos/asf/aurora/blob/18533141/docs/operations/configuration.md
----------------------------------------------------------------------
diff --git a/docs/operations/configuration.md b/docs/operations/configuration.md
index 350ea77..85787b0 100644
--- a/docs/operations/configuration.md
+++ b/docs/operations/configuration.md
@@ -90,17 +90,48 @@ or truncating of the replicated log used by Aurora. In that case, see the docume
Configuration options for the Aurora scheduler backup manager.
-### `-backup_interval`
-The interval on which the scheduler writes local storage backups. The default is every hour.
+* `-backup_interval`: The interval on which the scheduler writes local storage backups. The default is every hour.
+* `-backup_dir`: Directory to write backups to.
+* `-max_saved_backups`: Maximum number of backups to retain before deleting the oldest backup(s).
-### `-backup_dir`
-Directory to write backups to.
-### `-max_saved_backups`
-Maximum number of backups to retain before deleting the oldest backup(s).
+## Resource Isolation
+For proper CPU, memory, and disk isolation as mentioned in our [end-user documentation](../features/resource-isolation.md),
+we recommend adding the following isolators to the `--isolation` flag of the Mesos agent:
-## Process Logs
+* `cgroups/cpu`
+* `cgroups/mem`
+* `disk/du`
+
+In addition, we recommend setting the following [agent flags](http://mesos.apache.org/documentation/latest/configuration/):
+
+* `--cgroups_limit_swap` to enable memory limits on both memory and swap instead of just memory.
+ Alternatively, you could disable swap on your agent hosts.
+* `--cgroups_enable_cfs` to enable hard limits on CPU resources via the CFS bandwidth limiting
+ feature.
+* `--enforce_container_disk_quota` to enable disk quota enforcement for containers.
+
+To enable the optional GPU support in Mesos, please see the GPU related flags in the
+[Mesos configuration](http://mesos.apache.org/documentation/latest/configuration/).
+To enable the corresponding feature in Aurora, you have to start the scheduler with the
+flag
+
+ -allow_gpu_resource=true
+
+If you want to use revocable resources, first follow the
+[Mesos oversubscription documentation](http://mesos.apache.org/documentation/latest/oversubscription/)
+and then set this Aurora scheduler flag to allow receiving revocable Mesos offers:
+
+ -receive_revocable_resources=true
+
+Unless you want to use the [default](../../src/main/resources/org/apache/aurora/scheduler/tiers.json)
+tier configuration, you will also have to specify a file path:
+
+ -tier_config=path/to/tiers/config.json
+
+
+## Thermos Process Logs
### Log destination
By default, Thermos will write process stdout/stderr to log files in the sandbox. Process object