Posted to commits@aurora.apache.org by se...@apache.org on 2016/09/03 22:03:44 UTC

aurora git commit: Extend the resource isolation and oversubscription documentation

Repository: aurora
Updated Branches:
  refs/heads/master e32f4fbd1 -> 18533141c


Extend the resource isolation and oversubscription documentation

I had to answer a couple of questions regarding these over the recent weeks and thought it might make sense to update the docs accordingly.

Reviewed at https://reviews.apache.org/r/51602/


Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/18533141
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/18533141
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/18533141

Branch: refs/heads/master
Commit: 18533141cdf7d4a1e0ce016073274f169548f354
Parents: e32f4fb
Author: Stephan Erb <se...@apache.org>
Authored: Sun Sep 4 00:02:44 2016 +0200
Committer: Stephan Erb <se...@apache.org>
Committed: Sun Sep 4 00:02:44 2016 +0200

----------------------------------------------------------------------
 docs/features/multitenancy.md       |  1 +
 docs/features/resource-isolation.md | 54 +++++++++++++++++---------------
 docs/operations/configuration.md    | 45 +++++++++++++++++++++-----
 3 files changed, 67 insertions(+), 33 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/aurora/blob/18533141/docs/features/multitenancy.md
----------------------------------------------------------------------
diff --git a/docs/features/multitenancy.md b/docs/features/multitenancy.md
index cb45beb..301170d 100644
--- a/docs/features/multitenancy.md
+++ b/docs/features/multitenancy.md
@@ -40,6 +40,7 @@ Configuration Tiers
 Tier is a predefined bundle of task configuration options. Aurora schedules tasks and assigns them
 resources based on their tier assignment. The default scheduler tier configuration allows for
 3 tiers:
+
  - `revocable`: The `revocable` tier requires the task to run with [revocable](resource-isolation.md#oversubscription)
  resources.
  - `preemptible`: Setting the task’s tier to `preemptible` allows for the possibility of that task

http://git-wip-us.apache.org/repos/asf/aurora/blob/18533141/docs/features/resource-isolation.md
----------------------------------------------------------------------
diff --git a/docs/features/resource-isolation.md b/docs/features/resource-isolation.md
index 59da823..01c5b40 100644
--- a/docs/features/resource-isolation.md
+++ b/docs/features/resource-isolation.md
@@ -1,6 +1,9 @@
 Resources Isolation and Sizing
 ==============================
 
+This document assumes Aurora and Mesos have been configured
+using our [recommended resource isolation settings](../operations/configuration.md#resource-isolation).
+
 - [Isolation](#isolation)
 - [Sizing](#sizing)
 - [Oversubscription](#oversubscription)
@@ -11,11 +14,13 @@ Isolation
 
 Aurora is a multi-tenant system; a single software instance runs on a
 server, serving multiple clients/tenants. To share resources among
-tenants, it implements isolation of:
+tenants, it leverages Mesos for isolation of:
 
 * CPU
+* GPU
 * memory
 * disk space
+* ports
 
 CPU is a soft limit, and handled differently from memory and disk space.
 Too low a CPU value results in throttling your application and
@@ -24,10 +29,10 @@ application goes over these values, it's killed.
 
 ### CPU Isolation
 
-Mesos uses a quota based CPU scheduler (the *Completely Fair Scheduler*)
-to provide consistent and predictable performance.  This is effectively
-a guarantee of resources -- you receive at least what you requested, but
-also no more than you've requested.
+Mesos can be configured to use a quota based CPU scheduler (the *Completely*
+*Fair Scheduler*) to provide consistent and predictable performance.
+This is effectively a guarantee of resources -- you receive at least what
+you requested, but also no more than you've requested.
 
 The scheduler gives applications a CPU quota for every 100 ms interval.
 When an application uses its quota for an interval, it is throttled for
@@ -103,11 +108,11 @@ will be killed shortly after. This is subject to change.
 
 ### GPU Isolation
 
-GPU isolation will be supported for Nvidia devices starting from Mesos 0.29.0.
+GPU isolation will be supported for Nvidia devices starting from Mesos 1.0.
 Access to the allocated units will be exclusive with no sharing between tasks
-allowed (e.g. no fractional GPU allocation). Until official documentation is released,
-see [Mesos design document](https://docs.google.com/document/d/10GJ1A80x4nIEo8kfdeo9B11PIbS1xJrrB4Z373Ifkpo/edit#heading=h.w84lz7p4eexl)
-for more details.
+allowed (e.g. no fractional GPU allocation). For more details, see the
+[Mesos design document](https://docs.google.com/document/d/10GJ1A80x4nIEo8kfdeo9B11PIbS1xJrrB4Z373Ifkpo/edit#heading=h.w84lz7p4eexl)
+and the [Mesos agent configuration](http://mesos.apache.org/documentation/latest/configuration/).
 
 ### Other Resources
 
@@ -154,26 +159,23 @@ into the application's sandbox space.
 GPU is highly dependent on your application requirements and is only limited
 by the number of physical GPU units available on a target box.
 
+
 Oversubscription
 ----------------
 
-**WARNING**: This feature is currently in alpha status. Do not use it in production clusters!
-
-Mesos [supports a concept of revocable tasks](http://mesos.apache.org/documentation/latest/oversubscription/)
-by oversubscribing machine resources by the amount deemed safe to not affect the existing
-non-revocable tasks. Aurora now supports revocable jobs via a `tier` setting set to `revocable`
-value.
-
-The Aurora scheduler must be configured to receive revocable offers from Mesos and accept revocable
-jobs. If not configured properly revocable tasks will never get assigned to hosts and will stay in
-`PENDING`. Set these scheduler flag to allow receiving revocable Mesos offers:
-
-    -receive_revocable_resources=true
-
-Specify a tier configuration file path (unless you want to use the [default](../../src/main/resources/org/apache/aurora/scheduler/tiers.json)):
+Mesos supports [oversubscription of machine resources](http://mesos.apache.org/documentation/latest/oversubscription/)
+via the concept of revocable tasks. In contrast to non-revocable tasks, revocable tasks are best-effort.
+Mesos reserves the right to throttle or even kill them if they might affect existing high-priority
+user-facing services.
 
-    -tier_config=path/to/tiers/config.json
+As of today, the only revocable resources supported by Aurora are CPU resources. A job can opt in to
+use those by specifying the `revocable` [Configuration Tier](../features/multitenancy.md#configuration-tiers).
+A revocable job will only be scheduled using revocable CPU resources, even if there are plenty of
+non-revocable resources available.
 
+The Aurora scheduler must be [configured to receive revocable offers](../operations/configuration.md#resource-isolation)
+from Mesos and accept revocable jobs. If not configured properly revocable tasks will never get
+assigned to hosts and will stay in `PENDING`.
 
-See the [Configuration Reference](../reference/configuration.md) for details on how to mark a job
-as being revocable.
+For details on how to mark a job as being revocable, see the
+[Configuration Reference](../reference/configuration.md).
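
Note on the CPU Isolation hunk above: the documentation describes a CPU quota granted per 100 ms interval, after which the application is throttled. The arithmetic can be sketched as follows. This is a minimal illustration, not Mesos or Aurora code: the function names `cfs_quota_us` and `deferred_us` are hypothetical, and the sketch assumes the quota scales linearly with the requested CPUs, as with the Linux cgroup settings `cpu.cfs_period_us` and `cpu.cfs_quota_us`. Actual enforcement happens in the kernel when the agent runs with `--cgroups_enable_cfs`.

```python
# Illustrative only: models the per-interval CPU quota described in the docs.
CFS_PERIOD_US = 100_000  # the 100 ms interval, in microseconds


def cfs_quota_us(cpus: float, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time (in microseconds) a task may consume per period.

    Assumes the quota is simply the requested CPUs times the period length.
    """
    if cpus <= 0:
        raise ValueError("cpus must be positive")
    return int(round(cpus * period_us))


def deferred_us(demand_us: int, cpus: float, period_us: int = CFS_PERIOD_US) -> int:
    """CPU time that does not fit into the current period's quota.

    Once the quota is used up, the task is throttled until the next period,
    so this much work has to wait.
    """
    return max(0, demand_us - cfs_quota_us(cpus, period_us))
```

Under these assumptions, a task requesting 0.5 CPUs may run for 50 ms in each 100 ms interval; a burst that needs 80 ms of CPU time in one interval has 30 ms of work pushed into later intervals.
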

http://git-wip-us.apache.org/repos/asf/aurora/blob/18533141/docs/operations/configuration.md
----------------------------------------------------------------------
diff --git a/docs/operations/configuration.md b/docs/operations/configuration.md
index 350ea77..85787b0 100644
--- a/docs/operations/configuration.md
+++ b/docs/operations/configuration.md
@@ -90,17 +90,48 @@ or truncating of the replicated log used by Aurora. In that case, see the docume
 
 Configuration options for the Aurora scheduler backup manager.
 
-### `-backup_interval`
-The interval on which the scheduler writes local storage backups.  The default is every hour.
+* `-backup_interval`: The interval on which the scheduler writes local storage backups. The default is every hour.
+* `-backup_dir`: Directory to write backups to.
+* `-max_saved_backups`: Maximum number of backups to retain before deleting the oldest backup(s).
 
-### `-backup_dir`
-Directory to write backups to.
 
-### `-max_saved_backups`
-Maximum number of backups to retain before deleting the oldest backup(s).
+## Resource Isolation
 
+For proper CPU, memory, and disk isolation as mentioned in our [end-user documentation](../features/resource-isolation.md),
+we recommend adding the following isolators to the `--isolation` flag of the Mesos agent:
 
-## Process Logs
+* `cgroups/cpu`
+* `cgroups/mem`
+* `disk/du`
+
+In addition, we recommend setting the following [agent flags](http://mesos.apache.org/documentation/latest/configuration/):
+
+* `--cgroups_limit_swap` to enable memory limits on both memory and swap instead of just memory.
+  Alternatively, you could disable swap on your agent hosts.
+* `--cgroups_enable_cfs` to enable hard limits on CPU resources via the CFS bandwidth limiting
+  feature.
+* `--enforce_container_disk_quota` to enable disk quota enforcement for containers.
+
+To enable the optional GPU support in Mesos, please see the GPU related flags in the
+[Mesos configuration](http://mesos.apache.org/documentation/latest/configuration/).
+To enable the corresponding feature in Aurora, you have to start the scheduler with the
+flag
+
+    -allow_gpu_resource=true
+
+If you want to use revocable resources, first follow the
+[Mesos oversubscription documentation](http://mesos.apache.org/documentation/latest/oversubscription/)
+and then set this Aurora scheduler flag to allow receiving revocable Mesos offers:
+
+    -receive_revocable_resources=true
+
+Unless you want to use the [default](../../src/main/resources/org/apache/aurora/scheduler/tiers.json)
+tier configuration, you will also have to specify a file path:
+
+    -tier_config=path/to/tiers/config.json
+
+
+## Thermos Process Logs
 
 ### Log destination
 By default, Thermos will write process stdout/stderr to log files in the sandbox. Process object