Posted to commits@aurora.apache.org by re...@apache.org on 2018/09/11 05:25:50 UTC

svn commit: r1840514 [11/11] - in /aurora/site: data/ publish/ publish/blog/ publish/community/ publish/documentation/0.10.0/ publish/documentation/0.10.0/build-system/ publish/documentation/0.10.0/client-cluster-configuration/ publish/documentation/0....

Modified: aurora/site/source/community.html.md.erb
URL: http://svn.apache.org/viewvc/aurora/site/source/community.html.md.erb?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/community.html.md.erb (original)
+++ aurora/site/source/community.html.md.erb Tue Sep 11 05:25:44 2018
@@ -4,8 +4,10 @@
   <div class="col-md-4">
     <h3>Contributing</h3>
     <h4 name="reportbugs">Report or track a bug</h4>
-    <p>Bugs can be reported on our <a href="http://issues.apache.org/jira/browse/AURORA">JIRA</a>.
-       In order to create a new issue, you'll need register for an account.</p>
+    <p>Bugs can be reported on our <a href="http://issues.apache.org/jira/browse/AURORA">JIRA</a>
+       or raised as an issue on our <a href="https://github.com/apache/aurora/issues">GitHub</a> repository.</p>
+    <p>In order to create a new issue on JIRA, you'll need to register for an account. A GitHub account is
+       required to raise issues on our repository.</p>
 
     <h4 name="contribute">Submit a patch</h4>
     <p>Please read our <a href="/documentation/latest/contributing/">contribution guide</a>
@@ -18,9 +20,6 @@
        <code>mesos.slack.com</code></a>.</p>
     <p>To request an invite for slack please click <a href="https://mesos-slackin.herokuapp.com/">here</a>.</p>
     <p>All slack communication is publicly archived <a href="http://mesos.slackarchive.io/aurora/">here</a>.</p>
-    <h4 name="ircchannel">IRC</h4>
-    <p>There is also a two way mirror between Slack and IRC via the #aurora channel on <code>irc.freenode.net</code>.
-       If you're new to IRC, we suggest trying a <a href="http://webchat.freenode.net/?channels=#aurora">web-based client</a>.</p>
   </div>
   <div class="col-md-4">
     <h3>Mailing lists</h3>

Modified: aurora/site/source/documentation/latest/additional-resources/tools.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/additional-resources/tools.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/additional-resources/tools.md (original)
+++ aurora/site/source/documentation/latest/additional-resources/tools.md Tue Sep 11 05:25:44 2018
@@ -21,4 +21,4 @@ Various tools integrate with Aurora. Is
   - [aurora-packaging](https://github.com/apache/aurora-packaging), the source of the official Aurora packages
 
 * Thrift Clients:
-  - [gorealis](https://github.com/rdelval/gorealis) for communicating with the scheduler using Go
+  - [gorealis](https://github.com/paypal/gorealis) for communicating with the scheduler using Go

Modified: aurora/site/source/documentation/latest/contributing.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/contributing.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/contributing.md (original)
+++ aurora/site/source/documentation/latest/contributing.md Tue Sep 11 05:25:44 2018
@@ -2,7 +2,7 @@
 
 First things first, you'll need the source! The Aurora source is available from Apache git:
 
-    git clone https://git-wip-us.apache.org/repos/asf/aurora
+    git clone https://gitbox.apache.org/repos/asf/aurora
 
 Read the Style Guides
 ---------------------
@@ -36,8 +36,8 @@ Post a review with `rbt`, fill out the f
 
     ./rbt post -o
 
-If you're unsure about who to add as a reviewer, you can default to adding Zameer Manji (zmanji) and
-Joshua Cohen (jcohen). They will take care of finding an appropriate reviewer for the patch.
+If you're unsure about who to add as a reviewer, you can default to adding Stephan Erb (StephanErb) and
+Renan DelValle (rdelvalle). They will take care of finding an appropriate reviewer for the patch.
 
 Once you've done this, you probably want to mark the associated Jira issue as Reviewable.
 

Modified: aurora/site/source/documentation/latest/development/committers-guide.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/development/committers-guide.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/development/committers-guide.md (original)
+++ aurora/site/source/documentation/latest/development/committers-guide.md Tue Sep 11 05:25:44 2018
@@ -29,7 +29,7 @@ and that key will need to be added to ou
 
 2. Add your gpg key to the Apache Aurora KEYS file:
 
-               git clone https://git-wip-us.apache.org/repos/asf/aurora.git
+               git clone https://gitbox.apache.org/repos/asf/aurora
                (gpg --list-sigs <KEY ID> && gpg --armor --export <KEY ID>) >> KEYS
                git add KEYS && git commit -m "Adding gpg key for <APACHE ID>"
                ./rbt post -o -g

Modified: aurora/site/source/documentation/latest/development/db-migration.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/development/db-migration.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/development/db-migration.md (original)
+++ aurora/site/source/documentation/latest/development/db-migration.md Tue Sep 11 05:25:44 2018
@@ -14,7 +14,7 @@ When adding or altering tables or changi
 [schema.sql](../../src/main/resources/org/apache/aurora/scheduler/storage/db/schema.sql), a new
 migration class should be created under the org.apache.aurora.scheduler.storage.db.migration
 package. The class should implement the [MigrationScript](https://github.com/mybatis/migrations/blob/master/src/main/java/org/apache/ibatis/migration/MigrationScript.java)
-interface (see [V001_TestMigration](https://github.com/apache/aurora/blob/rel/0.20.0/src/test/java/org/apache/aurora/scheduler/storage/db/testmigration/V001_TestMigration.java)
+interface (see [V001_TestMigration](https://github.com/apache/aurora/blob/master/src/test/java/org/apache/aurora/scheduler/storage/db/testmigration/V001_TestMigration.java)
 as an example). The upgrade and downgrade scripts are defined in this class. When restoring a
 snapshot the list of migrations on the classpath is compared to the list of applied changes in the
 DB. Any changes that have not yet been applied are executed and their downgrade script is stored

Modified: aurora/site/source/documentation/latest/development/thrift.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/development/thrift.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/development/thrift.md (original)
+++ aurora/site/source/documentation/latest/development/thrift.md Tue Sep 11 05:25:44 2018
@@ -6,7 +6,7 @@ client/server RPC protocol as well as fo
 correctly handling additions and renames of the existing members, field removals must be done
 carefully to ensure backwards compatibility and provide predictable deprecation cycle. This
 document describes general guidelines for making Thrift schema changes to the existing fields in
-[api.thrift](https://github.com/apache/aurora/blob/rel/0.20.0/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
+[api.thrift](https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
 
 It is highly recommended to go through the
 [Thrift: The Missing Guide](http://diwakergupta.github.io/thrift-missing-guide/) first to refresh on
@@ -33,7 +33,7 @@ communicate with scheduler/client from v
 * Add a new field as an eventual replacement of the old one and implement a dual read/write
 anywhere the old field is used. If a thrift struct is mapped in the DB store make sure both columns
 are marked as `NOT NULL`
-* Check [storage.thrift](https://github.com/apache/aurora/blob/rel/0.20.0/api/src/main/thrift/org/apache/aurora/gen/storage.thrift) to see if
+* Check [storage.thrift](https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/storage.thrift) to see if
 the affected struct is stored in Aurora scheduler storage. If so, it's almost certainly also
 necessary to perform a [DB migration](../db-migration/).
 * Add a deprecation jira ticket into the vCurrent+1 release candidate

Modified: aurora/site/source/documentation/latest/features/job-updates.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/features/job-updates.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/features/job-updates.md (original)
+++ aurora/site/source/documentation/latest/features/job-updates.md Tue Sep 11 05:25:44 2018
@@ -70,7 +70,7 @@ acknowledging ("heartbeating") job updat
 service updates where explicit job health monitoring is vital during the entire job update
 lifecycle. Such job updates would rely on an external service (or a custom client) periodically
 pulsing an active coordinated job update via a
-[pulseJobUpdate RPC](https://github.com/apache/aurora/blob/rel/0.20.0/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
+[pulseJobUpdate RPC](https://github.com/apache/aurora/blob/master/api/src/main/thrift/org/apache/aurora/gen/api.thrift).
 
 A coordinated update is defined by setting a positive
 [pulse_interval_secs](../../reference/configuration/#updateconfig-objects) value in job configuration
@@ -84,6 +84,19 @@ progress until the first pulse arrives.
 provided the pulse interval has not expired.
 
 
+SLA-Aware Updates
+-----------------
+
+Updates can take advantage of [Custom SLA Requirements](../../features/sla-requirements/) and
+specify the `sla_aware=True` option within
+[UpdateConfig](../../reference/configuration/#updateconfig-objects) to only update instances if
+the action will maintain the task's SLA requirements. This feature allows updates to avoid killing
+too many instances in the face of unexpected failures outside of the update range.
+
+See the [Using the `sla_aware` option](../../reference/configuration/#using-the-sla-aware-option) section
+for more information on how to use this feature.
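+
+As a rough sketch, the relevant fields of a `Job`/`Service` definition might look like the
+following (values are illustrative placeholders, not recommendations):
+
+    update_config = UpdateConfig(
+      batch_size = 2,
+      sla_aware = True   # only update an instance if doing so maintains the job's SLA requirements
+    ),
+    sla_policy = PercentageSlaPolicy(
+      percentage = 95,      # at least 95% of instances must stay active
+      duration_secs = 1800  # over a 30 minute window
+    )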
+
+
 Canary Deployments
 ------------------
 

Modified: aurora/site/source/documentation/latest/features/sla-metrics.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/features/sla-metrics.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/features/sla-metrics.md (original)
+++ aurora/site/source/documentation/latest/features/sla-metrics.md Tue Sep 11 05:25:44 2018
@@ -63,7 +63,7 @@ relevant to uptime calculations. By appl
 transition records, we can build a deterministic downtime trace for every given service instance.
 
 A task going through a state transition carries one of three possible SLA meanings
-(see [SlaAlgorithm.java](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java) for
+(see [SlaAlgorithm.java](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/sla/SlaAlgorithm.java) for
 sla-to-task-state mapping):
 
 * Task is UP: starts a period where the task is considered to be up and running from the Aurora
@@ -110,7 +110,7 @@ metric that helps track the dependency o
 * Per job - `sla_<job_key>_mtta_ms`
 * Per cluster - `sla_cluster_mtta_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are defined in:
-[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+[ResourceBag.java](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mtta_ms`
     * `sla_cpu_medium_mtta_ms`
@@ -147,7 +147,7 @@ for a task.*
 * Per job - `sla_<job_key>_mtts_ms`
 * Per cluster - `sla_cluster_mtts_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are defined in:
-[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+[ResourceBag.java](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mtts_ms`
     * `sla_cpu_medium_mtts_ms`
@@ -182,7 +182,7 @@ reflecting on the overall time it takes
 * Per job - `sla_<job_key>_mttr_ms`
 * Per cluster - `sla_cluster_mttr_ms`
 * Per instance size (small, medium, large, x-large, xx-large). Size are defined in:
-[ResourceBag.java](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
+[ResourceBag.java](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/resources/ResourceBag.java)
   * By CPU:
     * `sla_cpu_small_mttr_ms`
     * `sla_cpu_medium_mttr_ms`

Modified: aurora/site/source/documentation/latest/index.html.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/index.html.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/index.html.md (original)
+++ aurora/site/source/documentation/latest/index.html.md Tue Sep 11 05:25:44 2018
@@ -28,6 +28,7 @@ Description of important Aurora features
  * [Services](features/services/)
  * [Service Discovery](features/service-discovery/)
  * [SLA Metrics](features/sla-metrics/)
+ * [SLA Requirements](features/sla-requirements/)
  * [Webhooks](features/webhooks/)
 
 ## Operators

Modified: aurora/site/source/documentation/latest/operations/backup-restore.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/operations/backup-restore.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/operations/backup-restore.md (original)
+++ aurora/site/source/documentation/latest/operations/backup-restore.md Tue Sep 11 05:25:44 2018
@@ -18,74 +18,63 @@ so any tasks that have been rescheduled
 Instructions below have been verified in [Vagrant environment](../../getting-started/vagrant/) and with minor
 syntax/path changes should be applicable to any Aurora cluster.
 
-## Preparation
-
 Follow these steps to prepare the cluster for restoring from a backup:
 
-* Stop all scheduler instances
+## Preparation
 
-* Consider blocking external traffic on a port defined in `-http_port` for all schedulers to
-prevent users from interacting with the scheduler during the restoration process. This will help
-troubleshooting by reducing the scheduler log noise and prevent users from making changes that will
-be erased after the backup snapshot is restored.
-
-* Configure `aurora_admin` access to run all commands listed in
-  [Restore from backup](#restore-from-backup) section locally on the leading scheduler:
-  * Make sure the [clusters.json](../../reference/client-cluster-configuration/) file configured to
-    access scheduler directly. Set `scheduler_uri` setting and remove `zk`. Since leader can get
-    re-elected during the restore steps, consider doing it on all scheduler replicas.
-  * Depending on your particular security approach you will need to either turn off scheduler
-    authorization by removing scheduler `-http_authentication_mechanism` flag or make sure the
-    direct scheduler access is properly authorized. E.g.: in case of Kerberos you will need to make
-    a `/etc/hosts` file change to match your local IP to the scheduler URL configured in keytabs:
-
-        <local_ip> <scheduler_domain_in_keytabs>
-
-* Next steps are required to put scheduler into a partially disabled state where it would still be
-able to accept storage recovery requests but unable to schedule or change task states. This may be
-accomplished by updating the following scheduler configuration options:
-  * Set `-mesos_master_address` to a non-existent zk address. This will prevent scheduler from
-    registering with Mesos. E.g.: `-mesos_master_address=zk://localhost:1111/mesos/master`
-  * `-max_registration_delay` - set to sufficiently long interval to prevent registration timeout
-    and as a result scheduler suicide. E.g: `-max_registration_delay=360mins`
-  * Make sure `-reconciliation_initial_delay` option is set high enough (e.g.: `365days`) to
-    prevent accidental task GC. This is important as scheduler will attempt to reconcile the cluster
-    state and will kill all tasks when restarted with an empty Mesos replicated log.
-
-* Restart all schedulers
-
-## Cleanup and re-initialize Mesos replicated log
-
-Get rid of the corrupted files and re-initialize Mesos replicated log:
-
-* Stop schedulers
-* Delete all files under `-native_log_file_path` on all schedulers
-* Initialize Mesos replica's log file: `sudo mesos-log initialize --path=<-native_log_file_path>`
-* Start schedulers
+* Stop all scheduler instances.
 
-## Restore from backup
+* Pick a backup to use for rehydrating the mesos-replicated log. Backups can be found in the
+directory given to the scheduler as the `-backup_dir` argument. Backups are stored in the format
+`scheduler-backup-<yyyy-MM-dd-HH-mm>`.
 
-At this point the scheduler is ready to rehydrate from the backup:
+* If running the Aurora Scheduler in HA mode, pick a single scheduler instance to rehydrate.
 
-* Identify the leading scheduler by:
-  * examining the `scheduler_lifecycle_LEADER_AWAITING_REGISTRATION` metric at the scheduler
-    `/vars` endpoint. Leader will have 1. All other replicas - 0.
-  * examining scheduler logs
-  * or examining Zookeeper registration under the path defined by `-zk_endpoints`
-    and `-serverset_path`
-
-* Locate the desired backup file, copy it to the leading scheduler's `-backup_dir` folder and stage
-recovery by running the following command on a leader
-`aurora_admin scheduler_stage_recovery --bypass-leader-redirect <cluster> scheduler-backup-<yyyy-MM-dd-HH-mm>`
-
-* At this point, the recovery snapshot is staged and available for manual verification/modification
-via `aurora_admin scheduler_print_recovery_tasks --bypass-leader-redirect` and
-`scheduler_delete_recovery_tasks --bypass-leader-redirect` commands.
-See `aurora_admin help <command>` for usage details.
-
-* Commit recovery. This instructs the scheduler to overwrite the existing Mesos replicated log with
-the provided backup snapshot and initiate a mandatory failover
-`aurora_admin scheduler_commit_recovery --bypass-leader-redirect  <cluster>`
+* Locate the `recovery-tool` in your setup. If Aurora was installed using a Debian package
+generated by our `aurora-packaging` script, the recovery tool can be found
+in `/usr/share/aurora/bin/recovery-tool`.
 
 ## Cleanup
-Undo any modification done during [Preparation](#preparation) sequence.
+
+* Delete (or move) the Mesos replicated log path for each scheduler instance. The Mesos replicated
+log file path for each instance is the value given to its `-native_log_file_path` flag.
+
+* Initialize the Mesos replicated log files using the mesos-log tool:
+```
+sudo -u <USER> mesos-log initialize --path=<native_log_file_path>
+```
+Where `USER` is the user under which the scheduler instance will be run. For installations using
+Debian packages, the default user will be `aurora`. You may alternatively choose to specify
+a group as well by passing the `-g <GROUP>` option to `sudo`.
+Note that if the user under which the Aurora scheduler instance is run _does not_ have permissions
+to read this directory and the files it contains, the instance will fail to start.
+
+## Restore from backup
+
+* Run the `recovery-tool`. For any flags that are also used by the scheduler instance, pass the
+same values:
+```
+$ recovery-tool -from BACKUP \
+  -to LOG \
+  -backup=<selected_backup_location> \
+  -native_log_zk_group_path=<native_log_zk_group_path> \
+  -native_log_file_path=<native_log_file_path> \
+  -zk_endpoints=<zk_endpoints>
+```
+
+## Bring scheduler instances back online
+
+### If running in HA Mode
+
+* Start the rehydrated scheduler instance along with enough cleaned-up instances to
+meet the `-native_log_quorum_size`. The Mesos replicated log algorithm will replenish
+the "blank" scheduler instances with the information from the rehydrated instance.
+
+* Start any remaining scheduler instances.
+
+### If running in singleton mode
+
+* Start the single scheduler instance.
+
+

Modified: aurora/site/source/documentation/latest/operations/configuration.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/operations/configuration.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/operations/configuration.md (original)
+++ aurora/site/source/documentation/latest/operations/configuration.md Tue Sep 11 05:25:44 2018
@@ -104,7 +104,7 @@ can furthermore help with storage perfor
 ### `-native_log_zk_group_path`
 ZooKeeper path used for Mesos replicated log quorum discovery.
 
-See [code](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java) for
+See [code](https://github.com/apache/aurora/blob/master/src/main/java/org/apache/aurora/scheduler/log/mesos/MesosLogStreamModule.java) for
 other available Mesos replicated log configuration options and default values.
 
 ### Changing the Quorum Size
@@ -167,7 +167,7 @@ the latter needs to be enabled via:
 
     -enable_revocable_ram=true
 
-Unless you want to use the [default](https://github.com/apache/aurora/blob/rel/0.20.0/src/main/resources/org/apache/aurora/scheduler/tiers.json)
+Unless you want to use the [default](https://github.com/apache/aurora/blob/master/src/main/resources/org/apache/aurora/scheduler/tiers.json)
 tier configuration, you will also have to specify a file path:
 
     -tier_config=path/to/tiers/config.json
@@ -312,19 +312,19 @@ increased).
 
 To enable this in the Scheduler, you can set the following options:
 
-    --enable_update_affinity=true
-    --update_affinity_reservation_hold_time=3mins
+    -enable_update_affinity=true
+    -update_affinity_reservation_hold_time=3mins
 
 You will need to tune the hold time to match the behavior you see in your cluster. If you have extremely
 high update throughput, you might have to extend it as processing updates could easily add significant
 delays between scheduling attempts. You may also have to tune scheduling parameters to achieve the
 throughput you need in your cluster. Some relevant settings (with defaults) are:
 
-    --max_schedule_attempts_per_sec=40
-    --initial_schedule_penalty=1secs
-    --max_schedule_penalty=1mins
-    --scheduling_max_batch_size=3
-    --max_tasks_per_schedule_attempt=5
+    -max_schedule_attempts_per_sec=40
+    -initial_schedule_penalty=1secs
+    -max_schedule_penalty=1mins
+    -scheduling_max_batch_size=3
+    -max_tasks_per_schedule_attempt=5
 
 There are metrics exposed by the Scheduler which can provide guidance on where the bottleneck is.
 Example metrics to look at:
@@ -337,3 +337,44 @@ Example metrics to look at:
 Most likely you'll run into limits with the number of update instances that can be processed per minute
 before you run into any other limits. So if your total work done per minute starts to exceed 2k instances,
 you may need to extend the update_affinity_reservation_hold_time.
+
+## Cluster Maintenance
+
+Aurora performs maintenance-related task drains. How often the scheduler polls for pending
+maintenance work can be controlled via the scheduler option:
+
+    -host_maintenance_polling_interval=1min
+
+## Enforcing SLA limitations
+
+Since tasks can specify their own `SLAPolicy`, the cluster needs to limit these SLA requirements.
+Too aggressive a requirement can permanently block any type of maintenance work
+(e.g., OS/kernel/security upgrades) on a host and hold it hostage.
+
+An operator can control the limits for SLA requirements via these scheduler configuration options:
+
+    -max_sla_duration_secs=2hrs
+    -min_required_instances_for_sla_check=20
+
+_Note: These limits only apply for `CountSlaPolicy` and `PercentageSlaPolicy`._
+
+### Limiting Coordinator SLA
+
+With `CoordinatorSlaPolicy` the SLA calculation is off-loaded to an external HTTP service. Some
+relevant scheduler configuration options are:
+
+    -sla_coordinator_timeout=1min
+    -max_parallel_coordinated_maintenance=10
+
+Handing off the SLA calculation to an external service can potentially block maintenance
+on hosts for an indefinite amount of time (either due to a mis-configured coordinator or due to
+a legitimately degraded service). In those situations the following metrics will be helpful to
+identify the offending tasks.
+
+    sla_coordinator_user_errors_*     (counter tracking number of times the coordinator for the task
+                                       returned a bad response.)
+    sla_coordinator_errors_*          (counter tracking number of times the scheduler was not able
+                                       to communicate with the coordinator of the task.)
+    sla_coordinator_lock_starvation_* (counter tracking number of times the scheduler was not able to
+                                       get the lock for the coordinator of the task.)
+

Modified: aurora/site/source/documentation/latest/reference/configuration.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/reference/configuration.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/reference/configuration.md (original)
+++ aurora/site/source/documentation/latest/reference/configuration.md Tue Sep 11 05:25:44 2018
@@ -23,6 +23,7 @@ configuration design.
     - [Announcer Objects](#announcer-objects)
     - [Container Objects](#container)
     - [LifecycleConfig Objects](#lifecycleconfig-objects)
+    - [SlaPolicy Objects](#slapolicy-objects)
 - [Specifying Scheduling Constraints](#specifying-scheduling-constraints)
 - [Template Namespaces](#template-namespaces)
     - [mesos Namespace](#mesos-namespace)
@@ -343,7 +344,7 @@ Job Schema
   ```contact``` | String | Best email address to reach the owner of the job. For production jobs, this is usually a team mailing list.
   ```instances```| Integer | Number of instances (sometimes referred to as replicas or shards) of the task to create. (Default: 1)
   ```cron_schedule``` | String | Cron schedule in cron format. May only be used with non-service jobs. See [Cron Jobs](../../features/cron-jobs/) for more information. Default: None (not a cron job.)
-  ```cron_collision_policy``` | String | Policy to use when a cron job is triggered while a previous run is still active. KILL_EXISTING Kill the previous run, and schedule the new run CANCEL_NEW Let the previous run continue, and cancel the new run. (Default: KILL_EXISTING)
+  ```cron_collision_policy``` | String | Policy to use when a cron job is triggered while a previous run is still active. KILL\_EXISTING Kill the previous run, and schedule the new run CANCEL\_NEW Let the previous run continue, and cancel the new run. (Default: KILL_EXISTING)
   ```update_config``` | ```UpdateConfig``` object | Parameters for controlling the rate and policy of rolling updates.
   ```constraints``` | dict | Scheduling constraints for the tasks. See the section on the [constraint specification language](#specifying-scheduling-constraints)
   ```service``` | Boolean | If True, restart tasks regardless of success or failure. (Default: False)
@@ -359,6 +360,7 @@ Job Schema
   ```partition_policy``` | ```PartitionPolicy``` object | An optional partition policy that allows job owners to define how to handle partitions for running tasks (in partition-aware Aurora clusters)
   ```metadata``` | list of ```Metadata``` objects | list of ```Metadata``` objects for user's customized metadata information.
   ```executor_config``` | ```ExecutorConfig``` object | Allows choosing an alternative executor defined in `custom_executor_config` to be used instead of Thermos. Tasks will be launched with Thermos as the executor by default. See [Custom Executors](../../features/custom-executors/) for more info.
+  ```sla_policy``` |  Choice of ```CountSlaPolicy```, ```PercentageSlaPolicy``` or ```CoordinatorSlaPolicy``` object | An optional SLA policy that allows job owners to describe the SLA requirements for the job. See [SlaPolicy Objects](#slapolicy-objects) for more information.
 
 
 ### UpdateConfig Objects
@@ -374,6 +376,35 @@ Parameters for controlling the rate and
 | ```rollback_on_failure```    | boolean  | When False, prevents auto rollback of a failed update (Default: True)
 | ```wait_for_batch_completion```| boolean | When True, all threads from a given batch will be blocked from picking up new instances until the entire batch is updated. This essentially simulates the legacy sequential updater algorithm. (Default: False)
 | ```pulse_interval_secs```    | Integer  |  Indicates a [coordinated update](../../features/job-updates/#coordinated-job-updates). If no pulses are received within the provided interval the update will be blocked. Beta-updater only. Will fail on submission when used with client updater. (Default: None)
+| ```sla_aware```              | boolean  | When True, updates will only update an instance if it does not break the task's specified [SLA Requirements](../../features/sla-requirements/). (Default: None)
+
+#### Using the `sla_aware` option
+
+There are some nuances around the `sla_aware` option that users should be aware of (see the sketch after this list):
+
+- SLA-aware updates work in tandem with maintenance. Draining a host that has an instance of the
+job being updated affects the SLA and thus will be taken into account when the update determines
+whether or not it is safe to update another instance.
+- SLA-aware updates will use the [SLAPolicy](../../features/sla-requirements/#custom-sla) of the
+*newest* configuration when determining whether or not it is safe to update an instance. For
+example, if the current configuration specifies a
+[PercentageSlaPolicy](../../features/sla-requirements/#percentageslapolicy-objects) that allows for
+5% of instances to be down and the updated configuration increases this value to 10%, the SLA
+calculation will be done using the 10% policy. Be mindful of this when doing an update that
+modifies the `SLAPolicy` since it may be possible to put the old configuration in a bad state
+that the new configuration would not be affected by. Additionally, if the update is rolled back,
+then the rollback will use the old `SLAPolicy` (or none if there was not one previously).
+- If using the [CoordinatorSlaPolicy](../../features/sla-requirements/#coordinatorslapolicy-objects),
+it is important to pay attention to the `batch_size` of the update. If you have a complex SLA
+requirement, then you may be limiting the throughput of your updates with an insufficient
+`batch_size`. For example, imagine you have a job with 9 instances that represent three
+replicated caches, and you can only update one instance per replica set: `[0 1 2]
+[3 4 5] [6 7 8]` (the number indicates the instance ID and the brackets represent replica
+sets). If your `batch_size` is 3, then you will slowly update one replica set at a time. If your
+`batch_size` is 9, then you can update all replica sets in parallel, thus speeding up the update.
+- If an instance fails an SLA check for an update, then it will be rechecked starting at a delay
+from `sla_aware_kill_retry_min_delay` and exponentially increasing up to
+`sla_aware_kill_retry_max_delay`. These values are set by the cluster operator.
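+
+A hypothetical end-to-end sketch (the task definition `hello_task` and all values are
+placeholders) tying these options together:
+
+    jobs = [Service(
+      cluster = 'devcluster',
+      role = 'www-data',
+      environment = 'prod',
+      name = 'hello',
+      instances = 30,
+      task = hello_task,
+      # Allow at most 5% of instances to be inactive over any 30 minute window.
+      sla_policy = PercentageSlaPolicy(percentage = 95, duration_secs = 1800),
+      # The update will only proceed on an instance if the SLA policy above is maintained.
+      update_config = UpdateConfig(batch_size = 5, sla_aware = True)
+    )]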
 
 ### HealthCheckConfig Objects
 
@@ -564,7 +595,7 @@ See [Docker Command Line Reference](http
   ```graceful_shutdown_wait_secs``` | Integer | The amount of time (in seconds) to wait after hitting the ```graceful_shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5)
   ```shutdown_wait_secs```          | Integer | The amount of time (in seconds) to wait after hitting the ```shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5)
 
-#### graceful_shutdown_endpoint
+#### graceful\_shutdown\_endpoint
 
 If the Job is listening on the port as specified by the HttpLifecycleConfig
 (default: `health`), a HTTP POST request will be sent over localhost to this
@@ -581,6 +612,34 @@ does not shut down on its own after `shu
 forcefully killed.
 
 
+### SlaPolicy Objects
+
+Configuration for specifying custom [SLA requirements](../../features/sla-requirements/) for a job. There are three supported SLA policies:
+[`CountSlaPolicy`](#countslapolicy-objects), [`PercentageSlaPolicy`](#percentageslapolicy-objects) and [`CoordinatorSlaPolicy`](#coordinatorslapolicy-objects). A brief example follows the tables below.
+
+
+### CountSlaPolicy Objects
+
+  param                             | type    | description
+  -----                             | :----:  | -----------
+  ```count```                       | Integer | The number of active instances required every `duration_secs`.
+  ```duration_secs```               | Integer | Minimum time duration a task needs to be `RUNNING` to be treated as active.
+
+### PercentageSlaPolicy Objects
+
+  param                             | type    | description
+  -----                             | :----:  | -----------
+  ```percentage```                  | Float   | The percentage of active instances required every `duration_secs`.
+  ```duration_secs```               | Integer | Minimum time duration a task needs to be `RUNNING` to be treated as active.
+
+### CoordinatorSlaPolicy Objects
+
+  param                             | type    | description
+  -----                             | :----:  | -----------
+  ```coordinator_url```             | String  | The URL to the [Coordinator](../../features/sla-requirements/#coordinator) service to be contacted before performing SLA-affecting actions (job updates, host drains, etc.).
+  ```status_key```                  | String  | The field in the Coordinator response that indicates the SLA status for working on the task. (Default: `drain`)
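+
+For illustration, hypothetical instantiations of each policy for a job's `sla_policy` field
+(the coordinator URL and all values are placeholders):
+
+    # Keep at least 8 instances active over every 30 minute window.
+    CountSlaPolicy(count = 8, duration_secs = 1800)
+
+    # Keep at least 95% of instances active over every 30 minute window.
+    PercentageSlaPolicy(percentage = 95, duration_secs = 1800)
+
+    # Ask an external coordinator service before any SLA-affecting action.
+    CoordinatorSlaPolicy(
+      coordinator_url = 'http://localhost:8080/coordinator',
+      status_key = 'drain'
+    )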
+
+
 Specifying Scheduling Constraints
 =================================
 

Modified: aurora/site/source/documentation/latest/reference/scheduler-configuration.md
URL: http://svn.apache.org/viewvc/aurora/site/source/documentation/latest/reference/scheduler-configuration.md?rev=1840514&r1=1840513&r2=1840514&view=diff
==============================================================================
--- aurora/site/source/documentation/latest/reference/scheduler-configuration.md (original)
+++ aurora/site/source/documentation/latest/reference/scheduler-configuration.md Tue Sep 11 05:25:44 2018
@@ -106,6 +106,8 @@ Optional flags:
 	Minimum guaranteed time for task history retention before any pruning is attempted.
 -history_prune_threshold (default (2, days))
 	Time after which the scheduler will prune terminated task history.
+-host_maintenance_polling_interval (default (1, minute))
+	Interval between polling for pending host maintenance requests.
 -hostname
 	The hostname to advertise in ZooKeeper instead of the locally-resolved hostname.
 -http_authentication_mechanism (default NONE)
@@ -134,6 +136,8 @@ Optional flags:
 	Maximum delay between attempts to schedule a flapping task.
 -max_leading_duration (default (1, days))
 	After leading for this duration, the scheduler should commit suicide.
+-max_parallel_coordinated_maintenance (default 10)
+	Maximum number of coordinators that can be contacted in parallel.
 -max_registration_delay (default (1, mins))
 	Max allowable delay to allow the driver to register before aborting
 -max_reschedule_task_delay_on_startup (default (30, secs))
@@ -144,6 +148,8 @@ Optional flags:
 	Maximum number of scheduling attempts to make per second.
 -max_schedule_penalty (default (1, mins))
 	Maximum delay between attempts to schedule a PENDING task.
+-max_sla_duration_secs (default (2, hrs))
+	Maximum duration window for which SLA requirements are to be satisfied. This does not apply to jobs that have a CoordinatorSlaPolicy.
 -max_status_update_batch_size (default 1000) [must be > 0]
 	The maximum number of status updates that can be processed in a batch.
 -max_task_event_batch_size (default 300) [must be > 0]
@@ -156,6 +162,8 @@ Optional flags:
 	Upper limit on the number of failures allowed during a job update. This helps cap potentially unbounded entries into storage.
 -min_offer_hold_time (default (5, mins))
 	Minimum amount of time to hold a resource offer before declining.
+-min_required_instances_for_sla_check (default 20)
+	Minimum number of instances required for a job to be eligible for SLA check. This does not apply to jobs that have a CoordinatorSlaPolicy.
 -native_log_election_retries (default 20)
 	The maximum number of attempts to obtain a new log writer.
 -native_log_election_timeout (default (15, secs))
@@ -214,6 +222,14 @@ Optional flags:
 	Path to shiro.ini for authentication and authorization configuration.
 -shiro_realm_modules (default [class org.apache.aurora.scheduler.http.api.security.IniShiroRealmModule])
 	Guice modules for configuring Shiro Realms.
+-sla_aware_action_max_batch_size (default 300) [must be > 0]
+	The maximum number of sla aware update actions that can be processed in a batch.
+-sla_aware_kill_retry_min_delay (default (1, min)) [must be > 0]
+	The minimum amount of time to wait before retrying an SLA-aware kill (using a truncated binary backoff).
+-sla_aware_kill_retry_max_delay (default (5, min)) [must be > 0]
+	The maximum amount of time to wait before retrying an SLA-aware kill (using a truncated binary backoff).
+-sla_coordinator_timeout (default (1, min)) [must be > 0]
+	Timeout interval for communicating with Coordinator.
 -sla_non_prod_metrics (default [])
 	Metric categories collected for non production tasks.
 -sla_prod_metrics (default [JOB_UPTIMES, PLATFORM_UPTIME, MEDIANS])