Posted to commits@aurora.apache.org by sa...@apache.org on 2017/06/13 18:01:20 UTC

aurora git commit: Adding Configurable Wait Period for Graceful Shutdowns

Repository: aurora
Updated Branches:
  refs/heads/master 40d9d4dbe -> cb86e8358


Adding Configurable Wait Period for Graceful Shutdowns

We have some services that require more than the current 10 seconds given to
shut down gracefully (they need to close resources, finish requests, etc.).

We would like to be able to configure the amount of time we wait between each
stage of the graceful shutdown sequence. See this [proposal](https://docs.google.com/document/d/1Sl-KWNyt1j0nIndinqfJsH3pkUY5IYXfGWyLHU2wacs/edit?usp=sharing) for a more in-depth
analysis.
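
As a rough illustration of the per-stage wait this change introduces, the
sketch below mimics the polling loop added to `HttpLifecycleManager` in the
diff that follows. It is not the executor code itself: `FakeClock`, `escalate`,
`signal` and `has_exited` are hypothetical names chosen for this example, and
the one-second poll interval mirrors `WAIT_POLL_INTERVAL`.

```python
class FakeClock(object):
    """Hypothetical clock that records sleeps instead of blocking."""

    def __init__(self):
        self.sleeps = []

    def sleep(self, secs):
        self.sleeps.append(secs)


POLL_INTERVAL_SECS = 1  # assumption: mirrors WAIT_POLL_INTERVAL in the diff


def escalate(stages, signal, has_exited, clock):
    """Walk the HTTP escalation stages.

    stages:     list of (endpoint, wait_secs) pairs, in escalation order
    signal:     callable(endpoint) -> True if the task handled the POST
    has_exited: callable() -> True once the task has reached a terminal state
    Returns True if the task exited during escalation, False otherwise.
    """
    for endpoint, wait_secs in stages:
        handled = signal(endpoint)
        waited = 0
        # Only wait if the endpoint was handled; poll so an early exit is
        # noticed without sleeping for the full wait period.
        while handled:
            if has_exited():
                return True
            if waited >= wait_secs:
                break
            clock.sleep(POLL_INTERVAL_SECS)
            waited += POLL_INTERVAL_SECS
    return False


# A task that handles both endpoints but never exits waits out both stages.
stubborn_clock = FakeClock()
stages = [('/quitquitquit', 1), ('/abortabortabort', 5)]
stubborn_exited = escalate(stages, lambda endpoint: True, lambda: False, stubborn_clock)

# A task that only handles the abort endpoint and exits after three polls.
polls = [0]


def exits_after_three_polls():
    polls[0] += 1
    return polls[0] > 3


early_clock = FakeClock()
early_exited = escalate(
    stages, lambda endpoint: endpoint == '/abortabortabort',
    exits_after_three_polls, early_clock)
```

Polling in small increments, rather than sleeping once for the whole period,
is what lets a long configured wait end early when the task exits on its own.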

Testing Done:
Ran unit and integration tests.

Created and killed jobs with varying wait_escalation_secs values on the Vagrant devcluster.
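
When picking wait values to test with, it helps to keep in mind the bound
documented in this change: the two waits (named `graceful_shutdown_wait_secs`
and `shutdown_wait_secs` in the diff below) are together capped by the
executor's `--stop_timeout_in_secs` flag. The helper below is a hypothetical
sanity check, not part of Aurora; it uses the defaults from this commit
(5, 5 and 120 seconds) and deliberately ignores any other time spent during
shutdown, such as `finalization_wait`.

```python
def shutdown_budget(graceful_shutdown_wait_secs=5,
                    shutdown_wait_secs=5,
                    stop_timeout_in_secs=120):
    """Return (total_wait_secs, fits_within_stop_timeout).

    Simplifying assumption: only the two HTTP lifecycle waits count
    against the executor's stop timeout.
    """
    total = graceful_shutdown_wait_secs + shutdown_wait_secs
    return total, total <= stop_timeout_in_secs


default_budget = shutdown_budget()
oversized_budget = shutdown_budget(graceful_shutdown_wait_secs=90,
                                   shutdown_wait_secs=60)
```

With the defaults the combined wait is 10 seconds and fits comfortably; waits
of 90 and 60 seconds add up to 150 seconds, which exceeds the default stop
timeout, so the task would be killed earlier than configured.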

Bugs closed: AURORA-1931

Reviewed at https://reviews.apache.org/r/59733/


Project: http://git-wip-us.apache.org/repos/asf/aurora/repo
Commit: http://git-wip-us.apache.org/repos/asf/aurora/commit/cb86e835
Tree: http://git-wip-us.apache.org/repos/asf/aurora/tree/cb86e835
Diff: http://git-wip-us.apache.org/repos/asf/aurora/diff/cb86e835

Branch: refs/heads/master
Commit: cb86e835818202f2665730ef5036976a3328075e
Parents: 40d9d4d
Author: Jordan Ly <jo...@gmail.com>
Authored: Tue Jun 13 11:00:45 2017 -0700
Committer: Santhosh Kumar <ss...@twitter.com>
Committed: Tue Jun 13 11:00:45 2017 -0700

----------------------------------------------------------------------
 RELEASE-NOTES.md                                | 18 ++++-
 docs/reference/configuration.md                 | 21 +++--
 docs/reference/task-lifecycle.md                |  6 +-
 .../python/apache/aurora/config/schema/base.py  |  6 ++
 .../apache/aurora/executor/aurora_executor.py   |  9 ++-
 .../executor/bin/thermos_executor_main.py       | 15 +++-
 .../apache/aurora/executor/http_lifecycle.py    | 28 +++++--
 .../apache/aurora/client/cli/test_inspect.py    |  4 +-
 .../bin/test_thermos_executor_entry_point.py    |  1 +
 .../aurora/executor/test_http_lifecycle.py      | 11 ++-
 .../aurora/executor/test_thermos_executor.py    | 41 +++++++---
 .../aurora/executor/test_thermos_task_runner.py | 83 +++++++++++++++++---
 12 files changed, 195 insertions(+), 48 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/RELEASE-NOTES.md
----------------------------------------------------------------------
diff --git a/RELEASE-NOTES.md b/RELEASE-NOTES.md
index d14c4ad..87283de 100644
--- a/RELEASE-NOTES.md
+++ b/RELEASE-NOTES.md
@@ -1,8 +1,24 @@
-0.18.0
+0.19.0 (unreleased)
 ===================
 
 ### New/updated:
 
+- Added the ability to configure the executor's stop timeout, which is the maximum amount of time
+  the executor will wait during a graceful shutdown sequence before continuing the 'Forceful
+  Termination' process (see
+  [here](http://aurora.apache.org/documentation/latest/reference/task-lifecycle/) for details).
+- Added the ability to configure the wait period after calling the graceful shutdown endpoint and
+  the shutdown endpoint using the `graceful_shutdown_wait_secs` and `shutdown_wait_secs` fields in
+  `HttpLifecycleConfig` respectively. Previously, the executor would only wait 5 seconds between
+  steps (adding up to a total of 10 seconds as there are 2 steps). The overall waiting period is
+  bounded by the executor's stop timeout, which can be configured using the executor's
+  `stop_timeout_in_secs` flag.
+
+0.18.0
+======
+
+### New/updated:
+
 - Update to Mesos 1.2.0. Please upgrade Aurora to 0.18 before upgrading Mesos to 1.2.0 if you rely
   on Mesos filesystem images.
 - Add message parameter to `killTasks` RPC.

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/docs/reference/configuration.md
----------------------------------------------------------------------
diff --git a/docs/reference/configuration.md b/docs/reference/configuration.md
index 0040de1..6a9a3ff 100644
--- a/docs/reference/configuration.md
+++ b/docs/reference/configuration.md
@@ -528,26 +528,31 @@ See [Docker Command Line Reference](https://docs.docker.com/reference/commandlin
 
 ### HttpLifecycleConfig Objects
 
-  param          | type            | description
-  -----          | :----:          | -----------
-  ```port```     | String          | The named port to send POST commands (Default: health)
-  ```graceful_shutdown_endpoint``` | String | Endpoint to hit to indicate that a task should gracefully shutdown. (Default: /quitquitquit)
-  ```shutdown_endpoint``` | String | Endpoint to hit to give a task its final warning before being killed. (Default: /abortabortabort)
+*Note: The combined `graceful_shutdown_wait_secs` and `shutdown_wait_secs` is implicitly upper bounded by the `--stop_timeout_in_secs` flag exposed by the executor (see options [here](https://github.com/apache/aurora/blob/master/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py), default is 2 minutes). Therefore, if the user specifies values that add up to more than `--stop_timeout_in_secs`, the task will be killed earlier than the user anticipates (see the termination lifecycle [here](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting)). Furthermore, `stop_timeout_in_secs` itself is implicitly upper bounded by two scheduler options: `transient_task_state_timeout` and `preemption_slot_hold_time` (see reference [here](http://aurora.apache.org/documentation/latest/reference/scheduler-configuration/)). If the `stop_timeout_in_secs` exceeds either of these scheduler options, tasks could be designated as LOST or tasks utilizing preemption could lose their desired slot respectively. Cluster operators should be aware of these timings should they change the defaults.*
+
+  param                             | type    | description
+  -----                             | :----:  | -----------
+  ```port```                        | String  | The named port to send POST commands. (Default: health)
+  ```graceful_shutdown_endpoint```  | String  | Endpoint to hit to indicate that a task should gracefully shutdown. (Default: /quitquitquit)
+  ```shutdown_endpoint```           | String  | Endpoint to hit to give a task its final warning before being killed. (Default: /abortabortabort)
+  ```graceful_shutdown_wait_secs``` | Integer | The amount of time (in seconds) to wait after hitting the ```graceful_shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5)
+  ```shutdown_wait_secs```          | Integer | The amount of time (in seconds) to wait after hitting the ```shutdown_endpoint``` before proceeding with the [task termination lifecycle](https://aurora.apache.org/documentation/latest/reference/task-lifecycle/#forceful-termination-killing-restarting). (Default: 5)
 
 #### graceful_shutdown_endpoint
 
 If the Job is listening on the port as specified by the HttpLifecycleConfig
 (default: `health`), a HTTP POST request will be sent over localhost to this
 endpoint to request that the task gracefully shut itself down.  This is a
-courtesy call before the `shutdown_endpoint` is invoked a fixed amount of
-time later.
+courtesy call before the `shutdown_endpoint` is invoked
+`graceful_shutdown_wait_secs` seconds later.
 
 #### shutdown_endpoint
 
 If the Job is listening on the port as specified by the HttpLifecycleConfig
 (default: `health`), a HTTP POST request will be sent over localhost to this
 endpoint to request as a final warning before being shut down.  If the task
-does not shut down on its own after this, it will be forcefully killed
+does not shut down on its own after `shutdown_wait_secs` seconds, it will be
+forcefully killed.
 
 
 Specifying Scheduling Constraints

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/docs/reference/task-lifecycle.md
----------------------------------------------------------------------
diff --git a/docs/reference/task-lifecycle.md b/docs/reference/task-lifecycle.md
index cf1b679..8ec0077 100644
--- a/docs/reference/task-lifecycle.md
+++ b/docs/reference/task-lifecycle.md
@@ -81,8 +81,10 @@ In any case, the responsible executor on the agent follows an escalation
 sequence when killing a running task:
 
   1. If a `HttpLifecycleConfig` is not present, skip to (4).
-  2. Send a POST to the `graceful_shutdown_endpoint` and wait 5 seconds.
-  3. Send a POST to the `shutdown_endpoint` and wait 5 seconds.
+  2. Send a POST to the `graceful_shutdown_endpoint` and wait
+  `graceful_shutdown_wait_secs` seconds.
+  3. Send a POST to the `shutdown_endpoint` and wait
+  `shutdown_wait_secs` seconds.
   4. Send SIGTERM (`kill`) and wait at most `finalization_wait` seconds.
   5. Send SIGKILL (`kill -9`).
 

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/config/schema/base.py
----------------------------------------------------------------------
diff --git a/src/main/python/apache/aurora/config/schema/base.py b/src/main/python/apache/aurora/config/schema/base.py
index b2692a6..18ce826 100644
--- a/src/main/python/apache/aurora/config/schema/base.py
+++ b/src/main/python/apache/aurora/config/schema/base.py
@@ -74,6 +74,12 @@ class HttpLifecycleConfig(Struct):
   # Endpoint to hit to give a task it's final warning before being killed.
   shutdown_endpoint = Default(String, '/abortabortabort')
 
+  # How much time to wait in seconds after calling the graceful shutdown endpoint
+  graceful_shutdown_wait_secs = Default(Integer, 5)
+
+  # How much time to wait in seconds after calling the shutdown endpoint
+  shutdown_wait_secs = Default(Integer, 5)
+
 
 class LifecycleConfig(Struct):
   http = HttpLifecycleConfig

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/executor/aurora_executor.py
----------------------------------------------------------------------
diff --git a/src/main/python/apache/aurora/executor/aurora_executor.py b/src/main/python/apache/aurora/executor/aurora_executor.py
index 81461cb..aeb3a4d 100644
--- a/src/main/python/apache/aurora/executor/aurora_executor.py
+++ b/src/main/python/apache/aurora/executor/aurora_executor.py
@@ -39,7 +39,6 @@ class AuroraExecutor(ExecutorBase, Observable):
   PERSISTENCE_WAIT = Amount(5, Time.SECONDS)
   SANDBOX_INITIALIZATION_TIMEOUT = Amount(10, Time.MINUTES)
   START_TIMEOUT = Amount(2, Time.MINUTES)
-  STOP_TIMEOUT = Amount(2, Time.MINUTES)
   STOP_WAIT = Amount(5, Time.SECONDS)
 
   def __init__(
@@ -50,7 +49,8 @@ class AuroraExecutor(ExecutorBase, Observable):
       status_providers=(),
       clock=time,
       no_sandbox_create_user=False,
-      sandbox_mount_point=None):
+      sandbox_mount_point=None,
+      stop_timeout_in_secs=120):
 
     ExecutorBase.__init__(self)
     if not isinstance(runner_provider, TaskRunnerProvider):
@@ -67,6 +67,7 @@ class AuroraExecutor(ExecutorBase, Observable):
     self._sandbox_provider = sandbox_provider
     self._no_sandbox_create_user = no_sandbox_create_user
     self._sandbox_mount_point = sandbox_mount_point
+    self._stop_timeout = Amount(stop_timeout_in_secs, Time.SECONDS)
     self._kill_manager = KillManager()
     # Events that are exposed for interested entities
     self.runner_aborted = threading.Event()
@@ -206,7 +207,7 @@ class AuroraExecutor(ExecutorBase, Observable):
     runner_status = self._runner.status
 
     try:
-      propagate_deadline(self._chained_checker.stop, timeout=self.STOP_TIMEOUT)
+      propagate_deadline(self._chained_checker.stop, timeout=self._stop_timeout)
     except Timeout:
       log.error('Failed to stop all checkers within deadline.')
     except Exception:
@@ -214,7 +215,7 @@ class AuroraExecutor(ExecutorBase, Observable):
       log.error(traceback.format_exc())
 
     try:
-      propagate_deadline(self._runner.stop, timeout=self.STOP_TIMEOUT)
+      propagate_deadline(self._runner.stop, timeout=self._stop_timeout)
     except Timeout:
       log.error('Failed to stop runner within deadline.')
     except Exception:

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py
----------------------------------------------------------------------
diff --git a/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py b/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py
index c6c0898..a191cf9 100644
--- a/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py
+++ b/src/main/python/apache/aurora/executor/bin/thermos_executor_main.py
@@ -188,6 +188,15 @@ app.add_option(
      action='store_true',
      help="Preserve thermos runners' environment variables for the task being run.")
 
+app.add_option(
+    '--stop_timeout_in_secs',
+    dest='stop_timeout_in_secs',
+    type=int,
+    default=120,
+    help='The maximum amount of time to wait (in seconds) when gracefully killing a task before '
+         'beginning forceful termination. Graceful and forceful termination is defined in '
+         'HttpLifecycleConfig (see Task Lifecycle documentation for more info on termination).')
+
 
 # TODO(wickman) Consider just having the OSS version require pip installed
 # thermos_runner binaries on every machine and instead of embedding the pex
@@ -254,7 +263,8 @@ def initialize(options):
       status_providers=status_providers,
       sandbox_provider=UserOverrideDirectorySandboxProvider(options.execute_as_user),
       no_sandbox_create_user=options.no_create_user,
-      sandbox_mount_point=options.sandbox_mount_point
+      sandbox_mount_point=options.sandbox_mount_point,
+      stop_timeout_in_secs=options.stop_timeout_in_secs
     )
   else:
     thermos_runner_provider = DefaultThermosTaskRunnerProvider(
@@ -273,7 +283,8 @@ def initialize(options):
       runner_provider=thermos_runner_provider,
       status_providers=status_providers,
       no_sandbox_create_user=options.no_create_user,
-      sandbox_mount_point=options.sandbox_mount_point
+      sandbox_mount_point=options.sandbox_mount_point,
+      stop_timeout_in_secs=options.stop_timeout_in_secs
     )
 
   return thermos_executor

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/main/python/apache/aurora/executor/http_lifecycle.py
----------------------------------------------------------------------
diff --git a/src/main/python/apache/aurora/executor/http_lifecycle.py b/src/main/python/apache/aurora/executor/http_lifecycle.py
index 9280bf2..01e4d9a 100644
--- a/src/main/python/apache/aurora/executor/http_lifecycle.py
+++ b/src/main/python/apache/aurora/executor/http_lifecycle.py
@@ -25,7 +25,8 @@ from .common.task_runner import TaskError, TaskRunner
 class HttpLifecycleManager(TaskRunner):
   """A wrapper around a TaskRunner that performs HTTP lifecycle management."""
 
-  ESCALATION_WAIT = Amount(5, Time.SECONDS)
+  DEFAULT_ESCALATION_WAIT = Amount(5, Time.SECONDS)
+  WAIT_POLL_INTERVAL = Amount(1, Time.SECONDS)
 
   @classmethod
   def wrap(cls, runner, task_instance, portmap):
@@ -36,6 +37,14 @@ class HttpLifecycleManager(TaskRunner):
 
     http_lifecycle = task_instance.lifecycle().http()
     http_lifecycle_port = http_lifecycle.port().get()
+    graceful_shutdown_wait_secs = (
+        Amount(http_lifecycle.graceful_shutdown_wait_secs().get(), Time.SECONDS)
+        if http_lifecycle.has_graceful_shutdown_wait_secs()
+        else cls.DEFAULT_ESCALATION_WAIT)
+    shutdown_wait_secs = (
+        Amount(http_lifecycle.shutdown_wait_secs().get(), Time.SECONDS)
+        if http_lifecycle.has_shutdown_wait_secs()
+        else cls.DEFAULT_ESCALATION_WAIT)
 
     if not portmap or http_lifecycle_port not in portmap:
       # If DefaultLifecycle is ever to disable task lifecycle by default, we should
@@ -44,8 +53,8 @@ class HttpLifecycleManager(TaskRunner):
       return runner
 
     escalation_endpoints = [
-        http_lifecycle.graceful_shutdown_endpoint().get(),
-        http_lifecycle.shutdown_endpoint().get()
+        (http_lifecycle.graceful_shutdown_endpoint().get(), graceful_shutdown_wait_secs),
+        (http_lifecycle.shutdown_endpoint().get(), shutdown_wait_secs)
     ]
     return cls(runner, portmap[http_lifecycle_port], escalation_endpoints)
 
@@ -63,13 +72,20 @@ class HttpLifecycleManager(TaskRunner):
   def _terminate_http(self):
     http_signaler = HttpSignaler(self._lifecycle_port)
 
-    for endpoint in self._escalation_endpoints:
+    for endpoint, wait_time in self._escalation_endpoints:
       handled, _ = http_signaler(endpoint, use_post_method=True)
+      log.info('Killing task, calling %s and waiting %s, handled is %s' % (
+          endpoint, str(wait_time), str(handled)))
 
-      if handled:
-        self._clock.sleep(self.ESCALATION_WAIT.as_(Time.SECONDS))
+      waited = Amount(0, Time.SECONDS)
+      while handled:
         if self._runner.status is not None:
           return True
+        if waited >= wait_time:
+          break
+
+        self._clock.sleep(self.WAIT_POLL_INTERVAL.as_(Time.SECONDS))
+        waited += self.WAIT_POLL_INTERVAL
 
   # --- public interface
   def start(self, timeout=None):

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/client/cli/test_inspect.py
----------------------------------------------------------------------
diff --git a/src/test/python/apache/aurora/client/cli/test_inspect.py b/src/test/python/apache/aurora/client/cli/test_inspect.py
index 4a23c59..ecefc18 100644
--- a/src/test/python/apache/aurora/client/cli/test_inspect.py
+++ b/src/test/python/apache/aurora/client/cli/test_inspect.py
@@ -142,7 +142,9 @@ Process 'process':
             "http": {
                 "graceful_shutdown_endpoint": "/quitquitquit",
                 "port": "health",
-                "shutdown_endpoint": "/abortabortabort"}},
+                "shutdown_endpoint": "/abortabortabort",
+                "graceful_shutdown_wait_secs": 5,
+                "shutdown_wait_secs": 5}},
         "priority": 0}
 
     mock_output = "\n".join(mock_stdout)

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py
----------------------------------------------------------------------
diff --git a/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py b/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py
index 38deae6..5ad2999 100644
--- a/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py
+++ b/src/test/python/apache/aurora/executor/bin/test_thermos_executor_entry_point.py
@@ -34,6 +34,7 @@ class ThermosExecutorMainTest(unittest.TestCase):
     mock_options.execute_as_user = False
     mock_options.nosetuid = False
     mock_options.announcer_ensemble = None
+    mock_options.stop_timeout_in_secs = 1
     with patch(
         'apache.aurora.executor.bin.thermos_executor_main.dump_runner_pex',
         return_value=mock_dump_runner_pex):

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/test_http_lifecycle.py
----------------------------------------------------------------------
diff --git a/src/test/python/apache/aurora/executor/test_http_lifecycle.py b/src/test/python/apache/aurora/executor/test_http_lifecycle.py
index a967e34..aeba776 100644
--- a/src/test/python/apache/aurora/executor/test_http_lifecycle.py
+++ b/src/test/python/apache/aurora/executor/test_http_lifecycle.py
@@ -15,6 +15,7 @@
 from contextlib import contextmanager
 
 import mock
+from twitter.common.quantity import Amount, Time
 
 from apache.aurora.config.schema.base import HttpLifecycleConfig, LifecycleConfig, MesosTaskInstance
 from apache.aurora.executor.http_lifecycle import HttpLifecycleManager
@@ -53,20 +54,24 @@ def test_http_lifecycle_wrapper_with_lifecycle():
   with make_mocks(mti, {'health': 31337}) as (runner_mock, runner_wrapper, wrapper_init):
     assert isinstance(runner_wrapper, HttpLifecycleManager)
     assert wrapper_init.mock_calls == [
-      mock.call(runner_mock, 31337, ['/quitquitquit', '/abortabortabort'])
+      mock.call(runner_mock, 31337, [('/quitquitquit', Amount(5, Time.SECONDS)),
+        ('/abortabortabort', Amount(5, Time.SECONDS))])
     ]
 
-  # Validate that we can override ports
+  # Validate that we can override ports, endpoints, wait times
   mti = MesosTaskInstance(lifecycle=LifecycleConfig(http=HttpLifecycleConfig(
       port='http',
       graceful_shutdown_endpoint='/frankfrankfrank',
       shutdown_endpoint='/bobbobbob',
+      graceful_shutdown_wait_secs=123,
+      shutdown_wait_secs=456
   )))
   portmap = {'http': 12345, 'admin': 54321}
   with make_mocks(mti, portmap) as (runner_mock, runner_wrapper, wrapper_init):
     assert isinstance(runner_wrapper, HttpLifecycleManager)
     assert wrapper_init.mock_calls == [
-      mock.call(runner_mock, 12345, ['/frankfrankfrank', '/bobbobbob'])
+      mock.call(runner_mock, 12345, [('/frankfrankfrank', Amount(123, Time.SECONDS)),
+        ('/bobbobbob', Amount(456, Time.SECONDS))])
     ]
 
 

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/test_thermos_executor.py
----------------------------------------------------------------------
diff --git a/src/test/python/apache/aurora/executor/test_thermos_executor.py b/src/test/python/apache/aurora/executor/test_thermos_executor.py
index e628ccd..f6ae1be 100644
--- a/src/test/python/apache/aurora/executor/test_thermos_executor.py
+++ b/src/test/python/apache/aurora/executor/test_thermos_executor.py
@@ -42,7 +42,7 @@ from apache.aurora.config.schema.base import (
     Resources,
     Task
 )
-from apache.aurora.executor.aurora_executor import AuroraExecutor
+from apache.aurora.executor.aurora_executor import AuroraExecutor, propagate_deadline
 from apache.aurora.executor.common.executor_timeout import ExecutorTimeout
 from apache.aurora.executor.common.health_checker import HealthCheckerProvider
 from apache.aurora.executor.common.sandbox import DirectorySandbox, SandboxProvider
@@ -218,7 +218,8 @@ def make_executor(
     fast_status=False,
     runner_class=ThermosTaskRunner,
     status_providers=[HealthCheckerProvider()],
-    assert_task_is_running=True):
+    assert_task_is_running=True,
+    stop_timeout_in_secs=120):
 
   status_manager_class = FastStatusManager if fast_status else StatusManager
   runner_provider = make_provider(checkpoint_root, runner_class)
@@ -227,6 +228,7 @@ def make_executor(
       status_manager_class=status_manager_class,
       sandbox_provider=DefaultTestSandboxProvider(),
       status_providers=status_providers,
+      stop_timeout_in_secs=stop_timeout_in_secs
   )
 
   ExecutorTimeout(te.launched, proxy_driver, timeout=Amount(100, Time.MILLISECONDS)).start()
@@ -394,16 +396,35 @@ class TestThermosExecutor(object):
   def test_killTask(self):  # noqa
     proxy_driver = ProxyDriver()
 
-    with temporary_dir() as checkpoint_root:
-      _, executor = make_executor(proxy_driver, checkpoint_root, SLEEP60_MTI)
+    class ProvidedThermosRunnerMatcher(object):
+      """Matcher that ensures a bound method 'stop' from 'ProvidedThermosTaskRunner' is called."""
+
+      def __eq__(self, other):
+        return (type(other.im_self).__name__ == 'ProvidedThermosTaskRunner'
+            and other.__name__ == 'stop')
+
+    with contextlib.nested(
+        temporary_dir(),
+        mock.patch('apache.aurora.executor.aurora_executor.propagate_deadline',
+            wraps=propagate_deadline)) as (checkpoint_root, mock_propagate_deadline):
+
+      _, executor = make_executor(
+          proxy_driver,
+          checkpoint_root,
+          SLEEP60_MTI,
+          stop_timeout_in_secs=123)
       # send two, expect at most one delivered
       executor.killTask(proxy_driver, mesos_pb2.TaskID(value='sleep60-001'))
       executor.killTask(proxy_driver, mesos_pb2.TaskID(value='sleep60-001'))
       executor.terminated.wait()
 
-    updates = proxy_driver.method_calls['sendStatusUpdate']
-    assert len(updates) == 3
-    assert updates[-1][0][0].state == mesos_pb2.TASK_KILLED
+      updates = proxy_driver.method_calls['sendStatusUpdate']
+
+      mock_propagate_deadline.assert_called_with(  # Ensure 'stop' is called with custom timeout.
+          ProvidedThermosRunnerMatcher(),
+          timeout=Amount(123, Time.SECONDS))
+      assert len(updates) == 3
+      assert updates[-1][0][0].state == mesos_pb2.TASK_KILLED
 
   def test_shutdown(self):
     proxy_driver = ProxyDriver()
@@ -618,12 +639,11 @@ class TestThermosExecutor(object):
     with temporary_dir() as tempdir:
       te = FastThermosExecutor(
         runner_provider=make_provider(tempdir, mesos_containerizer_path='/doesnotexist'),
-        sandbox_provider=FileSystemImageTestSandboxProvider())
+        sandbox_provider=FileSystemImageTestSandboxProvider(), stop_timeout_in_secs=1)
       te.launchTask(proxy_driver, make_task(HELLO_WORLD_MTI))
 
       te.SANDBOX_INITIALIZATION_TIMEOUT = Amount(1, Time.MILLISECONDS)
       te.START_TIMEOUT = Amount(10, Time.MILLISECONDS)
-      te.STOP_TIMEOUT = Amount(10, Time.MILLISECONDS)
 
       proxy_driver.wait_stopped()
 
@@ -643,11 +663,10 @@ class TestThermosExecutor(object):
 
       te = FastThermosExecutor(
         runner_provider=make_provider(tempdir, mesos_containerizer_path=tempfile),
-        sandbox_provider=FileSystemImageTestSandboxProvider())
+        sandbox_provider=FileSystemImageTestSandboxProvider(), stop_timeout_in_secs=1)
 
       te.SANDBOX_INITIALIZATION_TIMEOUT = Amount(1, Time.MILLISECONDS)
       te.START_TIMEOUT = Amount(10, Time.MILLISECONDS)
-      te.STOP_TIMEOUT = Amount(10, Time.MILLISECONDS)
 
       te.launchTask(proxy_driver, make_task(HELLO_WORLD_MTI))
 

http://git-wip-us.apache.org/repos/asf/aurora/blob/cb86e835/src/test/python/apache/aurora/executor/test_thermos_task_runner.py
----------------------------------------------------------------------
diff --git a/src/test/python/apache/aurora/executor/test_thermos_task_runner.py b/src/test/python/apache/aurora/executor/test_thermos_task_runner.py
index 1b92667..7096557 100644
--- a/src/test/python/apache/aurora/executor/test_thermos_task_runner.py
+++ b/src/test/python/apache/aurora/executor/test_thermos_task_runner.py
@@ -32,6 +32,7 @@ from twitter.common.quantity import Amount, Time
 
 from apache.aurora.config.schema.base import MB, MesosTaskInstance, Process, Resources, Task
 from apache.aurora.executor.common.sandbox import DirectorySandbox
+from apache.aurora.executor.common.status_checker import StatusResult
 from apache.aurora.executor.http_lifecycle import HttpLifecycleManager
 from apache.aurora.executor.thermos_task_runner import ThermosTaskRunner
 from apache.thermos.common.statuses import (
@@ -227,33 +228,95 @@ class TestThermosTaskRunnerIntegration(object):
       assert task_runner.status.status == mesos_pb2.TASK_KILLED
 
   @patch('apache.aurora.executor.http_lifecycle.HttpSignaler')
-  def test_integration_http_teardown(self, SignalerClass):
+  def test_integration_http_teardown_killed(self, SignalerClass):
+    """Ensure that the http teardown procedure closes correctly when abort kills the process."""
     signaler = SignalerClass.return_value
-    signaler.side_effect = lambda path, use_post_method: (path != '/quitquitquit', None)
+    signaler.side_effect = lambda path, use_post_method: (path == '/abortabortabort', None)
 
     clock = Mock(wraps=time)
 
+    class TerminalStateStatusRunner(ThermosTaskRunner):
+      """
+      Status is called each poll in the teardown procedure. We return kill after the 3rd poll
+      to mimic a task that exits early. We want to ensure the shutdown procedure doesn't wait
+      the full time if it doesn't need to.
+      """
+
+      TIMES_CALLED = 0
+
+      @property
+      def status(self):
+        if (self.TIMES_CALLED >= 3):
+          return StatusResult('Test task mock status', mesos_pb2.TASK_KILLED)
+        self.TIMES_CALLED += 1
+
     with self.yield_sleepy(
-        ThermosTaskRunner,
+        TerminalStateStatusRunner,
         portmap={'health': 3141},
         clock=clock,
-        sleep=1000,
+        sleep=0,
         exit_code=0) as task_runner:
 
-      class ImmediateHttpLifecycleManager(HttpLifecycleManager):
-        ESCALATION_WAIT = Amount(1, Time.MICROSECONDS)
+      graceful_shutdown_wait = Amount(1, Time.SECONDS)
+      shutdown_wait = Amount(5, Time.SECONDS)
+      http_task_runner = HttpLifecycleManager(
+          task_runner, 3141, [('/quitquitquit', graceful_shutdown_wait),
+          ('/abortabortabort', shutdown_wait)], clock=clock)
+      http_task_runner.start()
+      task_runner.forked.wait()
+      http_task_runner.stop()
+
+      http_teardown_poll_wait_call = call(HttpLifecycleManager.WAIT_POLL_INTERVAL.as_(Time.SECONDS))
+      assert clock.sleep.mock_calls.count(http_teardown_poll_wait_call) == 3  # Killed before 5
+      assert signaler.mock_calls == [
+        call('/quitquitquit', use_post_method=True),
+        call('/abortabortabort', use_post_method=True)]
+
+  @patch('apache.aurora.executor.http_lifecycle.HttpSignaler')
+  def test_integration_http_teardown_escalate(self, SignalerClass):
+    """Ensure that the http teardown process fully escalates when quit/abort both fail to kill."""
+    signaler = SignalerClass.return_value
+    signaler.side_effect = lambda path, use_post_method: (True, None)
+
+    clock = Mock(wraps=time)
+
+    class KillCalledTaskRunner(ThermosTaskRunner):
+      def __init__(self, *args, **kwargs):
+        self._killed_called = False
+        ThermosTaskRunner.__init__(self, *args, **kwargs)
+
+      def kill_called(self):
+        return self._killed_called
+
+      def kill(self):
+        self._killed_called = True
+
+      @property
+      def status(self):
+        return None
+
+    with self.yield_sleepy(
+        KillCalledTaskRunner,
+        portmap={'health': 3141},
+        clock=clock,
+        sleep=0,
+        exit_code=0) as task_runner:
 
-      http_task_runner = ImmediateHttpLifecycleManager(
-          task_runner, 3141, ['/quitquitquit', '/abortabortabort'], clock=clock)
+      graceful_shutdown_wait = Amount(1, Time.SECONDS)
+      shutdown_wait = Amount(5, Time.SECONDS)
+      http_task_runner = HttpLifecycleManager(
+          task_runner, 3141, [('/quitquitquit', graceful_shutdown_wait),
+          ('/abortabortabort', shutdown_wait)], clock=clock)
       http_task_runner.start()
       task_runner.forked.wait()
       http_task_runner.stop()
 
-      escalation_wait = call(ImmediateHttpLifecycleManager.ESCALATION_WAIT.as_(Time.SECONDS))
-      assert clock.sleep.mock_calls.count(escalation_wait) == 1
+      http_teardown_poll_wait_call = call(HttpLifecycleManager.WAIT_POLL_INTERVAL.as_(Time.SECONDS))
+      assert clock.sleep.mock_calls.count(http_teardown_poll_wait_call) == 6
       assert signaler.mock_calls == [
         call('/quitquitquit', use_post_method=True),
         call('/abortabortabort', use_post_method=True)]
+      assert task_runner.kill_called() == True
 
   def test_thermos_normal_exit_status(self):
     with self.exit_with_status(0, TaskState.SUCCESS) as task_runner: