Posted to reviews@aurora.apache.org by Bill Farner <wf...@apache.org> on 2017/11/02 02:35:54 UTC
Re: Review Request 63443: Terminate the executor on unhandled errors
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63443/#review189866
-----------------------------------------------------------
Ship it!
- Bill Farner
On Oct. 31, 2017, 9:17 a.m., Stephan Erb wrote:
>
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63443/
> -----------------------------------------------------------
>
> (Updated Oct. 31, 2017, 9:17 a.m.)
>
>
> Review request for Aurora, Bill Farner and Zameer Manji.
>
>
> Bugs: AURORA-1955
> https://issues.apache.org/jira/browse/AURORA-1955
>
>
> Repository: aurora
>
>
> Description
> -------
>
> This commit consists of two independent parts:
>
> a) ensure we interrupt the main thread when there are unhandled exceptions
> b) ensure the main thread of the executor can be interrupted
>
>
> Diffs
> -----
>
> src/main/python/apache/aurora/executor/bin/thermos_executor_main.py a191cf9eec844035c0f6aa5aed3731a06024c0df
> src/main/python/apache/aurora/tools/thermos.py de20c06cea5bbb45c7a6f5acfeee69289f8e6ad8
> src/main/python/apache/aurora/tools/thermos_observer.py 0318f990ac003c0b8925b7eb7359431cdee34f05
> src/main/python/apache/thermos/common/excepthook.py PRE-CREATION
> src/main/python/apache/thermos/runner/thermos_runner.py 847f51ed2c0e003f1325aa903bd0f0b760acb365
>
>
> Diff: https://reviews.apache.org/r/63443/diff/1/
>
>
> Testing
> -------
>
> This bug is hard to reproduce and to cover with an automated test. I therefore
> verified the patch manually by raising an exception just before the last
> statement of the `AuroraExecutor._shutdown` method. Without this patch, this
> left hanging executors on the host. With this patch, everything is terminated
> as expected.
>
> For details of the successful run, please see the executor logs below. Please
> note that the `apport.fileutils` ImportError is caused by Ubuntu's
> modifications to its Python installation. It is not critical.
>
> ```
> twitter.common.app debug: Initializing: apache.thermos.common.excepthook (Exception termination handler.)
> I1031 15:59:37.188621 25437 exec.cpp:162] Version: 1.2.0
> I1031 15:59:37.192201 25429 exec.cpp:237] Executor registered on agent 93259518-14f4-4956-a39c-aa615bff9a5e-S0
> Writing log files to disk in /var/lib/mesos/slaves/93259518-14f4-4956-a39c-aa615bff9a5e-S0/frameworks/7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000/executors/thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c/runs/54a5ed51-aa9b-476f-9f75-0b42bd6dfa8d
>
> ERROR] Unhandled error in <StatusManager(Thread-7 [TID=25450], started daemon 139968452134656)>. Interrupting main thread.
> Traceback (most recent call last):
> File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
> self.__real_run(*args, **kw)
> File "apache/aurora/executor/status_manager.py", line 62, in run
> File "apache/aurora/executor/aurora_executor.py", line 236, in _shutdown
> RuntimeError: Woops!
> Exception in thread Thread-7 [TID=25450]:
> Traceback (most recent call last):
> File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
> self.run()
> File "/root/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py", line 115, in identified
> return instancemethod(self, *args, **kwargs)
> File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 130, in _excepting_run
> sys.excepthook(*sys.exc_info())
> File "apache/thermos/common/excepthook.py", line 41, in teardown_handler
> self._former_hook()(exc_type, value, trace)
> File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
> from apport.fileutils import likely_packaged, get_recent_crashes
> ImportError: No module named apport.fileutils
>
> twitter.common.app debug: main exited with ^C
> twitter.common.app debug: Shutting application down.
> twitter.common.app debug: Running exit function for apache.thermos.common.excepthook (Exception termination handler.)
> twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
> twitter.common.app debug: Finishing up module teardown.
> twitter.common.app debug: Active thread: <_MainThread(MainThread, started 139968622749504)>
> twitter.common.app debug: Active thread (daemon): <TaskResourceMonitor(TaskResourceMonitor[www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c] [TID=25449], started daemon 139967951009536)>
> twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-13, started daemon 139968485705472)>
> twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-9, started daemon 139967934224128)>
> twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-12, started daemon 139967942616832)>
> twitter.common.app debug: Active thread (daemon): <_DummyThread(Dummy-3, started daemon 139968510883584)>
> twitter.common.app debug: Active thread (daemon): <WaitThread(Thread-11, started daemon 139967925831424)>
> twitter.common.app debug: Exiting cleanly.
> ```
>
> Corresponding agent logs, indicating that Mesos knows about the crash on teardown:
> ```
> I1031 15:59:54.692739 1956 slave.cpp:4769] Executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 exited with status 130
> I1031 15:59:54.692834 1956 slave.cpp:4869] Cleaning up executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 at executor(1)@192.168.33.7:48931
> I1031 15:59:54.692996 1956 slave.cpp:4957] Cleaning up framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000
> ```
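The exit status 130 in the agent log is consistent with the common shell convention of reporting termination by a signal as 128 plus the signal number, with SIGINT being signal 2. That this convention is what produces the value here is an assumption; the helper name `exit_status_for_signal` is hypothetical.

```python
import signal


def exit_status_for_signal(signum):
    """Conventional shell exit status for a process ended by a signal."""
    return 128 + int(signum)


if __name__ == "__main__":
    print(exit_status_for_signal(signal.SIGINT))
```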
>
>
> Thanks,
>
> Stephan Erb
>
>