Posted to reviews@aurora.apache.org by Bill Farner <wf...@apache.org> on 2017/11/02 02:35:54 UTC

Re: Review Request 63443: Terminate the executor on unhandled errors

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63443/#review189866
-----------------------------------------------------------


Ship it!

- Bill Farner


On Oct. 31, 2017, 9:17 a.m., Stephan Erb wrote:
> 
> -----------------------------------------------------------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63443/
> -----------------------------------------------------------
> 
> (Updated Oct. 31, 2017, 9:17 a.m.)
> 
> 
> Review request for Aurora, Bill Farner and Zameer Manji.
> 
> 
> Bugs: AURORA-1955
>     https://issues.apache.org/jira/browse/AURORA-1955
> 
> 
> Repository: aurora
> 
> 
> Description
> -------
> 
> This commit consists of two independent parts:
> 
> a) ensure we interrupt the main thread when there are unhandled exceptions
> b) ensure the main thread of the executor can be interrupted
> 
> 
> Diffs
> -----
> 
>   src/main/python/apache/aurora/executor/bin/thermos_executor_main.py a191cf9eec844035c0f6aa5aed3731a06024c0df 
>   src/main/python/apache/aurora/tools/thermos.py de20c06cea5bbb45c7a6f5acfeee69289f8e6ad8 
>   src/main/python/apache/aurora/tools/thermos_observer.py 0318f990ac003c0b8925b7eb7359431cdee34f05 
>   src/main/python/apache/thermos/common/excepthook.py PRE-CREATION 
>   src/main/python/apache/thermos/runner/thermos_runner.py 847f51ed2c0e003f1325aa903bd0f0b760acb365 
> 
> 
> Diff: https://reviews.apache.org/r/63443/diff/1/
> 
> 
> Testing
> -------
> 
> This bug is hard to reproduce and test. I therefore opted for manual
> verification and raised an exception shortly before the last statement
> of the `AuroraExecutor._shutdown` method. Without this patch, this resulted in
> hanging executors on the host. With this patch, everything is terminated as
> expected.
> 
> For details of the successful run, please see the executor logs below. Please
> note that the `apport.fileutils` import error is due to Ubuntu messing with its
> Python installation. This is not critical.
> 
> ```
> twitter.common.app debug: Initializing: apache.thermos.common.excepthook (Exception termination handler.)
> I1031 15:59:37.188621 25437 exec.cpp:162] Version: 1.2.0
> I1031 15:59:37.192201 25429 exec.cpp:237] Executor registered on agent 93259518-14f4-4956-a39c-aa615bff9a5e-S0
> Writing log files to disk in /var/lib/mesos/slaves/93259518-14f4-4956-a39c-aa615bff9a5e-S0/frameworks/7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000/executors/thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c/runs/54a5ed51-aa9b-476f-9f75-0b42bd6dfa8d
> 
> ERROR] Unhandled error in <StatusManager(Thread-7 [TID=25450], started daemon 139968452134656)>. Interrupting main thread.
> Traceback (most recent call last):
>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>     self.__real_run(*args, **kw)
>   File "apache/aurora/executor/status_manager.py", line 62, in run
>   File "apache/aurora/executor/aurora_executor.py", line 236, in _shutdown
> RuntimeError: Woops!
> Exception in thread Thread-7 [TID=25450]:
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
>     self.run()
>   File "/root/.pex/install/twitter.common.decorators-0.3.7-py2-none-any.whl.b23f2874a4392741fca582d9e0528c08e0335c68/twitter.common.decorators-0.3.7-py2-none-any.whl/twitter/common/decorators/threads.py", line 115, in identified
>     return instancemethod(self, *args, **kwargs)
>   File "/root/.pex/install/twitter.common.exceptions-0.3.7-py2-none-any.whl.f6376bcca9bfda5eba4396de2676af5dfe36237d/twitter.common.exceptions-0.3.7-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 130, in _excepting_run
>     sys.excepthook(*sys.exc_info())
>   File "apache/thermos/common/excepthook.py", line 41, in teardown_handler
>     self._former_hook()(exc_type, value, trace)
>   File "/usr/lib/python2.7/dist-packages/apport_python_hook.py", line 63, in apport_excepthook
>     from apport.fileutils import likely_packaged, get_recent_crashes
> ImportError: No module named apport.fileutils
> 
> twitter.common.app debug: main exited with ^C
> twitter.common.app debug: Shutting application down.
> twitter.common.app debug: Running exit function for apache.thermos.common.excepthook (Exception termination handler.)
> twitter.common.app debug: Running exit function for twitter.common.log (Logging subsystem.)
> twitter.common.app debug: Finishing up module teardown.
> twitter.common.app debug:   Active thread: <_MainThread(MainThread, started 139968622749504)>
> twitter.common.app debug:   Active thread (daemon): <TaskResourceMonitor(TaskResourceMonitor[www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c] [TID=25449], started daemon 139967951009536)>
> twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-13, started daemon 139968485705472)>
> twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-9, started daemon 139967934224128)>
> twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-12, started daemon 139967942616832)>
> twitter.common.app debug:   Active thread (daemon): <_DummyThread(Dummy-3, started daemon 139968510883584)>
> twitter.common.app debug:   Active thread (daemon): <WaitThread(Thread-11, started daemon 139967925831424)>
> twitter.common.app debug: Exiting cleanly.
> ```
> 
> Corresponding agent logs, indicating that Mesos knows about the crash on teardown:
> ```
> I1031 15:59:54.692739  1956 slave.cpp:4769] Executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 exited with status 130
> I1031 15:59:54.692834  1956 slave.cpp:4869] Cleaning up executor 'thermos-www-data-prod-hello-0-d8d50c2f-e79b-467d-8c65-cca3cb44cf9c' of framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000 at executor(1)@192.168.33.7:48931
> I1031 15:59:54.692996  1956 slave.cpp:4957] Cleaning up framework 7b202c2e-8796-4f27-afeb-8b76ba4b3037-0000
> ```
> 
> 
> Thanks,
> 
> Stephan Erb
> 
>
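
For readers who want to see the idea behind part a) of the description in
isolation, here is a minimal sketch. It is not the code from this review: the
patch targets Python 2.7 and the twitter.common libraries, whereas this sketch
assumes Python 3.8+ and uses only the standard library (`threading.excepthook`
and `_thread.interrupt_main`). The names `install_interrupting_excepthook`,
`failing_worker` and `run_until_interrupted` are hypothetical.

```python
import _thread
import signal
import threading
import time


def install_interrupting_excepthook():
    """Install a thread excepthook that interrupts the main thread (hypothetical helper)."""
    previous_hook = threading.excepthook

    def hook(args):
        try:
            # Let the previous hook report the traceback first.
            previous_hook(args)
        finally:
            # Interrupt even if the previous hook itself fails. This is a
            # choice made for this sketch, not a claim about the patch.
            _thread.interrupt_main()

    threading.excepthook = hook
    return previous_hook


def failing_worker():
    # Stand-in for a background thread that hits an unhandled error.
    raise RuntimeError("Woops!")


def run_until_interrupted(poll_interval=0.05, max_wait=5.0):
    """Start a failing worker and report whether the main thread got interrupted."""
    previous_hook = install_interrupting_excepthook()
    # Make sure SIGINT maps to KeyboardInterrupt, whatever the environment set up.
    previous_sigint = signal.signal(signal.SIGINT, signal.default_int_handler)
    try:
        threading.Thread(target=failing_worker, daemon=True).start()
        deadline = time.monotonic() + max_wait
        # Part b) in miniature: wait in short slices so the main thread gets
        # regular chances to notice the simulated interrupt.
        while time.monotonic() < deadline:
            time.sleep(poll_interval)
        return False
    except KeyboardInterrupt:
        return True
    finally:
        threading.excepthook = previous_hook
        if previous_sigint is not None:
            signal.signal(signal.SIGINT, previous_sigint)
```

Calling `run_until_interrupted()` from the main thread should return `True`
once the worker's error has been reported. The exit status 130 in the agent log
above is consistent with the usual convention of 128 plus the signal number of
SIGINT (2), which fits the "main exited with ^C" line in the executor log.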