Posted to issues@aurora.apache.org by "Stephan Erb (JIRA)" <ji...@apache.org> on 2016/11/24 15:36:58 UTC

[jira] [Commented] (AURORA-1799) Thermos does not handle low memory scenarios gracefully

    [ https://issues.apache.org/jira/browse/AURORA-1799?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15693556#comment-15693556 ] 

Stephan Erb commented on AURORA-1799:
-------------------------------------

We can probably adopt the same idea as used here: https://reviews.apache.org/r/53519/
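
For illustration, a minimal sketch of the excepthook idea proposed in the issue below (this is not the code from that review; the install_lost_task_excepthook helper, its driver/task_id arguments, and the use of the old Mesos Python bindings are assumptions on my part):

{noformat}
import sys
import traceback

from mesos.interface import mesos_pb2


def install_lost_task_excepthook(driver, task_id):
  """Hypothetical helper: install a process-wide excepthook that reports a
  terminal status update on an unrecoverable error (e.g. IOError ENOMEM while
  reading checkpoints), instead of leaving the executor wedged and the kill
  un-acknowledged."""
  original_hook = sys.excepthook

  def handler(exc_type, exc_value, exc_tb):
    try:
      update = mesos_pb2.TaskStatus()
      update.task_id.value = task_id
      update.state = mesos_pb2.TASK_LOST
      # Attach the exception text so the reason is visible to the scheduler.
      update.message = ''.join(
          traceback.format_exception(exc_type, exc_value, exc_tb))[:1024]
      driver.sendStatusUpdate(update)
    finally:
      # Preserve the existing logging behaviour (e.g. twitter.common's hook).
      original_hook(exc_type, exc_value, exc_tb)

  sys.excepthook = handler
{noformat}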

> Thermos does not handle low memory scenarios gracefully
> -------------------------------------------------------
>
>                 Key: AURORA-1799
>                 URL: https://issues.apache.org/jira/browse/AURORA-1799
>             Project: Aurora
>          Issue Type: Bug
>            Reporter: Zameer Manji
>
> Background:
> In an environment where Aurora is used to launch Docker containers via the DockerContainerizer, it was observed that some tasks would not be killed.
> What happened is that a task was allocated a small amount of memory but demanded much more. This caused the Linux OOM killer to be invoked. Unlike with the MesosContainerizer, the agent doesn't tear down the container when the OOM killer is invoked. Instead, the OOM killer just kills a process in the container, and Thermos and Mesos are unaware (unless a process directly launched by Thermos is killed).
> I observed in the scheduler logs that the scheduler was trying to kill the container every reconciliation period, but it never died. The slave logs indicated that it received the killTask RPC and forwarded it to Thermos.
> The Thermos logs had several entries like the following, roughly every hour:
> {noformat}
> I1018 20:39:18.102894 6 executor_base.py:45] Executor [aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: Activating kill manager.
> I1018 20:39:18.103034 6 executor_base.py:45] Executor [aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: killTask returned.
> I1018 21:39:17.859935 6 executor_base.py:45] Executor [aaeac4c8-2b2f-4351-874b-a16bea1b36b0-S147]: killTask got task_id: value: "<task_id>"
> {noformat}
> However, the task was never killed. Looking at the stderr of Thermos, I saw the following entries:
> {noformat}
> Logged from file resource.py, line 155
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/logging/__init__.py", line 883, in emit
>     self.flush()
>   File "/usr/lib/python2.7/logging/__init__.py", line 843, in flush
>     self.stream.flush()
> IOError: [Errno 12] Cannot allocate memory
> {noformat}
> and 
> {noformat}
> Logged from file thermos_task_runner.py, line 171
> Traceback (most recent call last):
>   File "/root/.pex/install/twitter.common.exceptions-0.3.3-py2-none-any.whl.2a67b833b1517d179ef1c8dc6f2dac1023d51e3c/twitter.common.exceptions-0.3.3-py2-none-any.whl/twitter/common/exceptions/__init__.py", line 126, in _excepting_run
>   File "apache/aurora/executor/status_manager.py", line 47, in run
>   File "apache/aurora/executor/common/status_checker.py", line 97, in status
>   File "apache/aurora/executor/thermos_task_runner.py", line 358, in status
>   File "apache/aurora/executor/thermos_task_runner.py", line 186, in compute_status
>   File "apache/aurora/executor/thermos_task_runner.py", line 136, in task_state
>   File "apache/thermos/monitoring/monitor.py", line 118, in task_state
>   File "apache/thermos/monitoring/monitor.py", line 114, in get_state
>   File "apache/thermos/monitoring/monitor.py", line 77, in _apply_states
>   File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py", line 182, in try_read
>     class InvalidTypeException(Error): pass
>   File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py", line 168, in read
>     return RecordIO.Reader.do_read(self._fp, self._codec)
>   File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/recordio.py", line 135, in do_read
>     header = fp.read(RecordIO.RECORD_HEADER_SIZE)
>   File "/root/.pex/install/twitter.common.recordio-0.3.3-py2-none-any.whl.9f1e9394eca1bc33ad7d10ae3025301866824139/twitter.common.recordio-0.3.3-py2-none-any.whl/twitter/common/recordio/filelike.py", line 81, in read
>     return self._fp.read(length)
> IOError: [Errno 12] Cannot allocate memory
> {noformat}
> It seems that through the regular avenues of reading checkpoints or logging data, Thermos would get an IOError. Some part of twitter.common installs an excepthook to log the exception, but we don't seem to do anything else.
> I think we should probably install our own exception hook to send a {{LOST_TASK}} with the exception information instead of failing to kill the task.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)