Posted to issues@mesos.apache.org by "Qian Zhang (Jira)" <ji...@apache.org> on 2020/06/10 01:29:00 UTC

[jira] [Commented] (MESOS-10139) Mesos agent host may become unresponsive when it is under low memory pressure

    [ https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129923#comment-17129923 ] 

Qian Zhang commented on MESOS-10139:
------------------------------------

When this issue happens, `top` shows a high `wa` (I/O wait) value, which appears to be caused by `kswapd0`:
{code:java}
top - 01:18:41 up  1:23,  4 users,  load average: 73.47, 38.72, 41.05
Tasks: 227 total,   3 running, 223 sleeping,   0 stopped,   1 zombie
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
  103 root      20   0       0      0      0 R 100.0   0.0   2:40.74 kswapd0
...

{code}
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` is trying to page out the executable code of some processes, and the OOM killer is not triggered at all.
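
As a rough illustration of the pattern described above (no swap, very high I/O wait, almost no available memory), the summary lines of `top` could be checked with a small script. This is only a sketch: the function names `parse_top_summary` and `looks_like_reclaim_thrash` and both threshold values are hypothetical choices for this example, not anything provided by Mesos or `top`, and the parser assumes the `MiB`-unit summary layout shown in the output above.

```python
import re

# Summary lines copied from the `top` output quoted in this comment.
SAMPLE = """\
%Cpu(s):  1.4 us,  3.0 sy,  0.0 ni, 48.7 id, 46.9 wa,  0.0 hi,  0.0 si,  0.0 st
MiB Mem :  31211.2 total,    208.8 free,  30836.6 used,    165.8 buff/cache
MiB Swap:      0.0 total,      0.0 free,      0.0 used.      1.4 avail Mem
"""


def parse_top_summary(text):
    """Extract I/O wait %, swap total (MiB) and available memory (MiB)."""
    wa = re.search(r"([\d.]+)\s+wa", text)
    swap = re.search(r"MiB Swap:\s*([\d.]+)\s+total", text)
    avail = re.search(r"([\d.]+)\s+avail Mem", text)
    if not (wa and swap and avail):
        raise ValueError("unrecognized top summary format")
    return {
        "iowait_pct": float(wa.group(1)),
        "swap_total_mib": float(swap.group(1)),
        "avail_mib": float(avail.group(1)),
    }


def looks_like_reclaim_thrash(stats, wa_threshold=30.0, avail_threshold_mib=100.0):
    """Heuristic only: both thresholds are assumptions made for this sketch."""
    return (
        stats["swap_total_mib"] == 0.0
        and stats["iowait_pct"] >= wa_threshold
        and stats["avail_mib"] <= avail_threshold_mib
    )


stats = parse_top_summary(SAMPLE)
print(stats, looks_like_reclaim_thrash(stats))
```

A check like this only flags the symptom; it does not confirm that `kswapd0` is the cause.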

> Mesos agent host may become unresponsive when it is under low memory pressure
> -----------------------------------------------------------------------------
>
>                 Key: MESOS-10139
>                 URL: https://issues.apache.org/jira/browse/MESOS-10139
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Qian Zhang
>            Priority: Major
>
> When a user launches a task that uses a large amount of memory on an agent host (e.g., a task that runs `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on an agent host which has 32GB of memory), the whole agent host becomes unresponsive (no commands can be executed anymore, but it is still pingable). A few minutes later the Mesos master marks this agent as unreachable and updates the state of all its tasks to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> ...{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)