Posted to issues@mesos.apache.org by "Qian Zhang (Jira)" <ji...@apache.org> on 2020/06/10 01:29:00 UTC
[jira] [Commented] (MESOS-10139) Mesos agent host may become
unresponsive when it is under low memory pressure
[ https://issues.apache.org/jira/browse/MESOS-10139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17129923#comment-17129923 ]
Qian Zhang commented on MESOS-10139:
------------------------------------
When this issue happens, the `top` command shows a high `wa` (I/O wait) value, which appears to be caused by `kswapd0`:
{code:java}
top - 01:18:41 up 1:23, 4 users, load average: 73.47, 38.72, 41.05
Tasks: 227 total, 3 running, 223 sleeping, 0 stopped, 1 zombie
%Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st
MiB Mem : 31211.2 total, 208.8 free, 30836.6 used, 165.8 buff/cache
MiB Swap: 0.0 total, 0.0 free, 0.0 used. 1.4 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
103 root 20 0 0 0 0 R 100.0 0.0 2:40.74 kswapd0
...
{code}
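To make the `wa` reading easier to track over time, the `%Cpu(s)` line can be parsed programmatically. This is only a minimal sketch: the function name `parse_cpu_line` is hypothetical, and it assumes the summary line has exactly the format shown in the output above (other `top` versions may format it differently).

```python
import re


def parse_cpu_line(line):
    """Parse a top '%Cpu(s):' summary line into a dict such as {'wa': 46.9}.

    Assumes the 'value name,' layout shown in the output above.
    """
    if not line.startswith("%Cpu(s):"):
        raise ValueError("not a %Cpu(s) summary line")
    fields = {}
    # Each entry looks like '46.9 wa': a number followed by a two-letter name.
    for value, name in re.findall(r"([\d.]+)\s+([a-z]{2})", line.split(":", 1)[1]):
        fields[name] = float(value)
    return fields


# The summary line copied from the top output above.
cpu = parse_cpu_line(
    "%Cpu(s): 1.4 us, 3.0 sy, 0.0 ni, 48.7 id, 46.9 wa, 0.0 hi, 0.0 si, 0.0 st"
)
print("iowait:", cpu["wa"])
```

Here almost half of the CPU time is spent waiting on I/O while `us` and `sy` stay low, which matches a host that is stalled on page reclaim rather than doing useful work.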
Please note that swap is NOT enabled on the agent host, so it seems `kswapd0` is trying to page out the executable code of some processes, and the OOM killer is not triggered at all.
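The condition described here (no swap configured and almost no available memory) could be checked from `/proc/meminfo`. The following is a rough sketch, not something Mesos does: the helper names and the 1% threshold are assumptions made for illustration, and the sample values are converted by hand from the `top` output above (MiB to kB, rounded).

```python
def parse_meminfo(text):
    """Return {field: integer value} from /proc/meminfo-style text (values in kB)."""
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        parts = rest.split()
        if sep and parts:
            info[name.strip()] = int(parts[0])
    return info


def no_swap_and_low_memory(info, min_available_fraction=0.01):
    """Hypothetical heuristic: swap is absent and MemAvailable is below the threshold."""
    no_swap = info.get("SwapTotal", 0) == 0
    low = info["MemAvailable"] < min_available_fraction * info["MemTotal"]
    return no_swap and low


# Sample derived from the top output above: 31211.2 MiB total, 1.4 MiB available, no swap.
sample = """MemTotal: 31960269 kB
MemAvailable: 1434 kB
SwapTotal: 0 kB
"""

print(no_swap_and_low_memory(parse_meminfo(sample)))
```

On a live agent host the same check would read the real file, e.g. `parse_meminfo(open("/proc/meminfo").read())`, although on a host that is already unresponsive such a check may itself not get scheduled in time.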
> Mesos agent host may become unresponsive when it is under low memory pressure
> -----------------------------------------------------------------------------
>
> Key: MESOS-10139
> URL: https://issues.apache.org/jira/browse/MESOS-10139
> Project: Mesos
> Issue Type: Bug
> Reporter: Qian Zhang
> Priority: Major
>
> When a user launches a task that uses a large amount of memory on an agent host (e.g., a task running `stress --vm 1 --vm-bytes 29800M --vm-hang 0` on an agent host that has 32GB of memory), the whole agent host becomes unresponsive (no commands can be executed anymore, but the host is still pingable). A few minutes later the Mesos master marks this agent as unreachable and updates the state of all its tasks to `TASK_UNREACHABLE`.
> {code:java}
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.103382 15491 master.cpp:260] Scheduling transition of agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 to UNREACHABLE because of health check timeout
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.103612 15491 master.cpp:8592] Marking agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: health check timed out
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108093 15495 master.cpp:8635] Marked agent 89d2d679-fa08-49be-94c3-880ebb595212-S0 (172.16.3.236) unreachable: health check timed out
> …
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108419 15495 master.cpp:11149] Updating the state of task app10.instance-1f70be9f-9ef5-11ea-8981-9a93e42a6514._app.2 of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> May 26 02:13:31 ip-172-16-15-17.us-west-2.compute.internal mesos-master[15468]: I0526 02:13:31.108865 15495 master.cpp:11149] Updating the state of task app9.instance-954f91ad-9ef4-11ea-8981-9a93e42a6514._app.1 of framework 89d2d679-fa08-49be-94c3-880ebb595212-0000 (latest state: TASK_UNREACHABLE, status update state: TASK_UNREACHABLE)
> ...{code}
>