You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Yan Xu (JIRA)" <ji...@apache.org> on 2017/09/27 01:04:02 UTC

[jira] [Commented] (MESOS-7964) Heavy-duty GC makes the agent unresponsive

    [ https://issues.apache.org/jira/browse/MESOS-7964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16181817#comment-16181817 ] 

Yan Xu commented on MESOS-7964:
-------------------------------

{noformat:title=master}
commit 06341309e61a5cee702ea3c7b6d3ef340ac95ad0
Author: Chun-Hung Hsiao <ch...@mesosphere.io>
Date:   Tue Sep 26 17:07:11 2017 -0700

    Prevent GC path removals from blocking other processes.
    
    This patch dispatches all path removals to a single executor instead of
    one `AsyncExecutor` per path such that heavy-duty removals won't occupy
    all worker threads and block other actors.
    
    Review: https://reviews.apache.org/r/62230/
{noformat}


{noformat:title=1.4.x}
commit 27b83565082720cbc9c93b3b892305b899af84b7
Author: Chun-Hung Hsiao <ch...@mesosphere.io>
Date:   Tue Sep 26 17:07:11 2017 -0700

    Prevent GC path removals from blocking other processes.
    
    This patch dispatches all path removals to a single executor instead of
    one `AsyncExecutor` per path such that heavy-duty removals won't occupy
    all worker threads and block other actors.
    
    Review: https://reviews.apache.org/r/62230/

commit bf82953f1ede7ddf182f9cad79a3248ef2630dc8
Author: Chun-Hung Hsiao <ch...@mesosphere.io>
Date:   Mon Sep 25 14:10:27 2017 -0700

    Added `process::Executor::execute()`.
    
    This patch adds a convenient interface to `process::Executor` to
    asynchronously execute arbitrary functions.
    
    Review: https://reviews.apache.org/r/62252/
{noformat}

> Heavy-duty GC makes the agent unresponsive
> ------------------------------------------
>
>                 Key: MESOS-7964
>                 URL: https://issues.apache.org/jira/browse/MESOS-7964
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent
>    Affects Versions: 1.4.0
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>             Fix For: 1.4.1, 1.5.0
>
>
> An agent is observed to performe heavy-duty GC every half an hour:
> {noformat}
> Sep 07 18:15:56 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 18:15:56.900282 16054 slave.cpp:5920] Current disk usage 93.61%. Max allowed age: 0ns
> Sep 07 18:15:56 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 18:15:56.900476 16054 gc.cpp:218] Pruning directories with remaining removal time 1.99022105972148days
> ...
> Sep 07 18:22:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 18:22:08.173645 16050 gc.cpp:178] Deleted '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__f33065c9-eb42-44a7-9013-25bafc306bd5'
> ...
> Sep 07 18:41:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 18:41:08.195329 16051 slave.cpp:5920] Current disk usage 90.85%. Max allowed age: 0ns
> Sep 07 18:41:08 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 18:41:08.195503 16051 gc.cpp:218] Pruning directories with remaining removal time 1.99028708946667days
> ...
> Sep 07 18:49:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 18:49:01.253906 16049 gc.cpp:178] Deleted '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__014b451a-30de-41ee-b0b1-3733c790382c/runs/c5b922e8-eee0-4793-8637-7abbd7f8507e'
> ...
> Sep 07 19:08:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 19:08:01.291092 16048 slave.cpp:5920] Current disk usage 91.39%. Max allowed age: 0ns
> Sep 07 19:08:01 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 19:08:01.291285 16048 gc.cpp:218] Pruning directories with remaining removal time 1.99028598086815days
> ...
> Sep 07 19:14:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: W0907 19:14:50.737226 16050 gc.cpp:174] Failed to delete '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__4139bf2e-e33b-4743-8527-f8f50ac49280/runs/b1991e28-7ff8-476f-8122-1a483e431ff2': No such file or directory
> ...
> Sep 07 19:33:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 19:33:50.758191 16052 slave.cpp:5920] Current disk usage 91.39%. Max allowed age: 0ns
> Sep 07 19:33:50 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 19:33:50.758872 16047 gc.cpp:218] Pruning directories with remaining removal time 1.99028057238519days
> ...
> Sep 07 19:39:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 19:39:43.081485 16052 gc.cpp:178] Deleted '/var/lib/mesos/slave/meta/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0258/executors/node__d89dce1f-609b-4cf8-957a-5ba198be7828'
> ...
> Sep 07 19:59:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 19:59:43.150535 16048 slave.cpp:5920] Current disk usage 94.56%. Max allowed age: 0ns
> Sep 07 19:59:43 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 19:59:43.150869 16054 gc.cpp:218] Pruning directories with remaining removal time 1.98959316198222days
> ...
> Sep 07 20:06:16 int-infinityagentm42xl6-soak110.us-east-1a.mesosphere.com mesos-agent[16040]: I0907 20:06:16.251552 16051 gc.cpp:178] Deleted '/var/lib/mesos/slave/slaves/9d4b2f2b-a759-4458-bebf-7d3507a6f0ca-S20/frameworks/9750f9be-89d9-4e02-80d3-bdced653e9c3-0259/executors/data__45283e7d-9a5e-4d4b-9901-b7f1e096cd54/runs/5cfc5e3e-3975-41aa-846b-c125eb529fbe'
> {noformat}
> Each GC activity took 5+ minutes. During the period, the agent became unresponsive, the health check timed out, and no endpoint responded as well. When a disk-usage GC is trigged, around 300 pruning actors would be generated (https://github.com/apache/mesos/blob/master/src/slave/gc.cpp#L229). My hypothesis is that these actors would used all of the worker threads, and some of them took a long time to finish (possibly due to many files to delete, or too many fs operations at once, etc).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)