Posted to issues@mesos.apache.org by "Joris Van Remoortere (JIRA)" <ji...@apache.org> on 2015/09/24 01:41:04 UTC

[jira] [Comment Edited] (MESOS-3352) Problem Statement Summary for Systemd Cgroup Launcher

    [ https://issues.apache.org/jira/browse/MESOS-3352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14743635#comment-14743635 ] 

Joris Van Remoortere edited comment on MESOS-3352 at 9/23/15 11:40 PM:
-----------------------------------------------------------------------

To keep Systemd from migrating cgroup pids, we can use the {{delegate=true}} flag. It prevents Systemd from migrating the pids that are descendants of the process launched by a Systemd unit.

In order for this strategy to work, the {{delegate}} flag must be supported by the Systemd version. Support for this was introduced in Systemd v218; however, it has also been backported to v208 for RHEL7 and CentOS7 [here|http://centoserrata.nagater.net/item/CEBA-2015-0037-CentOS-7.i386.x86_64.html] with the package [systemd-208-20|https://rhn.redhat.com/errata/RHBA-2015-1155.html]. It is highly recommended to upgrade to this package if running those operating systems.
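
As a rough illustration of the version requirement, a check along these lines could gate the use of the flag. This is only a sketch: the function names are hypothetical, it assumes the first line of {{systemctl --version}} looks like {{systemd 219}}, and it cannot tell from the version number alone whether a v208 package carries the backport, so that case is reported as "maybe".

```python
import re
from typing import Optional

# Version in which upstream Systemd introduced the delegate flag (per the text above).
DELEGATE_MIN_VERSION = 218
# Version that may carry the RHEL7/CentOS7 backport; the number alone cannot confirm it.
BACKPORT_VERSION = 208


def parse_systemd_version(version_output: str) -> Optional[int]:
    """Extract the version number from text such as 'systemd 219\\n+PAM ...'."""
    lines = version_output.strip().splitlines()
    if not lines:
        return None
    match = re.match(r"systemd\s+(\d+)", lines[0].strip())
    return int(match.group(1)) if match else None


def delegate_support(version_output: str) -> str:
    """Return 'yes', 'maybe' (possible backport), 'no', or 'unknown'."""
    version = parse_systemd_version(version_output)
    if version is None:
        return "unknown"
    if version >= DELEGATE_MIN_VERSION:
        return "yes"
    if version == BACKPORT_VERSION:
        return "maybe"
    return "no"
```

On a "maybe" result the operator would still need to confirm the installed package revision by other means, for example through the package manager.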

Once the {{delegate=true}} flag has been set, the cgroups that are manually manipulated by the agent will no longer be migrated *during the lifetime of the agent*.

This still leaves the problem of tasks being migrated _after the agent has stopped running_ (voluntarily or not). In order to deal with the problem we propose the following solution:

If an agent is running on a Systemd-initialized machine, the agent will create a Systemd slice with {{delegate=true}} and a lifetime that is independent of the agent. The Linux launcher (used when cgroups isolators are enabled) will then assign the cgroup of every executor it launches to this separate slice. As a result, when the agent unit is terminated, the separate slice continues to delegate the cgroups, which prevents Systemd from migrating the pids. A side benefit is that we can keep the {{KillMode=control-group}} flag on the agent and terminate all agent-specific services, such as the {{fetcher}}, without terminating the tasks. This provides for a nice clean-up.
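
To make the proposal more concrete, the sketch below renders the text of such a slice unit and derives the cgroup directory an executor would be placed in. Everything named here is an assumption made for illustration: the slice name {{mesos_executors.slice}}, the helper names, and the directory layout are hypothetical and not taken from the Mesos code. Systemd unit files spell the directive {{Delegate=true}}; whether a particular Systemd version accepts it on a slice unit is not checked here.

```python
# Hypothetical slice name, chosen only for this sketch.
SLICE_NAME = "mesos_executors.slice"


def render_slice_unit(description: str) -> str:
    """Build the text of a slice unit that asks Systemd to delegate its cgroups."""
    return "\n".join([
        "[Unit]",
        f"Description={description}",
        "",
        "[Slice]",
        "Delegate=true",
        "",
    ])


def executor_cgroup(hierarchy_mount: str, container_id: str) -> str:
    """Cgroup directory for an executor, placed under the separate slice
    rather than under the agent unit's own cgroup (assumed layout)."""
    return "/".join([hierarchy_mount.rstrip("/"), SLICE_NAME, container_id])
```

Because the executor's cgroup lives under the slice and not under the agent's unit, stopping the agent with {{KillMode=control-group}} would only affect processes in the agent's own cgroup.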

This solution still requires the agent unit itself to be launched with the {{delegate=true}} flag, so that there is no race while the pids transition from the agent to the separate slice.

The agent will be responsible for verifying the slice is still available upon recovery, and warning the operator if it notices that the tasks it is recovering are no longer associated with this separate slice, as this can cause *silent* loss of isolation of existing tasks.
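
The recovery check could look roughly like the following sketch. It assumes the agent can read {{/proc/<pid>/cgroup}} for each recovered task, where every line has the form {{hierarchy-id:controllers:path}}, and that membership in the slice is visible in the {{name=systemd}} hierarchy. The function names and the slice name used in the usage note are hypothetical.

```python
from typing import Dict, List


def parse_proc_cgroup(text: str) -> Dict[str, str]:
    """Map each controller list (e.g. 'name=systemd') to its cgroup path."""
    result: Dict[str, str] = {}
    for line in text.splitlines():
        # Split at most twice so that colons inside the path are preserved.
        parts = line.strip().split(":", 2)
        if len(parts) == 3:
            result[parts[1]] = parts[2]
    return result


def in_slice(proc_cgroup_text: str, slice_name: str,
             hierarchy: str = "name=systemd") -> bool:
    """True if the process's cgroup path in `hierarchy` passes through `slice_name`."""
    path = parse_proc_cgroup(proc_cgroup_text).get(hierarchy)
    if path is None:
        return False
    return slice_name in path.strip("/").split("/")


def tasks_outside_slice(cgroup_text_by_pid: Dict[int, str],
                        slice_name: str) -> List[int]:
    """Pids whose cgroup is no longer under the slice; candidates for a warning."""
    return sorted(pid for pid, text in cgroup_text_by_pid.items()
                  if not in_slice(text, slice_name))
```

A non-empty result from {{tasks_outside_slice(..., "mesos_executors.slice")}} would be the trigger for warning the operator about a possible loss of isolation.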


was (Author: jvanremoortere):
In order to avoid the migration of cgroup pids by Systemd we can use the {{delegate=true}} flag. This guards Systemd from migrating the pids that are descendants of the process launched by a Systemd unit.

In order for this strategy to work, the {{delegate}} flag must be supported by the Systemd version. Support for this was introduced in Systemd v218; however, it has also been backported to v208 for RHEL7 and CentOS7 [here|http://centoserrata.nagater.net/item/CEBA-2015-0037-CentOS-7.i386.x86_64.html] with the package [systemd-208-20|https://rhn.redhat.com/errata/RHBA-2015-1155.html]. It is highly recommended to upgrade to this package if running those operating systems.

Once the {{delegate=true}} flag has been set, the cgroups that are manually manipulated by the agent will no longer be migrated *during the lifetime of the agent*.

This still leaves the problem of tasks being migrated _after the agent has stopped running_ (voluntarily or not). In order to deal with the problem we propose the following solution:

If an agent is running on a Systemd initialized machine, then the agent will create a Systemd slice with a life-time that is independent of the agent and {{delegate=true}}. The linux launcher (used when cgroups isolators are enabled) will then assign the cgroup name for any executor that is launched to this separate slice. The consequence of this is that when the agent unit is terminated, the separate slice will continue to delegate the cgroups preventing Systemd from migrating the pids. A side benefit of this is that we can maintain the {{KillMode=cgroup}} flag on the agent and terminate all agent specific services such as the {{fetcher}} without terminating the tasks. This provides for a nice clean-up.

This solution will still require that the agent unit be launched with the {{delegate=true}} flag such that there is no race during the transition of the pids from the agent to the separate slice.

The agent will be responsible for verifying the slice is still available upon recovery, and warning the operator if it notices that the tasks it is recovering are no longer associated with this separate slice, as this can cause *silent* loss of isolation of existing tasks.

> Problem Statement Summary for Systemd Cgroup Launcher
> -----------------------------------------------------
>
>                 Key: MESOS-3352
>                 URL: https://issues.apache.org/jira/browse/MESOS-3352
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Joris Van Remoortere
>            Assignee: Joris Van Remoortere
>              Labels: design, mesosphere, systemd
>
> There have been many reports of cgroups-related issues when running Mesos on Systemd.
> Many of these issues are rooted in the manual manipulation of the cgroups filesystem by Mesos.
> This task is to describe the problem in a 1-page summary, and elaborate on the suggested two-part solution:
> 1. Using the {{delegate=true}} flag for the slave
> 2. Implementing a Systemd launcher to run executors with tighter Systemd integration.


