Posted to issues@mesos.apache.org by "Qian Zhang (Jira)" <ji...@apache.org> on 2020/07/23 08:08:00 UTC

[jira] [Created] (MESOS-10163) Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls

Qian Zhang created MESOS-10163:
----------------------------------

             Summary: Implement a new component to launch CSI plugins as standalone containers and make CSI gRPC calls
                 Key: MESOS-10163
                 URL: https://issues.apache.org/jira/browse/MESOS-10163
             Project: Mesos
          Issue Type: Task
            Reporter: Qian Zhang
            Assignee: Greg Mann


*Background:*

Originally we wanted the `volume/csi` isolator to leverage the existing [service manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51] to launch CSI plugins as standalone containers. Currently the service manager needs to call the following agent HTTP APIs:
 # `GET_CONTAINERS` to get all standalone containers in its `recover` method.
 # `KILL_CONTAINER` and `WAIT_CONTAINER` to kill the outdated standalone containers in its `recover` method.
 # `LAUNCH_CONTAINER` via the existing [ContainerDaemon|https://github.com/apache/mesos/blob/1.10.0/src/slave/container_daemon.hpp#L41:L46] to launch a CSI plugin as a standalone container when its `getEndpoint` method is called.

The problem with this design is that the `volume/csi` isolator may need to clean up orphan containers during agent recovery, which is triggered by the containerizer (see [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/containerizer/mesos/containerizer.cpp#L1272:L1275] for details). To clean up an orphan container which is using a CSI volume, the `volume/csi` isolator needs to instantiate and recover the service manager and get the CSI plugin’s endpoint from it (i.e., the service manager’s `getEndpoint` method will be called by the `volume/csi` isolator during agent recovery). As mentioned above, the service manager’s `getEndpoint` may need to call `LAUNCH_CONTAINER` to launch the CSI plugin as a standalone container; since the agent is still in the recovering state, this agent HTTP call will be rejected by the agent. So we have to instantiate and recover the service manager *after agent recovery is done*, but the `volume/csi` isolator has no such information (i.e., no signal that agent recovery is done).

 

*Solution:*

We need to implement a new component (like `CSIVolumeManager` or a better name?) in the Mesos agent which is responsible for launching CSI plugins as standalone containers (via the existing [service manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/service_manager.hpp#L50:L51]) and making CSI gRPC calls (via the existing [volume manager|https://github.com/apache/mesos/blob/1.10.0/src/csi/volume_manager.hpp#L55:L56]).
 * We can instantiate this new component in the agent’s `main` function and pass it to both the containerizer and the agent (i.e., it will be a member of the `Slave` object); the containerizer will in turn pass it to the `volume/csi` isolator.
 * Since this new component relies on the service manager, which calls agent HTTP APIs, we need to pass the agent URL to it, like `process::http::URL(scheme, agentIP, agentPort, agentLibprocessId + "/api/v1")`; see [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L459:L471] for an example.
 * When the agent registers or reregisters with the master (`Slave::registered` and `Slave::reregistered`), we should call this new component’s `start` method (see [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1740:L1742] and [here|https://github.com/apache/mesos/blob/1.10.0/src/slave/slave.cpp#L1825:L1827] as examples), which will scan the directory given by `--csi_plugin_config_dir` and create a `service manager - volume manager` pair for each CSI plugin loaded from that directory.
 * The `volume/csi` isolator needs to call this new component’s `publishVolume` and `unpublishVolume` methods from its `prepare` and `cleanup` methods, respectively.
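
To make the wiring above concrete, here is a minimal standalone sketch that uses only the C++ standard library. Every name in it (`agentApiUrl`, `CsiComponentSketch`, `ServiceManagerStub`, `VolumeManagerStub`, `PluginPair`) is hypothetical and is not existing Mesos code. The plugin name list passed to `start` stands in for the result of scanning `--csi_plugin_config_dir`, and making `start` safe to call on both registration and reregistration is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-ins for the real service manager and volume manager.
struct ServiceManagerStub { std::string agentUrl; std::string plugin; };
struct VolumeManagerStub { std::string plugin; };
struct PluginPair { ServiceManagerStub service; VolumeManagerStub volume; };

// Builds "<scheme>://<ip>:<port>/<libprocessId>/api/v1" as a plain string.
// The real code would construct a `process::http::URL` instead.
std::string agentApiUrl(
    const std::string& scheme,
    const std::string& ip,
    std::uint16_t port,
    const std::string& libprocessId)
{
  return scheme + "://" + ip + ":" + std::to_string(port) + "/" +
         libprocessId + "/api/v1";
}

// Hypothetical sketch of the proposed component.
class CsiComponentSketch
{
public:
  explicit CsiComponentSketch(std::string agentUrl)
    : agentUrl_(std::move(agentUrl))
  {
    // The service manager cannot call agent HTTP APIs without a URL.
    assert(!agentUrl_.empty());
  }

  // `pluginNames` stands in for the plugins loaded from the config directory.
  // Plugins that already have a pair are skipped, so a second call on
  // reregistration does not recreate them (an assumption of this sketch).
  void start(const std::vector<std::string>& pluginNames)
  {
    for (const std::string& name : pluginNames) {
      if (pairs_.count(name) > 0) {
        continue;
      }
      pairs_.emplace(name, PluginPair{{agentUrl_, name}, {name}});
    }
  }

  bool hasPlugin(const std::string& name) const
  {
    return pairs_.count(name) > 0;
  }

  std::size_t pluginCount() const { return pairs_.size(); }

private:
  std::string agentUrl_;
  std::map<std::string, PluginPair> pairs_;
};
```

In this sketch, the agent’s `main` would build the URL once, construct the component, and hand the same instance to both the containerizer and the `Slave` object.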

When cleaning up orphan containers during agent recovery, the `volume/csi` isolator will just call this new component’s `unpublishVolume` method as usual, and it is the new component’s responsibility to make the actual CSI gRPC call only after agent recovery is done and the agent has registered with the master (e.g., when the new component’s `start` method is called).
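
One possible way to get this deferral is to queue calls that arrive before `start` and flush the queue once `start` runs. The sketch below is a standalone illustration under that assumption. `DeferredCsiCalls` and `pendingCount` are hypothetical names, and the `call` callback stands in for the real CSI gRPC call made through the volume manager. Real Mesos code would presumably return a libprocess future from `unpublishVolume` rather than take a callback.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the proposed component; not Mesos code.
class DeferredCsiCalls
{
public:
  typedef std::function<void(const std::string&)> Call;

  // Called by the isolator's `cleanup`. Before `start` the call is only
  // queued; after `start` it is made immediately.
  void unpublishVolume(const std::string& volumeId, Call call)
  {
    if (started_) {
      call(volumeId);
      return;
    }
    pending_.emplace_back(volumeId, std::move(call));
  }

  // Called from `Slave::registered` / `Slave::reregistered`. Later calls
  // are no-ops because the queue has already been flushed.
  void start()
  {
    if (started_) {
      return;
    }
    started_ = true;

    // Take ownership of the queue first, then make the queued calls in order.
    std::vector<std::pair<std::string, Call>> queued;
    queued.swap(pending_);
    for (std::pair<std::string, Call>& entry : queued) {
      entry.second(entry.first);
    }
    assert(pending_.empty());
  }

  std::size_t pendingCount() const { return pending_.size(); }

private:
  bool started_ = false;
  std::vector<std::pair<std::string, Call>> pending_;
};
```

With this shape the isolator does not need to know whether agent recovery has finished: it calls `unpublishVolume` the same way in both cases, and only the component decides when the underlying call is actually made.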



--
This message was sent by Atlassian Jira
(v8.3.4#803005)