Posted to issues@mesos.apache.org by "Gregoire Seux (Jira)" <ji...@apache.org> on 2021/06/10 07:06:00 UTC

[jira] [Commented] (MESOS-8400) Handle plugin crashes gracefully in SLRP recovery.

    [ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17360610#comment-17360610 ] 

Gregoire Seux commented on MESOS-8400:
--------------------------------------

All related reviews seem to have been applied. Should we close this issue?

> Handle plugin crashes gracefully in SLRP recovery.
> --------------------------------------------------
>
>                 Key: MESOS-8400
>                 URL: https://issues.apache.org/jira/browse/MESOS-8400
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Chun-Hung Hsiao
>            Priority: Blocker
>              Labels: mesosphere, mesosphere-dss-post-ga, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its corresponding {{csi::Client}} service future. However, if a CSI call races with a plugin crash, the call may be issued before the service future is reset, resulting in a failure for that CSI call. MESOS-9517 partly addresses this for {{CreateVolume}} and {{DeleteVolume}} calls, but calls in the SLRP recovery path, e.g., {{ListVolume}}, {{GetCapacity}}, {{Probe}}, could make the SLRP unrecoverable.
> There are two main issues:
>  1. For {{Probe}}, we should investigate whether a few retry attempts are needed. If the retries still fail, we should recover from the failed attempts (e.g., kill the plugin container) and have the container daemon relaunch the plugin instead of failing the daemon.
>  2. For other calls in the recovery path, we should either retry the call or enable the local resource provider daemon to restart the SLRP after it fails.
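
The retry-then-relaunch idea in the description above can be sketched roughly as follows. This is only an illustration in Python, not Mesos code: `PluginCallError`, `call_with_recovery`, `FlakyPlugin`, and the `max_attempts` / `max_restarts` parameters are all hypothetical names, and the real SLRP works with C++ futures and a container daemon rather than blocking calls.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class PluginCallError(Exception):
    """Hypothetical error raised when a call to the plugin fails."""


def call_with_recovery(
    call: Callable[[], T],
    restart_plugin: Callable[[], None],
    max_attempts: int = 3,
    max_restarts: int = 1,
) -> T:
    """Retry `call`; once a round of attempts is exhausted, restart the plugin.

    Gives up and re-raises the last error after `max_restarts` restarts.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    restarts = 0
    while True:
        last_error = None
        for _ in range(max_attempts):
            try:
                return call()
            except PluginCallError as error:
                last_error = error  # remember the failure and retry
        if restarts >= max_restarts:
            raise last_error
        # Stand-in for "kill the plugin container so the daemon relaunches it".
        restart_plugin()
        restarts += 1


class FlakyPlugin:
    """Hypothetical plugin stub that fails a fixed number of calls."""

    def __init__(self, failures: int) -> None:
        self.failures = failures
        self.restarts = 0

    def probe(self) -> str:
        if self.failures > 0:
            self.failures -= 1
            raise PluginCallError("plugin unavailable")
        return "ready"

    def restart(self) -> None:
        self.restarts += 1
```

With this sketch, `call_with_recovery(plugin.probe, plugin.restart)` on a `FlakyPlugin(4)` would exhaust the first three attempts, restart the stub once, and succeed on the second attempt of the next round. Whether a bounded number of restarts is the right policy for the SLRP is an open design question that the issue itself leaves to investigation.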



--
This message was sent by Atlassian Jira
(v8.3.4#803005)