Posted to issues@mesos.apache.org by "Chun-Hung Hsiao (JIRA)" <ji...@apache.org> on 2018/11/21 19:32:00 UTC

[jira] [Commented] (MESOS-8400) Retry logic for CSI calls when plugin crashes

    [ https://issues.apache.org/jira/browse/MESOS-8400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16695160#comment-16695160 ] 

Chun-Hung Hsiao commented on MESOS-8400:
----------------------------------------

Thought dumps:

This can be tackled in either of two ways:
1. Add retry logic with exponential backoff to the {{StorageLocalResourceProviderProcess::call}} method.
2. Fail the resource provider and simply rely on MESOS-9223 to launch a new instance. Pros and cons:
* + SLRP no longer needs to manage its container daemon; it just launches the plugin and fails itself if {{Probe}} fails. In the future we may want an external orchestrator, e.g., Marathon, to manage the lifecycle of a local resource provider, to enable features like rolling upgrades. To achieve this, we can add a very simple relaunch policy to the default executor and make it responsible for relaunching the SLRP pod, which contains an SLRP task and a CSI task, upon failure.
* - A failure would lead to RP reregistration, and therefore to multiple {{UpdateSlaveMessage}}s.

In that future vision, option 1 might still be needed if the relaunch policy is applied on a per-task basis instead of a per-pod basis.
So we can go with option 1 for now, and do the remaining refactoring in the future.
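
For illustration, option 1 could look roughly like the sketch below. This is not Mesos code: {{BackoffPolicy}}, {{retryWithBackoff}}, the default values, and the injectable {{sleep}} hook are all hypothetical names chosen for this example. It is also a blocking loop, whereas the real {{call}} method is presumably future-based, so treat it only as a picture of the backoff schedule (wait, double the wait, cap it, give up after a fixed number of attempts).

```cpp
#include <algorithm>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical settings; none of these names or values come from Mesos.
struct BackoffPolicy
{
  std::chrono::milliseconds initial{10};  // Wait before the second attempt.
  std::chrono::milliseconds max{1000};    // Upper bound for any single wait.
  int maxAttempts{5};                     // Total attempts, including the first.
};

// Runs `attempt` until it returns true or `policy.maxAttempts` is reached.
// The wait between attempts starts at `policy.initial` and doubles each time,
// capped at `policy.max`. `sleep` is injectable so callers (and tests) can
// replace the real sleep. Returns true if some attempt succeeded.
bool retryWithBackoff(
    const std::function<bool()>& attempt,
    const BackoffPolicy& policy,
    const std::function<void(std::chrono::milliseconds)>& sleep =
      [](std::chrono::milliseconds d) { std::this_thread::sleep_for(d); })
{
  std::chrono::milliseconds delay = policy.initial;

  for (int i = 1; i <= policy.maxAttempts; ++i) {
    if (attempt()) {
      return true;
    }

    if (i == policy.maxAttempts) {
      break;  // Out of attempts; do not wait after the last failure.
    }

    sleep(delay);
    delay = std::min(delay * 2, policy.max);
  }

  return false;
}
```

A caller would wrap a single CSI call in the {{attempt}} callback and report success only once the plugin has answered; with the defaults above, the waits would be 10ms, 20ms, 40ms and 80ms before giving up. Adding jitter to the waits is a common refinement that is left out here for brevity.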

> Retry logic for CSI calls when plugin crashes
> ---------------------------------------------
>
>                 Key: MESOS-8400
>                 URL: https://issues.apache.org/jira/browse/MESOS-8400
>             Project: Mesos
>          Issue Type: Improvement
>            Reporter: Chun-Hung Hsiao
>            Assignee: Chun-Hung Hsiao
>            Priority: Critical
>              Labels: mesosphere, storage
>
> When a CSI plugin crashes, the container daemon in SLRP will reset its corresponding {{csi::Client}} service future. However, if there is a racy CSI call, the call may be issued before the future is reset, resulting in a failure for that CSI call. This could be avoided by introducing retry logic. The following lists two possibilities:
> 1. If a gRPC channel can continue to work after its underlying domain socket is unbound, removed, and bound again with the same filename (but a different fd), then we can consider implementing the retry logic in {{csi::Client}}. The downside is that the racy call would go to the old future and all subsequent calls would go to the new future set up by the container daemon.
> 2. If the gRPC channel is bound to the domain socket fd, then we need to implement the retry logic in SLRP.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)