Posted to issues@mesos.apache.org by "Benjamin Bannier (JIRA)" <ji...@apache.org> on 2019/01/09 20:36:00 UTC

[jira] [Commented] (MESOS-9223) Storage local provider does not sufficiently handle container launch failures or errors

    [ https://issues.apache.org/jira/browse/MESOS-9223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16738638#comment-16738638 ] 

Benjamin Bannier commented on MESOS-9223:
-----------------------------------------

Capturing results from offline sync with [~chhsia0], [~jieyu], and [~jdef]:

The agent should expose metrics reflecting resource-provider-related state changes (e.g., a metric for resource provider (re)subscriptions and for other relevant events the RP manager currently exposes). If this is implemented in the RP manager, we could reuse the code for ERPs, where masters would hold RP managers. We'll want metrics aggregated over all agents, but probably also per-RP metrics to simplify triage.
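
To make the aggregated versus per-RP split concrete, here is a minimal sketch of such counters. It is not the Mesos metrics API: the class `ProviderMetrics`, the event names, the provider IDs, and the `resource_providers/...` key layout are all assumptions made for illustration.

```python
from collections import Counter


class ProviderMetrics:
    """Hypothetical event counters; not the Mesos metrics API."""

    def __init__(self):
        self.total = Counter()         # event -> count over all providers
        self.per_provider = Counter()  # (provider_id, event) -> count

    def record(self, provider_id, event):
        # Count every event once in the aggregate and once for its provider.
        self.total[event] += 1
        self.per_provider[(provider_id, event)] += 1

    def snapshot(self):
        """Flatten both counters into metric-name -> value pairs."""
        flat = {f"resource_providers/{event}": n
                for event, n in self.total.items()}
        for (provider_id, event), n in self.per_provider.items():
            flat[f"resource_providers/{provider_id}/{event}"] = n
        return flat


metrics = ProviderMetrics()
metrics.record("rp-1", "subscribed")
metrics.record("rp-1", "disconnected")
metrics.record("rp-1", "subscribed")  # resubscription after the disconnect
metrics.record("rp-2", "subscribed")
print(metrics.snapshot())
```

The aggregate keys would serve dashboards, while the per-RP keys let an operator see which provider keeps resubscribing.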

We are not yet sure how RPs can surface the reasons for, e.g., disconnects, since no message channel from an RP up to the RP manager exists. For now this could be implemented via out-of-band transport of plugin logs (e.g., via journald).

> Storage local provider does not sufficiently handle container launch failures or errors
> ---------------------------------------------------------------------------------------
>
>                 Key: MESOS-9223
>                 URL: https://issues.apache.org/jira/browse/MESOS-9223
>             Project: Mesos
>          Issue Type: Improvement
>          Components: agent, storage
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Critical
>
> The storage local resource provider as currently implemented does not handle launch failures or task errors of its standalone containers well enough. If, e.g., an RP container fails to come up during node start, a warning is logged, but an operator still needs to detect the degraded functionality, manually check the state of containers with {{GET_CONTAINERS}}, and decide whether the agent needs restarting; I suspect they do not always have enough context for this decision. It would be better if the provider either enforced a restart by failing over the whole agent or retried the operation (optionally: up to some maximum number of retries).
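
The two remedies suggested in the description (bounded retries, then agent failover) could be combined roughly as sketched below. This is an illustration under stated assumptions, not Mesos code: `LaunchFailed`, `launch_with_retries`, `make_flaky`, and the `"running"`/`"failover"` result strings are hypothetical names.

```python
import time


class LaunchFailed(Exception):
    """Hypothetical error raised when a container fails to come up."""


def launch_with_retries(launch, max_retries=3, backoff_seconds=0.0,
                        sleep=time.sleep):
    """Call launch() until it succeeds or the retries are exhausted.

    Returns ("running", attempts) on success and ("failover", attempts)
    once the initial attempt plus max_retries retries have all failed.
    """
    attempts = 0
    while True:
        attempts += 1
        try:
            launch()
            return "running", attempts
        except LaunchFailed:
            if attempts > max_retries:
                # Out of retries: the caller should fail over the agent.
                return "failover", attempts
            sleep(backoff_seconds * attempts)  # linear backoff


def make_flaky(failures):
    """Build a launch function that fails the first `failures` calls."""
    state = {"left": failures}

    def launch():
        if state["left"] > 0:
            state["left"] -= 1
            raise LaunchFailed("container did not come up")

    return launch


print(launch_with_retries(make_flaky(2)))   # recovers on the third attempt
print(launch_with_retries(make_flaky(10)))  # gives up after four attempts
```

Returning an explicit outcome instead of only logging a warning would hand the failover decision to code rather than leaving it to an operator who lacks context.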



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)