Posted to issues@mesos.apache.org by "Benjamin Bannier (JIRA)" <ji...@apache.org> on 2019/01/31 15:58:00 UTC

[jira] [Commented] (MESOS-8839) Resource provider manager registrar recovery can race with agent on agent state leading to hard failures

    [ https://issues.apache.org/jira/browse/MESOS-8839?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16757399#comment-16757399 ] 

Benjamin Bannier commented on MESOS-8839:
-----------------------------------------

Reopening as we saw this again in our internal CI with something close to today's {{master}} {{HEAD}}.

> Resource provider manager registrar recovery can race with agent on agent state leading to hard failures
> --------------------------------------------------------------------------------------------------------
>
>                 Key: MESOS-8839
>                 URL: https://issues.apache.org/jira/browse/MESOS-8839
>             Project: Mesos
>          Issue Type: Bug
>          Components: agent, storage
>    Affects Versions: 1.6.0, 1.8.0
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>            Priority: Blocker
>         Attachments: log
>
>
> When running in the agent, the resource provider manager persists its state into the agent's state. The agent uses a LevelDB-backed state, which protects against concurrent access with a lock. The way we modelled LevelDB, a {{fetch}} while the lock is held leads to a failed {{Future}} result. When the resource provider manager encounters a failed recovery, it emits a fatal error, e.g.,
> {noformat}
> 11:48:26 F0425 11:48:26.650568 26819 manager.cpp:254] Failed to recover resource provider manager registry: Failed: IO error: lock /tmp/ParentChildContainerTypeAndContentType_AgentContainerAPITest_RecoverNestedContainer_10_HXbQCK/meta/slaves/6645885c-050a-4518-b896-a20b3e72a070-S0/resource_provider_registry/LOCK: already held by process
> 11:48:26 *** Check failure stack trace: ***{noformat}
> We should not fail hard for such recoverable failure scenarios.
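
One way to avoid failing hard would be to treat the "lock already held" failure as transient and retry the fetch a bounded number of times before giving up. The following is only a minimal, self-contained sketch of that idea and not the actual Mesos fix: {{FetchResult}}, {{isTransientLockError}} and {{fetchWithRetry}} are hypothetical names, the error is matched on the message text from the log above purely for illustration, and the real code works with {{Future}} rather than a blocking call.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <thread>

// Hypothetical result type standing in for a completed future:
// either a fetched value or an error message.
struct FetchResult {
  std::optional<std::string> value;
  std::string error;

  bool ok() const { return value.has_value(); }
};

// Assumption: a held lock can be recognized from the error text
// seen in the log ("... LOCK: already held by process").
inline bool isTransientLockError(const std::string& error) {
  return error.find("already held by process") != std::string::npos;
}

// Calls `fetch` up to `maxAttempts` times. Retries only on a transient
// lock error, doubling the delay between attempts. Any other failure,
// or the last lock failure, is returned to the caller instead of
// aborting the process.
inline FetchResult fetchWithRetry(
    const std::function<FetchResult()>& fetch,
    int maxAttempts,
    std::chrono::milliseconds initialDelay) {
  assert(maxAttempts > 0);

  std::chrono::milliseconds delay = initialDelay;
  FetchResult result;

  for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
    result = fetch();

    if (result.ok() || !isTransientLockError(result.error)) {
      return result;
    }

    if (attempt < maxAttempts) {
      std::this_thread::sleep_for(delay);
      delay *= 2;
    }
  }

  return result;
}
```

With this shape the caller decides what to do with a failure that persists after all attempts, e.g., logging it and failing recovery gracefully rather than terminating the agent.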


