Posted to issues@mesos.apache.org by "Andrew Schwartzmeyer (JIRA)" <ji...@apache.org> on 2018/02/01 18:02:00 UTC

[jira] [Commented] (MESOS-8519) Fix recovery of job object isolated tasks

    [ https://issues.apache.org/jira/browse/MESOS-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349016#comment-16349016 ] 

Andrew Schwartzmeyer commented on MESOS-8519:
---------------------------------------------

Got this working. Apparently, Windows does not care whether processes in a job object are still alive; once all handles to that job object are closed, it deletes the object. You'd think it would count each process in the job object as a reference so that it didn't inadvertently delete an in-use job object. The fix is to make the mesos-containerizer open and hold a handle to its job object, which keeps the object alive and findable again by the agent.
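
To illustrate the behavior described above, here is a minimal toy model in C++. It does not call the Windows API; `JobTable`, `Job`, `recoverable`, and the job name are hypothetical names invented for this sketch. It only assumes the semantics reported in this comment: a named job object stays findable while at least one handle to it is open, and the processes assigned to it do not count as references.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for a kernel job object (hypothetical, not a Windows type).
struct Job {
  int openHandles = 0;    // handles currently held by any process
  int liveProcesses = 0;  // processes assigned to the job
};

// Toy stand-in for the kernel's table of named job objects.
class JobTable {
 public:
  // Models creating a named job: the creator gets the first handle.
  void create(const std::string& name) { jobs_[name].openHandles = 1; }

  // Models opening a job by name: succeeds only while the name exists.
  bool open(const std::string& name) {
    auto it = jobs_.find(name);
    if (it == jobs_.end()) return false;
    ++it->second.openHandles;
    return true;
  }

  // Models closing a handle: the name goes away with the last handle,
  // even if liveProcesses is still nonzero.
  void close(const std::string& name) {
    auto it = jobs_.find(name);
    if (it == jobs_.end()) return;
    if (--it->second.openHandles == 0) jobs_.erase(it);
  }

  // Models assigning a process to the job; this does not add a handle.
  void assignProcess(const std::string& name) {
    auto it = jobs_.find(name);
    assert(it != jobs_.end());
    ++it->second.liveProcesses;
  }

 private:
  std::map<std::string, Job> jobs_;
};

// Simulates an agent restart and reports whether the new agent can find
// the job by its checkpointed name.
bool recoverable(bool containerizerHoldsHandle) {
  JobTable kernel;
  const std::string name = "mesos_job_example";  // hypothetical name

  kernel.create(name);         // agent creates the job and holds a handle
  kernel.assignProcess(name);  // container process runs inside the job

  if (containerizerHoldsHandle) {
    // The fix: the containerizer opens and keeps its own handle.
    bool opened = kernel.open(name);
    assert(opened);
    (void)opened;
  }

  kernel.close(name);        // agent dies, so its handle is closed
  return kernel.open(name);  // recovering agent looks the job up by name
}
```

In this model, `recoverable(false)` fails because the agent held the only handle, while `recoverable(true)` succeeds because the containerizer's handle outlives the agent.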

> Fix recovery of job object isolated tasks
> -----------------------------------------
>
>                 Key: MESOS-8519
>                 URL: https://issues.apache.org/jira/browse/MESOS-8519
>             Project: Mesos
>          Issue Type: Choose from below ...
>          Components: agent
>         Environment: Windows 10 Client 16299.192
>            Reporter: Andrew Schwartzmeyer
>            Assignee: Andrew Schwartzmeyer
>            Priority: Major
>              Labels: windows
>
> While the chain starting at https://reviews.apache.org/r/65397/ fixes many of the bugs leading up to the enabling of agent recovery on Windows (and indeed, enables it fully for Docker tasks), it explicitly does not yet enable the recovery of tasks contained in a job object.
> This JIRA issue specifically covers the bug where the agent fails to find an existing job object contained task, because it cannot find the job object when it's back up. The task still exists, and when first launched, is named appropriately, and that name is checkpointed correctly and used by the recovering agent to find it again, but it fails because the job object the task is in has "lost" its name.
> Inspecting it in Process Explorer, I verified that the container process is initially in the correctly named job object, but after the parent process (the initial Mesos agent) dies while the container is still running, Process Explorer reports "Access Denied" for the job object name.
> My hypothesis is that this is related to the kernel object namespace mechanism. Currently researching.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)