Posted to issues@mesos.apache.org by "Andrew Schwartzmeyer (JIRA)" <ji...@apache.org> on 2018/01/31 18:48:00 UTC

[jira] [Commented] (MESOS-8519) Fix recovery of job object isolated tasks

    [ https://issues.apache.org/jira/browse/MESOS-8519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347374#comment-16347374 ] 

Andrew Schwartzmeyer commented on MESOS-8519:
---------------------------------------------

{noformat}
Failed to update resources for container cfd95aea-c4b0-49a9-b4f3-5e62f48201ef of executor 'notepad.240f3c31-06b7-11e8-8f77-02421c3bc93c' of framework eb32cef4-c503-4ab7-85d4-8d4577e6a3bf-0000, destroying container: Collect failed: Failed to update container 'cfd95aea-c4b0-49a9-b4f3-5e62f48201ef': os::open_job: Call to `OpenJobObject` failed for job: MESOS_JOB_3A64: The system cannot find the file specified.
{noformat}
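
For readers decoding the log above, here is a minimal sketch of how a job object name such as MESOS_JOB_3A64 could be derived. It assumes the suffix is the container process ID printed in uppercase hexadecimal; the message does not state this, and the helper name job_name_for_pid is hypothetical, not the actual Mesos/stout function.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: builds a job object name from a process ID.
// Assumption (not confirmed by this message): the suffix in
// "MESOS_JOB_3A64" is the PID formatted as uppercase hexadecimal.
std::string job_name_for_pid(std::uint32_t pid)
{
  char buffer[32];
  std::snprintf(
      buffer, sizeof(buffer), "MESOS_JOB_%X", static_cast<unsigned int>(pid));
  return std::string(buffer);
}
```

Under that assumption, PID 14948 (0x3A64) would map to the name in the failing `OpenJobObject` call, which is the name a recovering agent would have to look up again.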

> Fix recovery of job object isolated tasks
> -----------------------------------------
>
>                 Key: MESOS-8519
>                 URL: https://issues.apache.org/jira/browse/MESOS-8519
>             Project: Mesos
>          Issue Type: Choose from below ...
>          Components: agent
>         Environment: Windows 10 Client 16299.192
>            Reporter: Andrew Schwartzmeyer
>            Assignee: Andrew Schwartzmeyer
>            Priority: Major
>              Labels: windows
>
> While the chain starting at https://reviews.apache.org/r/65397/ fixes many of the bugs leading up to the enabling of agent recovery on Windows (and indeed, enables it fully for Docker tasks), it explicitly does not yet enable the recovery of tasks contained in a job object.
> This JIRA issue specifically covers the bug where the agent fails to find an existing task contained in a job object, because it cannot find the job object when it comes back up. The task still exists, and when first launched it is named appropriately; that name is checkpointed correctly and used by the recovering agent to find it again, but the lookup fails because the job object the task is in has "lost" its name.
> Inspecting it in Process Explorer, I verified that the container process is initially in the correctly named job object, but after the parent process (the initial Mesos agent) dies while the container is still running, Process Explorer reports "Access Denied" for the job object name.
> My hypothesis is that this is related to the kernel object namespace mechanism. Currently researching.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)