Posted to issues@mesos.apache.org by "Jie Yu (JIRA)" <ji...@apache.org> on 2016/10/05 23:57:21 UTC
[jira] [Commented] (MESOS-6302) Agent recovery can fail after nested containers are launched

    [ https://issues.apache.org/jira/browse/MESOS-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15550349#comment-15550349 ] 

Jie Yu commented on MESOS-6302:
-------------------------------

commit 8bab70c691a3efeda301f72956de4f80b258464e
Author: Gilbert Song <songzihao1990@gmail.com>
Date:   Mon Oct 3 15:28:39 2016 -0700

Fixed provisioner recovery when nested containers exist.

Previously, during provisioner recovery, we first collected all
container IDs from the provisioner directory, and then found all
rootfses under each container's 'backends' directory. We assumed
that if a 'container_id' directory exists in the provisioner
directory, it must contain a 'backends' directory underneath,
which contains at least one rootfs for this container.

However, this is no longer true since we added support for nested
containers, because we allow a nested container to be specified
with a container image while its parent has no image specified.
In this case, when the provisioner recovers, it still finds the
parent container's ID in the provisioner directory even though no
'backends' directory exists there, since a nested container's
backend information is stored under its parent container's
directory.

As a result, we should skip recovering the 'Info' struct in the
provisioner for a parent container that never provisioned an
image.
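
The behavior described above can be illustrated with a minimal sketch.
It is written in Python for brevity and is not the actual Mesos change
(the provisioner itself is C++). The function names, the nested
'containers' subdirectory, and the 'rootfses' subdirectory are
assumptions made for illustration; only the 'backends' directory name
comes from the commit message.

```python
import os
import tempfile


def list_rootfses(container_dir):
    """Return {backend: [rootfs_id, ...]} for one container directory.

    Returns None when there is no 'backends' directory, meaning this
    container never provisioned an image itself.
    """
    backends_dir = os.path.join(container_dir, "backends")
    if not os.path.isdir(backends_dir):
        return None

    rootfses = {}
    for backend in sorted(os.listdir(backends_dir)):
        # Assumed layout: backends/<backend>/rootfses/<rootfs_id>
        rootfses_dir = os.path.join(backends_dir, backend, "rootfses")
        if os.path.isdir(rootfses_dir):
            rootfses[backend] = sorted(os.listdir(rootfses_dir))
        else:
            rootfses[backend] = []
    return rootfses


def recover(containers_dir, prefix=()):
    """Walk the provisioner directory and collect per-container info.

    A container without 'backends' is skipped instead of failing the
    whole recovery, but its nested containers are still visited.
    """
    infos = {}
    for container_id in sorted(os.listdir(containers_dir)):
        container_dir = os.path.join(containers_dir, container_id)
        path = prefix + (container_id,)

        rootfses = list_rootfses(container_dir)
        if rootfses is not None:
            infos[path] = rootfses

        # Assumed layout: nested containers live under <parent>/containers
        nested_dir = os.path.join(container_dir, "containers")
        if os.path.isdir(nested_dir):
            infos.update(recover(nested_dir, path))
    return infos


def demo():
    """Build a parent without an image and a nested child with one."""
    with tempfile.TemporaryDirectory() as root:
        containers = os.path.join(root, "containers")
        os.makedirs(os.path.join(
            containers, "parent", "containers", "child",
            "backends", "copy", "rootfses", "rootfs-1"))
        return recover(containers)
```

In this sketch, calling demo() recovers an entry only for the nested
container; the parent is skipped because it has no 'backends'
directory, rather than causing recovery to fail.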

Review: https://reviews.apache.org/r/52480/

> Agent recovery can fail after nested containers are launched
> ------------------------------------------------------------
>
>                 Key: MESOS-6302
>                 URL: https://issues.apache.org/jira/browse/MESOS-6302
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>            Reporter: Greg Mann
>            Assignee: Gilbert Song
>            Priority: Blocker
>              Labels: mesosphere
>             Fix For: 1.1.0
>
>         Attachments: read_write_app.json
>
>
> After launching a nested container which used a Docker image, I restarted the agent which ran that task group and saw the following in the agent logs during recovery:
> {code}
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.813596  4640 status_update_manager.cpp:203] Recovering status update manager
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.813622  4640 status_update_manager.cpp:211] Recovering executor 'instance-testvolume.02c26bce-8778-11e6-9ff3-7a3cd7c1568e' of framework 118ca38d-daee-4b2d-b584-b5581738a3dd-0000
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.814249  4639 docker.cpp:745] Recovering Docker containers
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: I1001 01:45:10.815294  4642 containerizer.cpp:581] Recovering containerizer
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Failed to perform recovery: Collect failed: Unable to list rootfses belonged to container a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53: Unable to list the container directory: Failed to opendir '/var/lib/mesos/slave/provisioner/containers/a7d576da-fd0f-4dc1-bd5a-6d0a93ac8a53/backends': No such file or directory
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: To remedy this do as follows:
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Step 1: rm -f /var/lib/mesos/slave/meta/slaves/latest
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]:         This ensures agent doesn't recover old live executors.
> Oct 01 01:45:10 ip-10-0-3-133.us-west-2.compute.internal mesos-agent[4629]: Step 2: Restart the agent.
> {code}
> and the agent continues to restart in this fashion. Attached is the Marathon app definition that I used to launch the task group.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)