You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Andrei Budnik (JIRA)" <ji...@apache.org> on 2018/01/23 19:21:00 UTC
[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.

    [ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16336272#comment-16336272 ] 

Andrei Budnik edited comment on MESOS-7506 at 1/23/18 7:20 PM:
---------------------------------------------------------------

While recovery is in progress for [the first slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733], calling [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738] leads to calling [slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431] to create a containerizer. An attempt to create a mesos c'zer, leads to calling [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124]. Finally, we get to the point, where we try to create a ["test" container|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476]. So, the recovery process for the first slave [might detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301] this "test" container as an orphaned container.

So, there is the race between recovery process for the first slave and an attempt to create a c'zer for the second agent.


was (Author: abudnik):
While recovery is in progress for [the first slave|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L733], calling [`StartSlave()`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/containerizer/linux_capabilities_isolator_tests.cpp#L738] leads to calling [slave::Containerizer::create()|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/tests/cluster.cpp#L431] to create a containerizer. An attempt to create a mesos c'zer, leads to calling [`cgroups::prepare`|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L124]. Finally, we get to the point, where we try to create a ["test" container|[https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/linux/cgroups.cpp#L476].] So, the recovery process for the first slave [might detect|https://github.com/apache/mesos/blob/ce0905fcb31a10ade0962a89235fa90b01edf01a/src/slave/containerizer/mesos/linux_launcher.cpp#L268-L301] this "test" container as an orphaned container.

So, there is the race between recovery process for the first slave and an attempt to create a c'zer for the second agent.

> Multiple tests leave orphan containers.
> ---------------------------------------
>
>                 Key: MESOS-7506
>                 URL: https://issues.apache.org/jira/browse/MESOS-7506
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>         Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>            Priority: Major
>              Labels: containerizer, flaky-test, mesosphere
>         Attachments: KillMultipleTasks-badrun.txt, ROOT_IsolatorFlags-badrun.txt, ROOT_IsolatorFlags-badrun2.txt, ROOT_IsolatorFlags-badrun3.txt, ReconcileTasksMissingFromSlave-badrun.txt, ResourceLimitation-badrun.txt, ResourceLimitation-badrun2.txt, RestartSlaveRequireExecutorAuthentication-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> SlaveTest.RestartSlaveRequireExecutorAuthentication // cannot reproduce any more
> ROOT_IsolatorFlags
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)