Posted to issues@mesos.apache.org by "Andrei Budnik (JIRA)" <ji...@apache.org> on 2017/11/08 11:02:00 UTC

[jira] [Comment Edited] (MESOS-7506) Multiple tests leave orphan containers.

    [ https://issues.apache.org/jira/browse/MESOS-7506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16225515#comment-16225515 ] 

Andrei Budnik edited comment on MESOS-7506 at 11/8/17 11:01 AM:
----------------------------------------------------------------

*First cause*

Some tests (from {{SlaveTest}} and {{SlaveRecoveryTest}}) follow a pattern [like this|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/slave_tests.cpp#L393-L406]: the clock is advanced by {{executor_registration_timeout}}, and then the test waits in a loop until a task status update is sent. This loop executes while the container is being destroyed. Container destruction itself consists of multiple steps, one of which waits for [cgroups destruction|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/linux_launcher.cpp#L567]. That means we have a race between the container destruction process and the clock-advancing loop, leading to one of the following outcomes (a condensed sketch of the pattern follows the list):
# The container is completely destroyed before the advancing clock reaches a destruction timeout (e.g. {{cgroups::DESTROY_TIMEOUT}}); the test passes.
# A timeout is triggered by the advancing clock before container destruction completes. This [leaves orphaned|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/slave/containerizer/mesos/containerizer.cpp#L2367-L2380] containers, which are then detected by the [Slave destructor|https://github.com/apache/mesos/blob/ff01d0c44251e2ffaa2f4f47b33c790594d194d9/src/tests/cluster.cpp#L559-L584] in {{tests/cluster.cpp}}, so the test fails.
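
For reference, here is a condensed sketch of the racy pattern (paraphrased from the linked test, not verbatim; names like {{sched}}, {{driver}}, {{offers}}, {{task}} and {{flags}} come from the surrounding test fixture, which pauses the clock and mocks the scheduler):

{code:cpp}
#include <process/clock.hpp>   // process::Clock
#include <process/gmock.hpp>   // FutureArg
#include <process/gtest.hpp>   // AWAIT_READY
#include <process/reap.hpp>    // process::MAX_REAP_INTERVAL

using process::Clock;
using process::Future;

// The clock is paused, so time only moves when the test advances it.
Future<TaskStatus> status;
EXPECT_CALL(sched, statusUpdate(&driver, _))
  .WillOnce(FutureArg<1>(&status));

driver.launchTasks(offers->front().id(), {task});

// Trigger destruction of the container whose executor never registered.
Clock::advance(flags.executor_registration_timeout);

// Race: this loop keeps advancing the clock while the containerizer is
// still destroying the container. If the advances accumulate to
// cgroups::DESTROY_TIMEOUT (60 seconds) before the cgroup is actually
// removed, the destroy path times out and the container is orphaned.
while (status.isPending()) {
  Clock::advance(process::MAX_REAP_INTERVAL());
  Clock::settle();
}

AWAIT_READY(status);
{code}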

The issue is easily reproduced by advancing the clock by 60 seconds or more on each iteration of the loop that waits for a status update, as sketched below.
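
That is, the same loop as above with the per-iteration advance bumped to 60 seconds (the value of {{cgroups::DESTROY_TIMEOUT}}):

{code:cpp}
// A single large advance pushes the paused clock past
// cgroups::DESTROY_TIMEOUT while the container is still being destroyed,
// so outcome 2 (an orphaned container) happens reliably.
while (status.isPending()) {
  Clock::advance(Seconds(60));
  Clock::settle();
}
{code}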


> Multiple tests leave orphan containers.
> ---------------------------------------
>
>                 Key: MESOS-7506
>                 URL: https://issues.apache.org/jira/browse/MESOS-7506
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>         Environment: Ubuntu 16.04
> Fedora 23
> other Linux distros
>            Reporter: Alexander Rukletsov
>            Assignee: Andrei Budnik
>              Labels: containerizer, flaky-test, mesosphere
>         Attachments: KillMultipleTasks-badrun.txt, ResourceLimitation-badrun.txt, TaskWithFileURI-badrun.txt
>
>
> I've observed a number of flaky tests that leave orphan containers upon cleanup. A typical log looks like this:
> {noformat}
> ../../src/tests/cluster.cpp:580: Failure
> Value of: containers->empty()
>   Actual: false
> Expected: true
> Failed to destroy containers: { da3e8aa8-98e7-4e72-a8fd-5d0bae960014 }
> {noformat}
> All currently affected tests:
> {noformat}
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillTask/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.ResourceLimitation/0
> ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.KillMultipleTasks/0
> SlaveRecoveryTest/0.RecoverUnregisteredExecutor
> SlaveRecoveryTest/0.CleanupExecutor
> SlaveRecoveryTest/0.RecoverTerminatedExecutor
> SlaveTest.ShutdownUnregisteredExecutor
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)