Posted to issues@mesos.apache.org by "Meng Zhu (JIRA)" <ji...@apache.org> on 2018/10/09 19:07:00 UTC
[jira] [Assigned] (MESOS-9108) Test `ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI` is flaky.

     [ https://issues.apache.org/jira/browse/MESOS-9108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Meng Zhu reassigned MESOS-9108:
-------------------------------

    Assignee:     (was: Meng Zhu)

> Test `ROOT_DOCKER_DockerAndMesosContainerizers/DefaultExecutorTest.TaskWithFileURI` is flaky.
> ---------------------------------------------------------------------------------------------
>
>                 Key: MESOS-9108
>                 URL: https://issues.apache.org/jira/browse/MESOS-9108
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Meng Zhu
>            Priority: Major
>              Labels: flaky-test
>         Attachments: DefaultExecutorTest_TaskWithFileURI_badrun.txt
>
>
> The test is flaky and segfaults on CI (ubuntu-16.04-SSL); log attached.
> This appears to be due to a race condition during the test destruction sequence.
> The relevant part of the test:
> {code:c++}
>   Future<v1::scheduler::Event::Update> startingUpdate;
>   Future<v1::scheduler::Event::Update> runningUpdate;
>   Future<v1::scheduler::Event::Update> finishedUpdate;
>   EXPECT_CALL(*scheduler, update(_, _))
>     .WillOnce(
>         DoAll(
>             FutureArg<1>(&startingUpdate),
>             v1::scheduler::SendAcknowledge(frameworkId, agentId)))
>     .WillOnce(
>         DoAll(
>             FutureArg<1>(&runningUpdate),
>             v1::scheduler::SendAcknowledge(frameworkId, agentId)))
>     .WillOnce(
>         DoAll(
>             FutureArg<1>(&finishedUpdate),
>             v1::scheduler::SendAcknowledge(frameworkId, agentId)));
>   mesos.send(
>       v1::createCallAccept(
>           frameworkId,
>           offer,
>           {v1::LAUNCH_GROUP(
>               executorInfo, v1::createTaskGroupInfo({taskInfo}))}));
>   AWAIT_READY(startingUpdate);
>   ASSERT_EQ(v1::TASK_STARTING, startingUpdate->status().state());
>   ASSERT_EQ(taskInfo.task_id(), startingUpdate->status().task_id());
>   AWAIT_READY(runningUpdate);
>   ASSERT_EQ(v1::TASK_RUNNING, runningUpdate->status().state());
>   ASSERT_EQ(taskInfo.task_id(), runningUpdate->status().task_id());
>   AWAIT_READY(finishedUpdate);
>   ASSERT_EQ(v1::TASK_FINISHED, finishedUpdate->status().state());
>   ASSERT_EQ(taskInfo.task_id(), finishedUpdate->status().task_id());
> }
> {code}
> Sending the acknowledgment for the last task status update (TASK_FINISHED) can race with the test teardown. Specifically, the `EXPECT_CALL` on `update()` captures arg0, which is a pointer to `mesos`. However, `mesos` can be destructed during the test teardown, leaving arg0 a nullptr, and the test then segfaults when it tries to call arg0->...send().
> One quick fix is to remove the last acknowledgment. A sounder fix is to make the `mesos` pointer a shared pointer, but this would entail a lot of interface changes. Since the only concern is the capture in the `EXPECT_CALL`, it may be enough to change just that interface to take a shared pointer.
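
The shared-pointer idea can be illustrated outside of Mesos and gmock. The following is a minimal sketch under stated assumptions, not the actual test or library code: `Client`, `makeAcknowledge`, and `runTeardownScenario` are hypothetical names standing in for `mesos`, the acknowledgment action, and the test body. It only shows that an action which shares ownership of the client can still run safely after the test-side handle is gone.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <vector>

// Hypothetical stand-in for the scheduler client (`mesos` in the test).
class Client
{
public:
  explicit Client(std::vector<std::string>* log) : log_(log) {}
  ~Client() { log_->push_back("destroyed"); }

  void send(const std::string& call) { log_->push_back("sent " + call); }

private:
  std::vector<std::string>* log_;  // Outlives the client in this sketch.
};

// Builds a deferred acknowledgment action that shares ownership of the
// client, so the client stays alive until the action itself is released.
std::function<void()> makeAcknowledge(std::shared_ptr<Client> client)
{
  assert(client != nullptr);
  return [client]() { client->send("ACKNOWLEDGE"); };
}

// Simulates the teardown ordering from the report: the test body drops its
// handle to the client before the pending acknowledgment runs.
std::vector<std::string> runTeardownScenario()
{
  std::vector<std::string> log;
  std::function<void()> pending;

  {
    auto client = std::make_shared<Client>(&log);
    pending = makeAcknowledge(client);
  }  // Test-side handle goes away here; the action still holds the client.

  pending();          // Safe: the captured shared_ptr keeps the client alive.
  pending = nullptr;  // Releasing the action destroys the client.

  return log;
}
```

With a raw pointer captured instead of the `std::shared_ptr`, the call to `pending()` in this sketch would touch a destroyed object, which is the kind of failure the report describes. The cost of the shared capture is that the client's lifetime is extended until the last pending action is released.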


