Posted to issues@mesos.apache.org by "Jie Yu (JIRA)" <ji...@apache.org> on 2014/11/06 20:22:38 UTC
[jira] [Created] (MESOS-2047) Isolator cleanup failures shouldn't cause TASK_LOST.
Jie Yu created MESOS-2047:
-----------------------------
Summary: Isolator cleanup failures shouldn't cause TASK_LOST.
Key: MESOS-2047
URL: https://issues.apache.org/jira/browse/MESOS-2047
Project: Mesos
Issue Type: Bug
Affects Versions: 0.21.0
Reporter: Jie Yu
Right now, if isolator cleanup fails, we transition all pending tasks to TASK_LOST. This happens even in the OOM case, where the task should have been transitioned to TASK_FAILED.
The problematic code is here:
{noformat}
void MesosContainerizerProcess::___destroy(
    const ContainerID& containerId,
    const Future<Option<int>>& status,
    const Future<list<Future<Nothing>>>& cleanups)
{
  // This should not occur because we only use the Future<list> to
  // facilitate chaining.
  CHECK_READY(cleanups);

  // Check cleanup succeeded for all isolators. If not, we'll fail the
  // container termination and remove the 'destroying' flag but leave
  // all other state. The container is now in an inconsistent state.
  foreach (const Future<Nothing>& cleanup, cleanups.get()) {
    if (!cleanup.isReady()) {
      promises[containerId]->fail(
          "Failed to clean up an isolator when destroying container '" +
          stringify(containerId) + "' :" +
          (cleanup.isFailed() ? cleanup.failure() : "discarded future"));

      destroying.erase(containerId);

      return;
    }
  }
{noformat}
Since launcher->destroy has already succeeded by this point (all processes are killed), instead of failing promises[containerId], we should probably just export the error through metrics (so that people can be alerted to it) and still set the termination appropriately.
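
To make the proposal concrete, here is a minimal standalone sketch of the idea, not the actual Mesos change. All names in it (CleanupResult, Termination, Metrics, containerDestroyErrors, finishDestroy) are hypothetical stand-ins made up for illustration, and plain standard-library types replace Future, Option and the promises map. It only shows the control flow: a failed cleanup is counted and logged, and the termination is still produced.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical stand-in for the outcome of one isolator's cleanup.
struct CleanupResult {
  bool ready;           // true if the cleanup completed successfully
  std::string failure;  // failure message; empty if ready or discarded
};

// Hypothetical stand-in for the termination handed to whoever waits
// on the container.
struct Termination {
  int status;           // exit status reported by the launcher
  bool killed;          // true if the container was killed (e.g. OOM)
  std::string message;  // human-readable reason
};

// Hypothetical metrics holder; a real implementation would use the
// project's own metrics facilities instead.
struct Metrics {
  unsigned long containerDestroyErrors = 0;
};

// Counts and logs cleanup failures, but always returns the termination
// so that the task state can still be derived from it.
Termination finishDestroy(
    int status,
    const std::vector<CleanupResult>& cleanups,
    bool killed,
    const std::string& message,
    Metrics& metrics)
{
  for (const CleanupResult& cleanup : cleanups) {
    if (!cleanup.ready) {
      // Export the error instead of failing the termination.
      ++metrics.containerDestroyErrors;
      std::cerr << "Failed to clean up an isolator: "
                << (cleanup.failure.empty() ? "discarded future"
                                            : cleanup.failure)
                << std::endl;
    }
  }

  Termination termination = {status, killed, message};
  return termination;
}
```

With this shape, an OOM-killed container whose cleanup partly fails would still yield a termination with killed set, so the caller could report TASK_FAILED, while the counter gives operators something to alert on.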