You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/11/05 22:00:17 UTC
[jira] [Resolved] (MESOS-800) CHECK failure in cgroups_isolator.
[ https://issues.apache.org/jira/browse/MESOS-800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Benjamin Mahler resolved MESOS-800.
-----------------------------------
Resolution: Fixed
Fix committed on master:
{noformat}
commit 063376d3abb3e37e4a1e12fe8290049c4d1ac35b
Author: Benjamin Mahler <bm...@twitter.com>
Date: Tue Nov 5 12:48:53 2013 -0800
Fixed a race condition in the Cgroups Isolator.
{noformat}
> CHECK failure in cgroups_isolator.
> ----------------------------------
>
> Key: MESOS-800
> URL: https://issues.apache.org/jira/browse/MESOS-800
> Project: Mesos
> Issue Type: Bug
> Affects Versions: 0.14.0, 0.15.0, 0.14.1, 0.14.2
> Reporter: Benjamin Mahler
> Assignee: Benjamin Mahler
> Fix For: 0.16.0
>
>
> F1105 17:22:04.206166 35860 cgroups_isolator.cpp:1205] Check failed: !info->killed OOM detected for an already killed executor
> *** Check failure stack trace: ***
> @ 0x7f3ad114262d google::LogMessage::Fail()
> @ 0x7f3ad1146617 google::LogMessage::SendToLog()
> @ 0x7f3ad1144f14 google::LogMessage::Flush()
> @ 0x7f3ad1145146 google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f3ad0f0c142 mesos::internal::slave::CgroupsIsolator::oom()
> @ 0x7f3ad0f0c571 mesos::internal::slave::CgroupsIsolator::oomWaited()
> @ 0x7f3ad0f1de61 std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7f3ad0f1fb54 std::tr1::_Function_handler<>::_M_invoke()
> @ 0x7f3ad1033f84 process::ProcessManager::resume()
> @ 0x7f3ad10349df process::schedule()
> @ 0x7f3ad079d83d start_thread
> @ 0x7f3acf17ff8d clone
> This is because we're not ignoring killed executors in the OOM handler, see my comments below:
> void CgroupsIsolator::killExecutor(
> const FrameworkID& frameworkId,
> const ExecutorID& executorId)
> {
> ...
> // Stop the OOM listener if needed.
> // XXX: The OOM listener can already be ready at this point! This means we need to ignore killed executors in the OOM handler.
> if (info->oomNotifier.isPending()) {
> info->oomNotifier.discard();
> }
> info->killed = true;
> }
> Given my comment above, we need to ignore killed executors in the OOM handler. Instead, we perform a CHECK which can fail when the race between kill and OOM occurs:
> void CgroupsIsolator::oomWaited(
> const FrameworkID& frameworkId,
> const ExecutorID& executorId,
> const UUID& uuid,
> const Future<uint64_t>& future)
> {
> LOG(INFO) << "OOM notifier is triggered for executor "
> << executorId << " of framework " << frameworkId
> << " with uuid " << uuid;
> if (future.isDiscarded()) {
> LOG(INFO) << "Discarded OOM notifier for executor "
> << executorId << " of framework " << frameworkId
> << " with uuid " << uuid;
> } else if (future.isFailed()) {
> LOG(ERROR) << "Listening on OOM events failed for executor "
> << executorId << " of framework " << frameworkId
> << " with uuid " << uuid << ": " << future.failure();
> } else {
> // Out-of-memory event happened, call the handler.
> oom(frameworkId, executorId, uuid);
> }
> }
> void CgroupsIsolator::oom(
> const FrameworkID& frameworkId,
> const ExecutorID& executorId,
> const UUID& uuid)
> {
> CgroupInfo* info = findCgroupInfo(frameworkId, executorId);
> if (info == NULL) {
> // It is likely that processExited is executed before this function (e.g.
> // The kill and OOM events happen at the same time, and the process exit
> // event arrives first.) Therefore, we should not report a fatal error here.
> LOG(INFO) << "OOM detected for an already terminated executor";
> return;
> }
> // We can also ignore an OOM event that we are late to process for a
> // previous instance of an executor.
> CHECK_SOME(info->uuid);
> if (uuid != info->uuid.get()) {
> LOG(INFO) << "OOM detected for a previous executor instance";
> return;
> }
> // If killed is set, the OOM notifier will be discarded in oomWaited.
> // Therefore, we should not be able to reach this point.
> // XXX: ^ This comment is false. oomWaited does not ignore killed executors.
> CHECK(!info->killed) << "OOM detected for an already killed executor";
> ...
--
This message was sent by Atlassian JIRA
(v6.1#6144)