You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/11/05 21:46:17 UTC

[jira] [Created] (MESOS-800) CHECK failure in cgroups_isolator.

Benjamin Mahler created MESOS-800:
-------------------------------------

             Summary: CHECK failure in cgroups_isolator.
                 Key: MESOS-800
                 URL: https://issues.apache.org/jira/browse/MESOS-800
             Project: Mesos
          Issue Type: Bug
    Affects Versions: 0.14.1, 0.14.0, 0.15.0, 0.14.2
            Reporter: Benjamin Mahler
            Assignee: Benjamin Mahler
             Fix For: 0.16.0


F1105 17:22:04.206166 35860 cgroups_isolator.cpp:1205] Check failed: !info->killed OOM detected for an already killed executor
*** Check failure stack trace: ***
    @     0x7f3ad114262d  google::LogMessage::Fail()
    @     0x7f3ad1146617  google::LogMessage::SendToLog()
    @     0x7f3ad1144f14  google::LogMessage::Flush()
    @     0x7f3ad1145146  google::LogMessageFatal::~LogMessageFatal()
    @     0x7f3ad0f0c142  mesos::internal::slave::CgroupsIsolator::oom()
    @     0x7f3ad0f0c571  mesos::internal::slave::CgroupsIsolator::oomWaited()
    @     0x7f3ad0f1de61  std::tr1::_Function_handler<>::_M_invoke()
    @     0x7f3ad0f1fb54  std::tr1::_Function_handler<>::_M_invoke()
    @     0x7f3ad1033f84  process::ProcessManager::resume()
    @     0x7f3ad10349df  process::schedule()
    @     0x7f3ad079d83d  start_thread
    @     0x7f3acf17ff8d  clone

This is because we're not ignoring killed executors in the OOM handler, see my comments below:

void CgroupsIsolator::killExecutor(
    const FrameworkID& frameworkId,
    const ExecutorID& executorId)
{
  ...
  // Stop the OOM listener if needed.
  // XXX: The OOM listener can already be ready at this point! This means we need to ignore killed executors in the OOM handler.
  if (info->oomNotifier.isPending()) {
    info->oomNotifier.discard();
  }

  info->killed = true;
}

Given my comment above, we need to ignore killed executors in the OOM handler. Instead, we perform a CHECK which can fail when the race between kill and OOM occurs:

void CgroupsIsolator::oomWaited(
    const FrameworkID& frameworkId,
    const ExecutorID& executorId,
    const UUID& uuid,
    const Future<uint64_t>& future)
{
  LOG(INFO) << "OOM notifier is triggered for executor "
            << executorId << " of framework " << frameworkId
            << " with uuid " << uuid;

  if (future.isDiscarded()) {
    LOG(INFO) << "Discarded OOM notifier for executor "
              << executorId << " of framework " << frameworkId
              << " with uuid " << uuid;
  } else if (future.isFailed()) {
    LOG(ERROR) << "Listening on OOM events failed for executor "
               << executorId << " of framework " << frameworkId
               << " with uuid " << uuid << ": " << future.failure();
  } else {
    // Out-of-memory event happened, call the handler.
    oom(frameworkId, executorId, uuid);
  }
}


void CgroupsIsolator::oom(
    const FrameworkID& frameworkId,
    const ExecutorID& executorId,
    const UUID& uuid)
{
  CgroupInfo* info = findCgroupInfo(frameworkId, executorId);
  if (info == NULL) {
    // It is likely that processExited is executed before this function (e.g.
    // The kill and OOM events happen at the same time, and the process exit
    // event arrives first.) Therefore, we should not report a fatal error here.
    LOG(INFO) << "OOM detected for an already terminated executor";
    return;
  }

  // We can also ignore an OOM event that we are late to process for a
  // previous instance of an executor.
  CHECK_SOME(info->uuid);
  if (uuid != info->uuid.get()) {
    LOG(INFO) << "OOM detected for a previous executor instance";
    return;
  }

  // If killed is set, the OOM notifier will be discarded in oomWaited.
  // Therefore, we should not be able to reach this point.
  // XXX: ^ This comment is false. oomWaited does not ignore killed executors.
  CHECK(!info->killed) << "OOM detected for an already killed executor";
  ...



--
This message was sent by Atlassian JIRA
(v6.1#6144)