You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@mesos.apache.org by "Bill Farner (JIRA)" <ji...@apache.org> on 2014/11/11 22:07:35 UTC

[jira] [Created] (MESOS-2078) Scheduler driver may ACK status updates when the scheduler threw an exception

Bill Farner created MESOS-2078:
----------------------------------

             Summary: Scheduler driver may ACK status updates when the scheduler threw an exception
                 Key: MESOS-2078
                 URL: https://issues.apache.org/jira/browse/MESOS-2078
             Project: Mesos
          Issue Type: Bug
          Components: java api
            Reporter: Bill Farner
            Assignee: Vinod Kone
            Priority: Critical


[~vinod] discovered that this can happen if the scheduler calls {{SchedulerDriver#stop}} before or while handling {{Scheduler#statusUpdate}}.

In src/sched/sched.cpp:
The driver invokes {{statusUpdate}} and later checks the {{aborted}} flag to determine whether to send an ACK.
{code}
  void statusUpdate(
      const UPID& from,
      const StatusUpdate& update,
      const UPID& pid)
  {

...

    scheduler->statusUpdate(driver, status);

    VLOG(1) << "Scheduler::statusUpdate took " << stopwatch.elapsed();

    // Note that we need to look at the volatile 'aborted' here to
    // so that we don't acknowledge the update if the driver was
    // aborted during the processing of the update.
    if (aborted) {
      VLOG(1) << "Not sending status update acknowledgment message because "
              << "the driver is aborted!";
      return;
    }
...
{code}

In src/java/jni/org_apache_mesos_MesosSchedulerDriver.cpp:
The {{statusUpdate}} implementation checks for an exception and invokes {{driver->abort()}].
{code}
void JNIScheduler::statusUpdate(SchedulerDriver* driver,
                                const TaskStatus& status)
{
  jvm->AttachCurrentThread(JNIENV_CAST(&env), NULL);

  jclass clazz = env->GetObjectClass(jdriver);

  jfieldID scheduler = env->GetFieldID(clazz, "scheduler", "Lorg/apache/mesos/Scheduler;");
  jobject jscheduler = env->GetObjectField(jdriver, scheduler);

  clazz = env->GetObjectClass(jscheduler);

  // scheduler.statusUpdate(driver, status);
  jmethodID statusUpdate =
    env->GetMethodID(clazz, "statusUpdate",
                     "(Lorg/apache/mesos/SchedulerDriver;"
                     "Lorg/apache/mesos/Protos$TaskStatus;)V");

  jobject jstatus = convert<TaskStatus>(env, status);

  env->ExceptionClear();

  env->CallVoidMethod(jscheduler, statusUpdate, jdriver, jstatus);

  if (env->ExceptionCheck()) {
    env->ExceptionDescribe();
    env->ExceptionClear();
    jvm->DetachCurrentThread();
    driver->abort();
    return;
  }

  jvm->DetachCurrentThread();
}
{code}

In src/sched/sched.cpp:
The {{abort()}} implementation exits early if {{status != DRIVER_RUNNING}}, and *does not set the aborted flag*.
{code}
Status MesosSchedulerDriver::abort()
{
  Lock lock(&mutex);

  if (status != DRIVER_RUNNING) {
    return status;
  }

  CHECK(process != NULL);

  // We set the volatile aborted to true here to prevent any further
  // messages from being processed in the SchedulerProcess. However,
  // if abort() is called from another thread as the SchedulerProcess,
  // there may be at most one additional message processed.
  // TODO(bmahler): Use an atomic boolean.
  process->aborted = true;

  // Dispatching here ensures that we still process the outstanding
  // requests *from* the scheduler, since those do proceed when
  // aborted is true.
  dispatch(process, &SchedulerProcess::abort);

  return status = DRIVER_ABORTED;
}
{code}

As a result, the code will ACK despite an exception being thrown.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)