Posted to dev@mesos.apache.org by "Benjamin Mahler (JIRA)" <ji...@apache.org> on 2013/11/26 21:14:35 UTC

[jira] [Created] (MESOS-851) Scheduler Driver does not guarantee that abort() prevents further calls on the Scheduler.

Benjamin Mahler created MESOS-851:
-------------------------------------

             Summary: Scheduler Driver does not guarantee that abort() prevents further calls on the Scheduler.
                 Key: MESOS-851
                 URL: https://issues.apache.org/jira/browse/MESOS-851
             Project: Mesos
          Issue Type: Bug
          Components: c++ api, java api, python api
            Reporter: Benjamin Mahler
            Priority: Critical
             Fix For: 0.16.0


This came up while reviewing: https://reviews.apache.org/r/15853/

Our documentation for abort mentions that no more callbacks can be made to the scheduler:
  /**
   * Aborts the driver so that no more callbacks can be made to the
   * scheduler. The semantics of abort and stop have deliberately been
   * separated so that code can detect an aborted driver (i.e., via
   * the return status of SchedulerDriver::join, see below), and
   * instantiate and start another driver if desired (from within the
   * same process). Note that 'stop()' is not automatically called
   * inside 'abort()'.
   */
  virtual Status abort() = 0;

However, this is inaccurate: abort() only performs a dispatch to the SchedulerProcess, so any messages already queued ahead of that dispatch are still processed (and can still invoke Scheduler callbacks) before the abort takes effect:


Status MesosSchedulerDriver::abort()
{
  Lock lock(&mutex);

  if (status != DRIVER_RUNNING) {
    return status;
  }

  CHECK(process != NULL);

  // XXX: This does not immediately signal the SchedulerProcess to stop
  // processing messages!
  dispatch(process, &SchedulerProcess::abort);

  return status = DRIVER_ABORTED;
}
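
To make the ordering problem concrete, here is a minimal sketch using a toy single-consumer FIFO queue. It is not libprocess: ToyProcess, deliver and callbacksAfterAbort are hypothetical names invented for illustration, and the only assumption carried over from the code above is that the abort request is enqueued behind whatever is already waiting.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a libprocess-style actor: events are run
// strictly in FIFO order by a single consumer.
class ToyProcess {
public:
  // Enqueue an event at the back of the queue (analogous to dispatch()).
  void dispatch(std::function<void()> event) {
    queue.push_back(std::move(event));
  }

  // Run every queued event in the order it was enqueued.
  void drain() {
    while (!queue.empty()) {
      std::function<void()> event = std::move(queue.front());
      queue.pop_front();
      event();
    }
  }

private:
  std::deque<std::function<void()>> queue;
};

// Returns the names of the callbacks that the toy "scheduler" still
// received even though abort was requested before the queue was drained.
std::vector<std::string> callbacksAfterAbort() {
  ToyProcess process;
  bool aborted = false;  // Set only when the queued abort event runs.
  std::vector<std::string> seen;

  // Queue a message whose handler invokes a scheduler callback unless
  // the process has already observed the abort.
  auto deliver = [&](const std::string& name) {
    process.dispatch([&, name]() {
      if (!aborted) {
        seen.push_back(name);
      }
    });
  };

  deliver("resourceOffers");
  deliver("statusUpdate");

  // The abort request is merely enqueued, so it lands behind the two
  // messages above.
  process.dispatch([&]() { aborted = true; });

  // Anything queued after the abort event is suppressed as expected.
  deliver("frameworkMessage");

  process.drain();
  return seen;
}
```

In this sketch callbacksAfterAbort() returns the two callbacks that were queued before the abort request, which is exactly the behavior the documentation says cannot happen.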

The driver's stop() call has the same issue: the Scheduler may receive additional calls after stop() has been called.

This problem is mirrored in the ExecutorDriver's stop and abort functions as well.

So far, I see a few possible fixes:

1. Expose the 'volatile bool aborted' member variable of SchedulerProcess and set it inside MesosSchedulerDriver::abort. stop() would need a similar boolean.

2. Provide a "priority dispatch" mechanism in libprocess, wherein the DispatchEvent can be sent to the front of the queue. (stop() can also use this).

3. Terminate the process when abort/stop are called and handle it appropriately in the finalize() function. However, this changes the existing behavior: schedulers could no longer make driver calls (to kill tasks, launch tasks, etc.) after being aborted.
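
As a rough illustration of options 1 and 2, the sketch below uses a toy FIFO queue rather than libprocess. All names here (ToyProcessWithFixes, dispatchFront, Fix, callbacksAfterFixedAbort) are hypothetical, dispatchFront is not an existing libprocess call, and std::atomic<bool> is swapped in for the 'volatile bool aborted' mentioned in option 1.

```cpp
#include <atomic>
#include <cassert>
#include <deque>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical FIFO actor standing in for a libprocess process.
class ToyProcessWithFixes {
public:
  // Normal dispatch: append to the back of the queue.
  void dispatch(std::function<void()> event) {
    queue.push_back(std::move(event));
  }

  // Option 2 (hypothetical "priority dispatch"): jump to the front.
  void dispatchFront(std::function<void()> event) {
    queue.push_front(std::move(event));
  }

  // Run every queued event, front to back.
  void drain() {
    while (!queue.empty()) {
      std::function<void()> event = std::move(queue.front());
      queue.pop_front();
      event();
    }
  }

  // Option 1: a flag the driver can set directly, without queueing.
  std::atomic<bool> aborted{false};

private:
  std::deque<std::function<void()>> queue;
};

enum class Fix { SharedFlag, PriorityDispatch };

// Returns the callbacks delivered after abort was requested with the
// chosen fix applied.
std::vector<std::string> callbacksAfterFixedAbort(Fix fix) {
  ToyProcessWithFixes process;
  std::vector<std::string> seen;

  // Queue a message whose handler checks the flag before calling back.
  auto deliver = [&](const std::string& name) {
    process.dispatch([&, name]() {
      if (!process.aborted.load()) {
        seen.push_back(name);
      }
    });
  };

  deliver("resourceOffers");
  deliver("statusUpdate");

  if (fix == Fix::SharedFlag) {
    // Option 1: the caller sets the flag itself, so already queued
    // handlers see it as soon as they run.
    process.aborted.store(true);
  } else {
    // Option 2: the abort event is placed ahead of the queued messages.
    process.dispatchFront([&]() { process.aborted.store(true); });
  }

  process.drain();
  return seen;
}
```

With either fix the two queued messages no longer reach the scheduler in this toy model. In a real multi-threaded driver, neither approach can recall a callback that is already executing when abort() is called, so that case would still need to be handled separately.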



--
This message was sent by Atlassian JIRA
(v6.1#6144)