Posted to issues@mesos.apache.org by "Maged Michael (JIRA)" <ji...@apache.org> on 2015/11/09 12:16:10 UTC

[jira] [Created] (MESOS-3855) Add deterministic simulation tools for Mesos testing and debugging

Maged Michael created MESOS-3855:
------------------------------------

             Summary: Add deterministic simulation tools for Mesos testing and debugging
                 Key: MESOS-3855
                 URL: https://issues.apache.org/jira/browse/MESOS-3855
             Project: Mesos
          Issue Type: Improvement
            Reporter: Maged Michael


Test case-driven testing of the Mesos master and allocator under non-deterministic, system-dependent conditions has two weaknesses: failures are often hard to reproduce, and problems that occur only on systems with different characteristics can be missed. Furthermore, test case-driven testing requires identifying the test cases beforehand.

The proposed simulation tools aim to run unmodified Mesos master and allocator code deterministically, driven by pseudo-random events generated within the constraints of configurable cluster models. Deterministic simulation guarantees repeatability of results. The configurable pseudo-random model drives exploration of the Mesos master and allocator state space without the need to identify specific test cases beforehand.

Basic Requirements:
- Simulation results are deterministic. All runs with the same parameters generate identical results regardless of the host system.
- Automatic integration of Mesos master and allocator code into the simulator without manual modification, by adding capabilities in the libprocess and stout libraries to control timing and communication among threads and among nodes.
- Support for configurable cluster models to generate pseudo-random events to drive the execution of operations in Mesos master and allocator.
- Support for invariants and statistics in the cluster model in order to detect errors and suboptimal behavior in the tested Mesos master and allocator implementation.
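
As an illustration of the first requirement, here is a minimal sketch of seed-driven event generation. It is not taken from the design document: the event kinds, the Event struct, and the functions generateEvents and fingerprint are hypothetical names chosen for this example. The sketch reduces raw std::mt19937_64 output with a modulo instead of using std::uniform_int_distribution. The C++ standard fully specifies the engine's output sequence but leaves the distributions implementation-defined, so a distribution could break the "regardless of the host system" property.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical event kinds for a toy cluster model (not Mesos types).
enum class EventKind : std::uint32_t {
  AgentAdded, AgentLost, TaskLaunched, TaskFinished
};

struct Event {
  std::uint64_t time;   // simulated time in ticks, not wall-clock time
  EventKind kind;
  std::uint32_t agent;  // index of the agent the event applies to
};

// Generates `count` events for a cluster of `agents` agents from `seed`.
// Each rng() call sits in its own statement so that the order in which
// random numbers are consumed does not depend on the compiler.
std::vector<Event> generateEvents(
    std::uint64_t seed, std::size_t count, std::uint32_t agents)
{
  assert(agents > 0);
  std::mt19937_64 rng(seed);
  std::vector<Event> events;
  events.reserve(count);
  std::uint64_t now = 0;
  for (std::size_t i = 0; i < count; ++i) {
    now += 1 + rng() % 100;  // strictly increasing simulated time
    EventKind kind = static_cast<EventKind>(rng() % 4);
    std::uint32_t agent = static_cast<std::uint32_t>(rng() % agents);
    events.push_back(Event{now, kind, agent});
  }
  return events;
}

// FNV-1a fingerprint of an event trace; two runs with the same
// parameters can be compared by comparing a single number.
std::uint64_t fingerprint(const std::vector<Event>& events)
{
  std::uint64_t hash = 14695981039346656037ULL;
  auto mix = [&hash](std::uint64_t value) {
    for (int i = 0; i < 8; ++i) {
      hash ^= (value >> (8 * i)) & 0xffu;
      hash *= 1099511628211ULL;
    }
  };
  for (const Event& e : events) {
    mix(e.time);
    mix(static_cast<std::uint64_t>(e.kind));
    mix(e.agent);
  }
  return hash;
}
```

Calling generateEvents twice with the same seed, count, and agent count yields traces with equal fingerprints. The modulo reduction introduces a slight bias, which is acceptable for a sketch but may matter for a real cluster model.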

Examples of problems to be detected by the simulator:
- Liveness problems such as deadlock, livelock, and starvation.
- Safety problems such as unintentional overallocation of resources, lost tasks, and failure to recover resources.
- Fairness problems such as allowing one or more frameworks to dominate resource usage at the expense of other frameworks.
- Violations of invariants in the Mesos master and allocator code.
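
To make the safety category concrete, the following is a minimal sketch of a cluster-model invariant that flags overallocation. The types Resources and AgentState and the function checkNoOverallocation are assumptions made for this example; they are not Mesos classes and do not reflect how the master tracks resources. Integer units (milli-CPUs, megabytes) are used so that the check does not depend on floating-point rounding.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical resource vector in integer units.
struct Resources {
  std::int64_t milliCpus;
  std::int64_t memMb;
};

// Hypothetical model of one agent: its capacity and what each
// framework currently holds on it.
struct AgentState {
  Resources total;
  std::map<std::string, Resources> allocated;  // keyed by framework id
};

// Invariant: on every agent, the sum of allocations must not exceed
// the agent's total. Returns one message per violated resource.
std::vector<std::string> checkNoOverallocation(
    const std::map<std::string, AgentState>& agents)
{
  std::vector<std::string> violations;
  for (const auto& agentEntry : agents) {
    const std::string& agentId = agentEntry.first;
    const AgentState& agent = agentEntry.second;
    assert(agent.total.milliCpus >= 0 && agent.total.memMb >= 0);

    Resources sum{0, 0};
    for (const auto& allocation : agent.allocated) {
      sum.milliCpus += allocation.second.milliCpus;
      sum.memMb += allocation.second.memMb;
    }
    if (sum.milliCpus > agent.total.milliCpus) {
      violations.push_back(agentId + ": cpus overallocated");
    }
    if (sum.memMb > agent.total.memMb) {
      violations.push_back(agentId + ": memory overallocated");
    }
  }
  return violations;
}
```

A simulator could evaluate such a check after every simulated event and stop at the first violation. Because the run is deterministic, the failing event sequence can then be replayed exactly.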

Possible extensions that leverage common infrastructure:
- Performance testing, e.g., detecting high response time, low resource utilization, or low throughput.
- Framework plug-in interface for testing framework task scheduling policies with Mesos allocators and against other framework policies.
- Cluster performance modeling to establish performance bounds for Mesos configurations of interest and what-if scenarios without the need to run on a real cloud.

Subitems (initial list):
- Add deterministic simulation capabilities to libprocess and stout.
  -- Replace real time with simulated time.
  -- Intercept inter-thread and inter-node communication.
  -- Schedule deterministic simulated communication events.
- Add libprocess-based test cases for deterministic simulation tools.
  -- Programs with inter-thread and inter-node communication using libprocess.
- Add Mesos cluster simulated event scheduler.
  -- To manage and order events (inter-node and inter-thread communication).
- Add configurable Mesos cluster model for driving deterministic simulation.
  -- Minimal (extensible) models of frameworks, roles, jobs, tasks, agents, and resources.
  -- Cluster invariants and statistics.
- Add mock ZooKeeper model for deterministic Mesos cluster simulation.
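
The simulated-time and event-scheduler subitems can be sketched roughly as follows. This is a standalone illustration, not libprocess code: the Simulator class and its methods are hypothetical, and the sketch does not show how libprocess would intercept real communication. It only shows simulated time advancing to the timestamp of the next event, and a sequence number used as a tie-breaker so that events scheduled for the same time always run in the same order.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical single-threaded discrete-event simulator.
class Simulator {
public:
  using Action = std::function<void()>;

  // Current simulated time in ticks; never reads the wall clock.
  std::uint64_t now() const { return now_; }

  // Schedules `action` to run `delay` ticks after the current time.
  void schedule(std::uint64_t delay, Action action)
  {
    queue_.push(Entry{now_ + delay, nextSeq_++, std::move(action)});
  }

  // Runs events in (time, sequence) order until none remain.
  // Actions may schedule further events. Returns the number executed.
  std::size_t run()
  {
    std::size_t executed = 0;
    while (!queue_.empty()) {
      Entry entry = queue_.top();  // copy first: the action may push
      queue_.pop();
      assert(entry.time >= now_);  // simulated time never goes back
      now_ = entry.time;
      entry.action();
      ++executed;
    }
    return executed;
  }

private:
  struct Entry {
    std::uint64_t time;
    std::uint64_t seq;  // insertion order, breaks ties deterministically
    Action action;
  };

  // Orders the priority queue so the earliest (time, seq) is on top.
  struct Later {
    bool operator()(const Entry& a, const Entry& b) const
    {
      if (a.time != b.time) return a.time > b.time;
      return a.seq > b.seq;
    }
  };

  std::priority_queue<Entry, std::vector<Entry>, Later> queue_;
  std::uint64_t now_ = 0;
  std::uint64_t nextSeq_ = 0;
};
```

In this sketch, a message send would be modeled as a call to schedule with the simulated delivery delay, and the handler on the receiving side as the action. Running everything on one thread is a simplification made for the example; it removes thread-scheduling non-determinism by construction.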

Assumptions:
- The tested code is free of data races.
- Third-party packages (ZooKeeper, protobuf, ...) are correct.

Link to high-level design document (in progress): https://goo.gl/9wfPef




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)