Posted to dev@mesos.apache.org by Yan Xu <ya...@jxu.me> on 2014/08/05 22:48:36 UTC

Introducing Framework Rate Limiting in Mesos

Hi Mesos users and developers,

In Mesos 0.20.0 we’d like to release a feature called framework rate limiting.

What is Framework Rate Limiting

In a multi-framework environment, this feature aims to protect the throughput of high-SLA (e.g., production, service) frameworks by having the master throttle messages from other (e.g., development, batch) frameworks.

To throttle messages from a framework, the operator sets a qps (queries per second) value for each framework, identified by its principal. (You can also throttle a group of frameworks together, but we’ll assume individual frameworks in this email; see the config definition here; we’ll publish a user doc soon.) The master then promises not to process messages from that framework at a rate above qps (though the actual rate may be lower, especially if the qps is set so high that the master cannot keep up). Outstanding messages are stored in memory on the master.
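As a rough illustration of the throttling promise above, here is a minimal Python sketch. It is not the master's implementation: the class name PrincipalThrottle, its methods, and the polling model are all hypothetical, chosen only to show messages for one principal waiting in memory and being released no faster than qps.

```python
from collections import deque


class PrincipalThrottle:
    """Hypothetical model of throttling one principal at a fixed qps."""

    def __init__(self, qps):
        self.interval = 1.0 / qps  # minimum spacing between processed messages
        self.pending = deque()     # outstanding messages, held in memory
        self.next_allowed = 0.0    # earliest time the next message may be processed

    def receive(self, message):
        # Messages are always queued first; this sketch has no capacity bound.
        self.pending.append(message)

    def poll(self, now):
        """Return the next message if one is due at time `now`, else None."""
        if self.pending and now >= self.next_allowed:
            self.next_allowed = now + self.interval
            return self.pending.popleft()
        return None

    def outstanding(self):
        return len(self.pending)


if __name__ == "__main__":
    throttle = PrincipalThrottle(qps=2)
    for i in range(5):
        throttle.receive("message-%d" % i)
    # Poll every quarter second; at most one message is released per 0.5s.
    for tick in range(12):
        now = tick * 0.25
        message = throttle.poll(now)
        if message is not None:
            print(now, message)
```

Because a message is released only when it is polled, the effective rate in this sketch can fall below qps if polling is slow, which mirrors the "possibly below" caveat.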

We’ll publish more detailed guidance on how to set the qps (as well as the capacity which is described below) in the user doc.

Limitation and Future Work

To prevent the number of outstanding messages from growing without bound and crashing the master when it runs out of memory, the operator can set a capacity for the framework, which is the maximum number of outstanding messages allowed. When a framework exceeds the capacity, a FrameworkErrorMessage is sent back to the framework, which aborts the scheduler driver and invokes the error() callback. This doesn’t kill any tasks or the scheduler itself, and the framework developer can choose to restart or fail over the scheduler instance to remedy the consequences of dropped messages (unless the framework doesn’t rely on every message sent to the master being processed).
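The capacity behavior can be sketched in Python as follows. This is only a toy model under stated assumptions: FrameworkQueue and DriverStub are hypothetical names, the error text is made up, and treating a full queue (length equal to capacity) as the point where the next message is rejected is an assumption about what "exceeds" means here.

```python
from collections import deque


class FrameworkQueue:
    """Hypothetical bounded queue of outstanding messages for one framework."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.pending = deque()

    def enqueue(self, message):
        """Return None if queued, or an error string standing in for
        a FrameworkErrorMessage when the capacity would be exceeded."""
        if len(self.pending) >= self.capacity:
            return "capacity of %d outstanding messages exceeded" % self.capacity
        self.pending.append(message)
        return None


class DriverStub:
    """Hypothetical stand-in for a scheduler driver and its error() callback."""

    def __init__(self):
        self.aborted = False
        self.errors = []

    def error(self, text):
        # Models the scheduler's error() callback; tasks are not touched.
        self.errors.append(text)

    def send(self, queue, message):
        """Return True if the message was queued, False if it was dropped."""
        if self.aborted:
            return False  # an aborted driver sends nothing further
        failure = queue.enqueue(message)
        if failure is not None:
            self.aborted = True
            self.error(failure)
            return False
        return True


if __name__ == "__main__":
    queue = FrameworkQueue(capacity=3)
    driver = DriverStub()
    print([driver.send(queue, n) for n in range(5)])
    print(driver.aborted, driver.errors)
```

In this model, recovery means creating a fresh DriverStub, which corresponds to restarting or failing over the scheduler instance; the messages dropped in the meantime are not replayed.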

The above is the behavior we are planning for 0.20.0 with respect to framework rate limiting. We are going to iterate on this feature by having the master send an early alert when the message queue for this framework starts to build up (MESOS-1664). The scheduler can react by throttling itself (to avoid the error message) or ignoring this alert if it’s just a temporary burst. MESOS-1664 is likely not going to land in 0.20.0.

Due to this limitation, we don’t recommend using the rate limiting feature to throttle production frameworks for now unless you are sure about the consequences of the error message. Of course, it’s OK to use it to protect production frameworks by throttling other frameworks, and the feature has no effect on the master unless it is explicitly enabled.

We are excited about this feature and hope to release it in 0.20.0. Your comments about the feature and this limitation are welcome and appreciated! If there is no objection within 3 days we’ll assume lazy consensus and put it in the release so you can start using it.

Thanks!
Yan