Posted to dev@openwhisk.apache.org by Markus Thömmes <ma...@me.com> on 2018/02/01 13:57:27 UTC

Introducing sharding as an alternative for state sharing

Hi folks,

we (Christian Bickel and I) have just opened a pull request for comments on introducing a notion of sharding instead of sharing state between our controllers (loadbalancers). It also addresses a few deficiencies of the old loadbalancer, removing bottlenecks there and making it as fast as possible.

Commit message for posterity:

The current ContainerPoolBalancer suffers from a couple of problems and bottlenecks:

1. Inconsistent state: The data structures keeping the state for that loadbalancer are not handled in a thread-safe way, meaning requests can queue up on some invokers even though there is free capacity on other invokers.
2. Asynchronously shared state: Sharing the state is needed for a highly available deployment of multiple controllers and for horizontal scaling of the controllers. Said state-sharing makes point 1 even worse and isn't anywhere near fast enough to schedule quick bursts efficiently.
3. Bottlenecks: Getting the state from the outside (like for the ActivationThrottle) is a very costly operation (at least in the shared-state case) and actually bottlenecks the whole invocation path. Getting the current state of the invokers is a second bottleneck, where one request is made to the corresponding actor for each invocation.
This new implementation aims to solve the problems mentioned above as follows:

1. All state is local: There is no shared state. Resources are managed through horizontal sharding. Horizontal sharding means: the invokers' slots are evenly divided between the existing loadbalancers. If we deploy 2 loadbalancers and each invoker has 16 slots, each loadbalancer will have access to 8 slots on each invoker.
2. Slots are given out atomically: When scheduling an activation, a slot is immediately assigned to that activation (implemented through Semaphores). That means: even with concurrent scheduling, an invoker will not be overloaded as long as it has capacity left.
3. Asynchronous updates of slow data: Slowly changing data, like a change in an invoker's state, is handled asynchronously and applied to a local copy of the state. Querying that state is as cheap as it can be (a rough sketch of all three points follows below).
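
To make that a bit more concrete, here is a minimal sketch of how the three points above can fit together. This is not the code from the pull request; the names (ShardedLoadBalancer, tryScheduling, ...) are made up for illustration:

    // Hypothetical sketch, not the PR code: horizontal sharding plus
    // atomic slot assignment per controller.
    import java.util.concurrent.Semaphore

    class ShardedLoadBalancer(
        invokers: Seq[String],    // invoker identifiers, e.g. "invoker0"
        slotsPerInvoker: Int,     // total slots each invoker has, e.g. 16
        loadbalancerCount: Int) { // number of controllers sharing the invokers

      // Horizontal sharding: this controller owns an equal share of every invoker.
      private val slotsPerShard = slotsPerInvoker / loadbalancerCount

      // One semaphore per invoker guards this controller's share of its slots.
      private val slots: Map[String, Semaphore] =
        invokers.map(i => i -> new Semaphore(slotsPerShard)).toMap

      // Slowly changing data (e.g. invoker health) is updated asynchronously
      // elsewhere and only read here; querying it is a cheap local read.
      @volatile private var healthyInvokers: Seq[String] = invokers

      def updateHealth(healthy: Seq[String]): Unit = healthyInvokers = healthy

      // Atomically claim a slot on some healthy invoker, if any has capacity.
      def tryScheduling(): Option[String] =
        healthyInvokers.find(i => slots(i).tryAcquire())

      // Give the slot back once the activation has completed.
      def release(invoker: String): Unit = slots(invoker).release()
    }

With 2 loadbalancers and 16 slots per invoker, each instance would be constructed with a shard of 8 slots per invoker and can never hand out more than those 8 concurrently, no matter what the other controller does.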

A few words on the implementation details:

We chose horizontal sharding (evenly dividing the capacity of each invoker) over vertical sharding (evenly dividing the invokers as a whole) mainly to be able to stage these changes. Once we divide vertically, we'll need a good loadbalancing strategy in front of the controllers themselves to keep unnecessary cold-starts to a minimum and maximize container reuse. By dividing horizontally, we maintain the same reuse policies as today and can even keep the same loadbalancing strategies intact. Horizontal sharding of course only scales so far (maybe to 4 controllers, assuming 16 slots on each invoker), but it buys us time to figure out good strategies for vertical sharding and to learn along the way. For vertical sharding to work, it will also be crucial to have the centralized overflow queue, so that work can be offloaded between shards through work stealing. All in all: vertical sharding is a much bigger change than horizontal sharding.
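
As a quick back-of-the-envelope example of that limit: assuming 16 slots per invoker, 2 controllers get 8 slots each per invoker, 4 controllers get 4 each, and 8 controllers would be down to 2 each, at which point the individual shards arguably become too small to schedule efficiently.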

We tried to implement everything in a single actor first, but that seemed to impose a bottleneck again. Note that this is a very hot code path and needs to be very tight; that might not match the actor model too well.
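
For illustration (again, not the actual code; the names are hypothetical), a single scheduling actor would look roughly like the following. The state handling is trivially consistent, but every scheduling decision has to pass through one mailbox, one message at a time:

    // Illustration only: a scheduler built as a single actor.
    // All Schedule messages are handled sequentially from one mailbox,
    // so the whole invocation path funnels through a single thread.
    import akka.actor.Actor

    final case class Schedule(action: String)

    class SingleSchedulerActor extends Actor {
      private var freeSlots = Map("invoker0" -> 16, "invoker1" -> 16)

      override def receive: Receive = {
        case Schedule(_) =>
          freeSlots.find { case (_, free) => free > 0 } match {
            case Some((invoker, free)) =>
              freeSlots = freeSlots.updated(invoker, free - 1)
              sender() ! invoker
            case None =>
              sender() ! "overloaded"
          }
      }
    }

The semaphore-based sketch above avoids that funnel: concurrent schedules can claim slots in parallel without going through a central message loop.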

Subsuming everything: these changes keep all other proposals intact (most notably Tyson's parallel executions and the overflow queue).

A note on the gains made by this: our non-blocking invoke performance is now quite close to the raw Kafka produce performance we have in the system (not that that's a good value in itself, but it's the next theoretical bottleneck). Before these changes, non-blocking invokes were roughly bottlenecked at 4k requests per second on a sufficiently powerful machine. Blocking invoke performance roughly doubled.

Any comments, thoughts?

Cheers,
Markus