Posted to dev@openwhisk.apache.org by Markus Thoemmes <ma...@de.ibm.com> on 2018/05/09 21:50:13 UTC

Proposal: Memory Aware Scheduling

Heya, lengthy text ahead, bear with me :)

# Proposal: Memory Aware Scheduling

## Problem

Today, an invoker has a fixed number of slots, which is determined by the number of CPUs * the number of shares per CPU. This does not reflect how a serverless platform actually "sells" its resources: The user defines the amount of **memory** she wants to use.

Instead of relying on CPU shares, I propose to schedule based on memory consumption.

An example:

Let's assume an invoker has 16 slots throughout the proposal. Today, OpenWhisk would schedule 16 * 512MB containers to one machine, even though it might not have sufficient memory to actually host those containers. The operator could adjust the core share to fit the worst-case number of MAX_MEMORY containers, but that's not feasible either, since it means unused memory when most users consume less memory.

After this proposal, OpenWhisk will schedule based on the available memory on the invoker machine, let's assume 4GB throughout the proposal. If all users use 512MB containers, fine, there will be at most 8 containers of those on one machine. On the other hand, if all users use 128MB containers, there will be 32 containers on one machine.

## Benefits

* Scheduling is tied to the resource the user can choose
* Allows a broader range of memory options (MAX_MEMORY could be equal to the memory available on the biggest machine in theory) without risking over-/undercommits

## Risks

* Fragmentation. It certainly becomes more of an issue the more options we provide. I'm wondering how much worse this will get compared to today though, given that we already "waste" 50% of the machine if users choose 128MB containers (16 slots * 128MB = 2GB of a 4GB machine, with the settings described above). Fragmentation is "configurable" by growing or shrinking the list of memory choices offered to the user.

## Implementation

The following changes are needed to implement memory based scheduling throughout the system.

### 1. Loadbalancer

The loadbalancer keeps track of busy/idle slots on each invoker today. In our example, there's a Semaphore with 16 slots which are given away during scheduling. The notion of slots can be kept, but their count will be calculated and given away differently. Assuming our 4GB invokers (for simplicity I propose homogeneous invokers as a first step), each invoker will get `AVAILABLE_MEMORY / MIN_MEMORY` slots (the maximum number of containers that can run in parallel), since that is the most containers that could be scheduled to it. When an action is scheduled, it needs to acquire `NEEDED_MEMORY / MIN_MEMORY` slots on an invoker. In the following, the number of slots assigned to a request will be called its "weight".
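
To make the arithmetic concrete, here is a minimal sketch of the slot and weight calculation, assuming `MIN_MEMORY` = 128MB and `AVAILABLE_MEMORY` = 4GB as in the example above (the names are illustrative, not the actual OpenWhisk identifiers):

```scala
// Illustrative sketch of the slot/weight arithmetic, not the actual OpenWhisk code.
object MemorySlots {
  val minMemoryMB       = 128  // smallest schedulable container size (MIN_MEMORY)
  val availableMemoryMB = 4096 // memory of one (homogeneous) invoker (AVAILABLE_MEMORY)

  // Slots per invoker: the maximum number of containers that fit on it.
  val slotsPerInvoker: Int = availableMemoryMB / minMemoryMB // 4096 / 128 = 32

  // Weight of a request: how many slots an action of the given memory size occupies.
  def weight(neededMemoryMB: Int): Int = neededMemoryMB / minMemoryMB
}

// MemorySlots.weight(512) == 4, MemorySlots.weight(256) == 2, MemorySlots.weight(128) == 1
```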

The loadbalancer should furthermore emit a metric of the maximum capacity of the system (`NUM_INVOKERS * AVAILABLE_MEMORY`) so the operator can tell how much is left on the system. It would also be useful to have this metric per invoker to be able to tell how much fragmentation there is in the system.
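
For illustration, a minimal sketch of how these two metrics could be computed (names and types are hypothetical, not the actual loadbalancer code):

```scala
// Hypothetical sketch of the capacity metrics, not the actual loadbalancer code.
final case class InvokerUsage(availableMemoryMB: Int, usedMemoryMB: Int) {
  def freeMemoryMB: Int = availableMemoryMB - usedMemoryMB
}

// Returns the total capacity of the system and the free memory per invoker.
// The per-invoker values show how fragmented the remaining capacity is.
def capacityMetrics(invokers: Seq[InvokerUsage]): (Int, Seq[Int]) = {
  val systemCapacityMB = invokers.map(_.availableMemoryMB).sum // NUM_INVOKERS * AVAILABLE_MEMORY
  val freePerInvokerMB = invokers.map(_.freeMemoryMB)
  (systemCapacityMB, freePerInvokerMB)
}
```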

### 2. Invoker

In the invoker, the `MessageFeed` logic will need adjustment. Like the loadbalancer, it could reduce its capacity based on the weight of each request rather than reducing it by the fixed value 1. Capacity starts at the maximum number of containers that can run in parallel (`AVAILABLE_MEMORY / MIN_MEMORY`) and is reduced by the number of slots taken by each request.

This cannot be implemented in an exact way though, since the consumer always pulls N records (and the records themselves don't carry a weight). For example, an invocation that needs 512MB gets 4 slots in our case. Upon releasing them, the feed needs to pull 4 new messages, since it doesn't know the weight of each message beforehand. If the next message itself has a weight of 4 though, 3 messages will stay in the buffer and wait for more capacity to be freed. In case of a crash, the messages in the buffer will be lost.
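
As an illustration of that buffering behavior, here is a minimal sketch of weight-based capacity accounting in a feed (hypothetical names, not the actual `MessageFeed` implementation):

```scala
import scala.collection.mutable

// Sketch only: capacity is measured in slots, each buffered message carries its weight.
class WeightedFeed(maxCapacity: Int) {
  private var freeCapacity = maxCapacity        // starts at AVAILABLE_MEMORY / MIN_MEMORY
  private val buffer = mutable.Queue.empty[Int] // weights of messages pulled but not yet started

  // A new message was pulled from the topic; its weight is only known now.
  def onMessage(weight: Int): Unit = { buffer.enqueue(weight); dispatch() }

  // An activation finished; its slots are free again, so buffered messages may proceed.
  def onRelease(weight: Int): Unit = { freeCapacity += weight; dispatch() }

  // Start buffered messages while enough capacity is free; otherwise they wait in the buffer
  // (and would be lost if the invoker crashes before they are processed).
  private def dispatch(): Unit =
    while (buffer.nonEmpty && buffer.head <= freeCapacity) {
      freeCapacity -= buffer.dequeue()
      // ... hand the activation to the container pool ...
    }
}
```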

This data loss is only relevant in a combined overload and crash scenario, since in any other case there should not be more messages on the invoker topic than the invoker can handle.

### 3. Configuration

The operator defines a list of possible values and the system determines at startup whether that list is usable for scheduling (as in: all values are divisible by the smallest value).

To configure the available memory per invoker, I propose a fixed value (4GB in our example) as a first step, passed to both controller and invoker. Future work can be to make this dynamic via the ping mechanism.
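
A minimal sketch of that startup check and the fixed per-invoker value (illustrative names, not the actual configuration keys):

```scala
// Sketch only: validate the operator-provided memory choices at startup.
def validMemoryChoices(choicesMB: Seq[Int]): Boolean = {
  require(choicesMB.nonEmpty, "at least one memory size must be configured")
  val min = choicesMB.min
  // Every configured size must be a multiple of the smallest one,
  // so that each choice maps to a whole number of slots.
  choicesMB.forall(_ % min == 0)
}

// validMemoryChoices(Seq(128, 256, 512)) == true
// validMemoryChoices(Seq(128, 192, 512)) == false (192 is not a multiple of 128)

// Hypothetical shape of the fixed per-invoker value, read by both controller and invoker.
final case class InvokerResources(availableMemoryMB: Int) // e.g. 4096 for the 4GB example
```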


Fingers bleeding, I'll stop here :)

Looking forward to your input!

Cheers,
Markus


Re: Proposal: Memory Aware Scheduling

Posted by David P Grove <gr...@us.ibm.com>.
Awesome!

I'm working on the matching PR for the kube-deploy repo now.

--dave


Christian Bickel <cb...@apache.org> wrote on 08/23/2018 05:11:56 AM:

> From: Christian Bickel <cb...@apache.org>
> To: dev@openwhisk.apache.org
> Date: 08/23/2018 05:12 AM
> Subject: Re: Proposal: Memory Aware Scheduling
>
> Hi everyone,
>
> The implementation of this proposal has just been merged with
> https://github.com/apache/incubator-openwhisk/commit/5b3e0b6a334b78fc783a2cd655f0f30ea58a68e8.
>
> Greetings
> Christian
> Am Do., 10. Mai 2018 um 13:36 Uhr schrieb Markus Thoemmes
> <ma...@de.ibm.com>:
> >
> > Thanks Dominic!
> >
> > Yep, that's exactly the thought.
> >
> > Towards your questions:
> >
> > # 1. How do loadbalancers keep the state:
> >
> > They stay as they are. The Semaphores today have the cpu-share
> based calculated slots and will have the memory based slots in the
> future. No change needed there in my opinion.
> >
> > # 2. How are slots shared among loadbalancers:
> >
> > Same answer as above: Like today! In your example, each
> loadbalancer will have 16 slots to give away (assuming 2
> controllers). This has a wrinkle in that the maximum possible memory
> size must be proportional to the amount of loadbalancers in the
> system. For a first step, this might be fine. In the future we need
> to implement vertical sharding where the loadbalancers divide the
> invoker pool to make bigger memory sizes possible again. Good one!
> >
> > Another wrinkle here is, that with an increasing amount of
> loadbalancers fragmentation gets worse. Again, I think for now this
> is acceptable in that the recommendation on the amount of
> controllers is rather small today.
> >
> > # 3. Throttling mechanism:
> >
> > Very good one, I missed that in my initial proposal: Today, we
> limit the number of concurrent activations, or differently phrased
> the number of slots occupied at any point in time. Likewise, the
> throttling can change to stay "number of slots occupied at any point
> in time" and will effectively limit the amount of memory a user can
> consume in the system, i.e. if a user has 1000 slots free, she can
> have 250x 512MB activations running, or 500x 256MB activations (or
> any mixture of course).
> >
> > It's important that we provide a migration path though as this
> will change the behavior in production systems. We could make the
> throttling strategy configurable and decide between
> "maximumConcurrentActivations", which ignores the weight of an
> action and behaves just like today and "memoryAwareWeights" which is
> the described new way of throttling.
> >
> > Cheers,
> > Markus
> >
>

Re: Proposal: Memory Aware Scheduling

Posted by Christian Bickel <cb...@apache.org>.
Hi everyone,

The implementation of this proposal has just been merged with
https://github.com/apache/incubator-openwhisk/commit/5b3e0b6a334b78fc783a2cd655f0f30ea58a68e8.

Greetings
Christian
Am Do., 10. Mai 2018 um 13:36 Uhr schrieb Markus Thoemmes
<ma...@de.ibm.com>:
>
> Thanks Dominic!
>
> Yep, that's exactly the thought.
>
> Towards your questions:
>
> # 1. How do loadbalancers keep the state:
>
> They stay as they are. The Semaphores today have the cpu-share based calculated slots and will have the memory based slots in the future. No change needed there in my opinion.
>
> # 2. How are slots shared among loadbalancers:
>
> Same answer as above: Like today! In your example, each loadbalancer will have 16 slots to give away (assuming 2 controllers). This has a wrinkle in that the maximum possible memory size must be proportional to the amount of loadbalancers in the system. For a first step, this might be fine. In the future we need to implement vertical sharding where the loadbalancers divide the invoker pool to make bigger memory sizes possible again. Good one!
>
> Another wrinkle here is, that with an increasing amount of loadbalancers fragmentation gets worse. Again, I think for now this is acceptable in that the recommendation on the amount of controllers is rather small today.
>
> # 3. Throttling mechanism:
>
> Very good one, I missed that in my initial proposal: Today, we limit the number of concurrent activations, or differently phrased the number of slots occupied at any point in time. Likewise, the throttling can change to stay "number of slots occupied at any point in time" and will effectively limit the amount of memory a user can consume in the system, i.e. if a user has 1000 slots free, she can have 250x 512MB activations running, or 500x 256MB activations (or any mixture of course).
>
> It's important that we provide a migration path though as this will change the behavior in production systems. We could make the throttling strategy configurable and decide between "maximumConcurrentActivations", which ignores the weight of an action and behaves just like today and "memoryAwareWeights" which is the described new way of throttling.
>
> Cheers,
> Markus
>

Re: Proposal: Memory Aware Scheduling

Posted by Markus Thoemmes <ma...@de.ibm.com>.
Thanks Dominic!

Yep, that's exactly the thought.

Towards your questions:

# 1. How do loadbalancers keep the state:

They stay as they are. The Semaphores today hold the CPU-share-based slot counts and will hold the memory-based slot counts in the future. No change needed there in my opinion.

# 2. How are slots shared among loadbalancers:

Same answer as above: like today! In your example, each loadbalancer will have 16 slots to give away (assuming 2 controllers). This has a wrinkle in that the maximum possible memory size is constrained by the number of loadbalancers in the system, since each loadbalancer only manages its share of an invoker's slots. For a first step, this might be fine. In the future we need to implement vertical sharding, where the loadbalancers divide up the invoker pool, to make bigger memory sizes possible again. Good one!

Another wrinkle is that fragmentation gets worse with an increasing number of loadbalancers. Again, I think this is acceptable for now, given that the recommended number of controllers is rather small today.

# 3. Throttling mechanism:

Very good one, I missed that in my initial proposal. Today we limit the number of concurrent activations, or phrased differently, the number of slots occupied at any point in time. The throttling can keep meaning "number of slots occupied at any point in time" and will then effectively limit the amount of memory a user can consume in the system, i.e. if a user has 1000 slots free, she can have 250x 512MB activations running, or 500x 256MB activations (or any mixture, of course).

It's important that we provide a migration path though, as this will change the behavior in production systems. We could make the throttling strategy configurable and decide between "maximumConcurrentActivations", which ignores the weight of an action and behaves just like today, and "memoryAwareWeights", which is the new, memory-aware way of throttling described above.
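
As a sketch of how such a configurable strategy could look (the two strategy names are taken from the suggestion above, everything else is illustrative):

```scala
// Sketch only: two interchangeable ways of counting an activation against the user's limit.
sealed trait ThrottlingStrategy {
  // How many throttling slots an activation of the given memory size consumes.
  def cost(memoryMB: Int, minMemoryMB: Int): Int
}

// Behaves like today: every activation counts as 1, regardless of its memory size.
case object MaximumConcurrentActivations extends ThrottlingStrategy {
  def cost(memoryMB: Int, minMemoryMB: Int): Int = 1
}

// New behavior: an activation counts with its memory-based weight.
case object MemoryAwareWeights extends ThrottlingStrategy {
  def cost(memoryMB: Int, minMemoryMB: Int): Int = memoryMB / minMemoryMB
}

// With MemoryAwareWeights and minMemoryMB = 128, a user with 1000 free slots can run
// 250 concurrent 512MB activations (cost 4) or 500 concurrent 256MB activations (cost 2).
```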

Cheers,
Markus


Re: Proposal: Memory Aware Scheduling

Posted by Dominic Kim <st...@gmail.com>.
Hi Markus.

Great proposal!
So, as per your proposal, the operator can now configure all memory types,
and the loadbalancer schedules actions with different weights on slots based
on their memory.
The maximum number of slots will be calculated from the available memory and
the minimum memory type.
In case we have 4GB with the following memory types, 128MB, 256MB, 512MB, the
slot count will be 32 (4GB/128MB) and the loadbalancer will assign 2 slots
and 4 slots for 256MB and 512MB actions respectively.

A few questions: how do loadbalancers keep the state of the memory footprint,
how are slots shared among loadbalancers, and are you also thinking of any
change to the throttling mechanism?
As one action can now occupy multiple slots, there might be some changes to
throttling as well.

And I expect that with this change, the operator can configure the types and
the number of prewarm containers as well.

Thanks
Regards
Dominic





2018-05-10 6:50 GMT+09:00 Markus Thoemmes <ma...@de.ibm.com>:

> Heya, lengthy text ahead, bear with me :)
>
> # Proposal: Memory Aware Scheduling
>
> ## Problem
>
> Today, an invoker has a fixed number of slots, which is determined by the
> number of CPUs * the number of shares per CPU. This does not reflect how a
> serverless platform actually "sells" its resources: The user defines the
> amount of **memory** she wants to use.
>
> Instead of relying on the cpu shares I propose to schedule based on memory
> consumption.
>
> An example:
>
> Let's assume an invoker has 16 slots throughout the proposal. Today,
> OpenWhisk would schedule 16 * 512MB containers to one machine, even though
> it might not have sufficient memory to actually host those containers. The
> operator could adjust the coreshare accordingly to fit the worst-case
> amount of MAX_MEMORY containers, but that's not feasible either since that
> means unused memory when most users consume less memory.
>
> After this proposal, OpenWhisk will schedule based on the available memory
> on the invoker machine, let's assume 4GB throughout the proposal. If all
> users use 512MB containers, fine, there will be at most 8 containers of
> those on one machine. On the other hand, if all users use 128MB containers,
> there will be 32 containers on one machine.
>
> ## Benefits
>
> * Scheduling is tied to the resource the user can choose
> * Allows a broader range of memory options (MAX_MEMORY could be equal to
> the memory available on the biggest machine in theory) without risking
> over-/undercommits
>
> ## Risks
>
> * Fragmentation. It's certainly an issue the more options we provide. I'm
> wondering how much worse this will get compared to today though, given we
> already "waste" 50% of the machine, if users choose 128MB containers (and
> the settings are as described above). Fragmentation is "configurable" by
> making the list of choices the user has greater or smaller.
>
> ## Implementation
>
> The following changes are needed to implement memory based scheduling
> throughout the system.
>
> ### 1. Loadbalancer
>
> The loadbalancer keeps track of busy/idle slots on each invoker today. In
> our example, there's a Semaphore with 16 slots which are given away during
> scheduling. The notion of slots can be kept, but their count will be
> calculated/given away differently. Assuming our 4GB invokers (for
> simplicity I propose homogeneous invokers as a first step), each invoker
> will get `AVAILABLE_MEMORY / MIN_MEMORY` slots (the most parallel amount of
> containers possible) since that is the most containers that could be
> scheduled to it. When an action is scheduled, it will need to get
> `NEEDED_MEMORY / MIN_MEMORY` slots to be scheduled to an invoker.
> Following, the assignment of slots to a request will be called "weight".
>
> The loadbalancer should furthermore emit a metric of the maximum capacity
> of the system (`NUM_INVOKERS * AVAILABLE_MEMORY`) so the operator can tell
> how much is left on the system. It would also be useful to have this metric
> per invoker to be able to tell how much fragmentation there is in the
> system.
>
> ### 2. Invoker
>
> In the invoker, the `MessageFeed` logic will need adjustment. Like the
> loadbalancer, it could reduce its capacity based on the weight of each
> request rather than reducing it by the fixed value 1. Capacity starts with
> the most parallel amount of containers possible (`AVAILABLE_MEMORY /
> MIN_MEMORY`) and is degraded by the number of slots taken by each request.
>
> This cannot be implemented in an exact way though, since the consumer
> always pulls N records (and records themselves don't have a weight). For
> example an invocation which needs 512MB in our case gets 4 slots. Upon
> releasing this, the feed needs to pull 4 new messages, since it doesn't
> know what the weight of each message is beforehand. If the next message has
> a weight of 4 though, 3 messages will stay in the buffer and need to wait
> for more capacity to be freed. In case of a crash, the messages in the
> buffer will be lost.
>
> The data loss is only relevant in an overload + crash scenario since in any
> other case, there should not be more messages on the invoker topic than it
> can handle.
>
> ### 3. Configuration
>
> The operator defines a list of possible values and the system determines
> at startup whether that list is usable for scheduling (as in: all values
> are divisible by the smallest value).
>
> To configure the available memory per invoker I propose a fixed value (4GB
> in our example) first, passed to both controller and invoker. Future work
> can be to make this dynamic via the ping mechanism.
>
>
> Fingers bleeding, I'll stop here :)
>
> Looking forward to your input!
>
> Cheers,
> Markus
>
>