Posted to commits@openwhisk.apache.org by cb...@apache.org on 2018/07/03 07:00:11 UTC

[incubator-openwhisk] branch master updated: Add documentation to the loadbalancer. (#3778)

This is an automated email from the ASF dual-hosted git repository.

cbickel pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/incubator-openwhisk.git


The following commit(s) were added to refs/heads/master by this push:
     new 06818a4  Add documentation to the loadbalancer. (#3778)
06818a4 is described below

commit 06818a4a8056aea4c0c555033f2a947ff15e33fa
Author: Markus Thömmes <ma...@me.com>
AuthorDate: Tue Jul 3 09:00:04 2018 +0200

    Add documentation to the loadbalancer. (#3778)
    
    * Add documentation to the loadbalancer.
    
    * Add information on the overflow and other edge cases.
    
    * Incorporating more feedback to make the prose description clearer.
    
    * Clarify capacity determination.
    
    * Clarify health protocol.
---
 .../ShardingContainerPoolBalancer.scala            | 90 +++++++++++++++++++++-
 1 file changed, 88 insertions(+), 2 deletions(-)

diff --git a/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala b/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala
index eac4aff..72124ec 100644
--- a/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala
+++ b/core/controller/src/main/scala/whisk/core/loadBalancer/ShardingContainerPoolBalancer.scala
@@ -45,10 +45,96 @@ import scala.concurrent.{ExecutionContext, Future, Promise}
 import scala.util.{Failure, Success}
 
 /**
- * A loadbalancer that uses "horizontal" sharding to not collide with fellow loadbalancers.
+ * A loadbalancer that schedules workload based on a hashing algorithm.
+ *
+ * ## Algorithm
+ *
+ * First, a hash is calculated for every namespace + action pair, and an invoker is picked based on that hash
+ * (`hash % numInvokers`). The determined index is the so-called "home-invoker". This is the invoker where the following
+ * progression will **always** start. If this invoker is healthy (see "Invoker health checking") and if there is
+ * capacity on that invoker (see "Capacity checking"), the request is scheduled to it.
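+ *
+ * A minimal, illustrative sketch of this first step (the function name and the concrete hash calculation are
+ * illustrative, not the actual implementation):
+ *
+ * {{{
+ * // Illustrative only: derive a stable hash from the namespace + action pair and pick the home-invoker.
+ * def homeInvoker(namespace: String, action: String, numInvokers: Int): Int = {
+ *   val hash = math.abs((namespace + action).hashCode)
+ *   hash % numInvokers
+ * }
+ * }}}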
+ *
+ * If one of these prerequisites is not met, the index is incremented by a step-size. The available step-sizes are all
+ * numbers smaller than the number of invokers that are coprime to it (coprime, to minimize collisions while
+ * progressing through the invokers). The step-size is picked using the same hash calculated above
+ * (`hash % numStepSizes`). The home-invoker-index is then incremented by the step-size and the checks
+ * (healthy + capacity) are repeated on the invoker we land on.
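+ *
+ * A minimal, illustrative sketch of deriving the step-sizes and picking one. It additionally keeps the chosen
+ * step-sizes pairwise coprime, which yields the 1, 3, 7 sequence used in the example below (names are illustrative):
+ *
+ * {{{
+ * // Illustrative only: step-sizes coprime to the invoker count (and to each other).
+ * def gcd(a: Int, b: Int): Int = if (b == 0) a else gcd(b, a % b)
+ *
+ * def stepSizes(numInvokers: Int): Seq[Int] =
+ *   (1 until numInvokers).foldLeft(Seq.empty[Int]) { (acc, i) =>
+ *     if (gcd(i, numInvokers) == 1 && acc.forall(gcd(i, _) == 1)) acc :+ i else acc
+ *   }
+ *
+ * // stepSizes(10) == Seq(1, 3, 7); with hash = 13: stepSizes(10)(13 % 3) == 3
+ * }}}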
+ *
+ * This procedure is repeated until all invokers have been checked, at which point the "overload" strategy is
+ * employed: a healthy invoker is chosen randomly. In a steadily running system, reaching this overload state means
+ * that no invoker has any capacity left to schedule the current request to.
+ *
+ * If no invokers are available or if there are no healthy invokers in the system, the loadbalancer will return an error
+ * stating that no invokers are available to take any work. Requests are not queued anywhere in this case.
+ *
+ * An example:
+ * - availableInvokers: 10 (all healthy)
+ * - hash: 13
+ * - homeInvoker: hash % availableInvokers = 13 % 10 = 3
+ * - stepSizes: 1, 3, 7 (note that 2 and 5 are not part of this because they are not coprime to 10)
+ * - stepSizeIndex: hash % numStepSizes = 13 % 3 = 1 => stepSize = 3
+ *
+ * Progression to check the invokers: 3, 6, 9, 2, 5, 8, 1, 4, 7, 0 --> done
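+ *
+ * A sketch reproducing the progression above from the example values (illustrative, not the actual implementation):
+ *
+ * {{{
+ * // Illustrative only: numInvokers = 10, homeInvoker = 3, stepSize = 3.
+ * val (numInvokers, homeInvoker, stepSize) = (10, 3, 3)
+ * val probeOrder = (0 until numInvokers).map(i => (homeInvoker + i * stepSize) % numInvokers)
+ * // probeOrder == Vector(3, 6, 9, 2, 5, 8, 1, 4, 7, 0)
+ * }}}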
+ *
+ * This heuristic is based on the assumption that the chance of getting a warm container is highest on the home invoker
+ * and degrades with every further step in the progression. The hashing makes sure that all loadbalancers in a cluster
+ * will always pick the same home invoker and follow the same progression for a given action.
+ *
+ * Known caveats:
+ * - This assumption is not always true. For instance, two heavy workloads landing on the same home invoker can displace
+ *   each other's containers, which results in many cold starts because the invoker keeps evicting one workload's
+ *   containers to make space for the other. Future work could keep a small history of the invokers last scheduled to
+ *   for each action and prefer the most recently used one, then the second most recent, and so forth.
+ *
+ * ## Capacity checking
+ *
+ * The maximum capacity per invoker is configured using `invoker-busy-threshold`, which is the maximum number of actions
+ * running in parallel on that invoker.
+ *
+ * Spare capacity is determined by what the loadbalancer thinks it has scheduled to each invoker. Upon scheduling, the
+ * bookkeeping is updated and a slot in a Semaphore is taken. That slot is only released after the response from the
+ * invoker (active-ack) arrives **or** after the active-ack times out. The Semaphore has as many slots as are
+ * configured via `invoker-busy-threshold`.
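+ *
+ * A minimal sketch of that bookkeeping, assuming one plain Semaphore per invoker (the variable names are
+ * illustrative, not the actual implementation):
+ *
+ * {{{
+ * // Illustrative only: one semaphore per invoker, sized by invoker-busy-threshold.
+ * import java.util.concurrent.Semaphore
+ *
+ * val busyThreshold = 16 // whatever invoker-busy-threshold is configured to
+ * val invokerSlots = new Semaphore(busyThreshold)
+ *
+ * if (invokerSlots.tryAcquire()) {
+ *   // schedule the activation to this invoker; release() once the active-ack arrives or times out
+ * } else {
+ *   // no spare capacity here, continue with the next invoker in the progression
+ * }
+ * }}}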
+ *
+ * Known caveats:
+ * - In an overload scenario, activations are queued directly to the invokers, which makes the active-ack timeout
+ *   unpredictable. Timing out active-acks in that case can cause the loadbalancer to prematurely assign new load to an
+ *   overloaded invoker, which can lead to uneven queues.
+ * - The same is true if an invoker is extraordinarily slow in processing activations. If the invoker is slow enough
+ *   that active-acks time out while it still sends its pings, the queue on this invoker will slowly grow, and the
+ *   loadbalancer will again think there is capacity when there is none.
+ *
+ * Both caveats could be addressed in future work by not queueing to the invoker topics on overload, but to a
+ * centralized overflow topic instead. Timing out an active-ack could then be treated as a system-error, as described
+ * in the following section.
+ *
+ * ## Invoker health checking
+ *
+ * Invoker health is determined via a Kafka-based protocol, where each invoker pings the loadbalancer every second. If
+ * no ping is seen for a defined amount of time, the invoker is considered "Offline".
+ *
+ * Moreover, results from all activations are inspected. If more than 3 out of the last 10 activations contained system
+ * errors, the invoker is considered "Unhealthy". If an invoker is unhealthy, no user workload is sent to it, but
+ * test-actions are sent by the loadbalancer to check whether system errors still occur. If the number of system
+ * errors in the last 10 activations falls below 3, the invoker is considered "Healthy" again.
+ *
+ * To summarize:
+ * - "Offline": Ping missing for > 10 seconds
+ * - "Unhealthy": > 3 **system-errors** in the last 10 activations, pings arriving as usual
+ * - "Healthy": < 3 **system-errors** in the last 10 activations, pings arriving as usual
+ *
+ * ## Horizontal sharding
+ *
+ * Sharding is employed to avoid the loadbalancers having to share any data, because the metrics used in scheduling
+ * change very quickly.
  *
  * Horizontal sharding means, that each invoker's capacity is evenly divided between the loadbalancers. If an invoker
- * has at most 16 slots available, those will be divided to 8 slots for each loadbalancer (if there are 2).
+ * has at most 16 slots available (invoker-busy-threshold = 16), those will be divided into 8 slots for each
+ * loadbalancer (if there are 2).
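+ *
+ * For illustration, the per-loadbalancer share could be computed as follows (variable names are illustrative):
+ *
+ * {{{
+ * // Illustrative only: each loadbalancer only uses its share of every invoker's slots.
+ * val invokerBusyThreshold = 16
+ * val numLoadbalancers = 2
+ * val slotsPerLoadbalancer = invokerBusyThreshold / numLoadbalancers // == 8
+ * }}}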
+ *
+ * Known caveats:
+ * - If a loadbalancer leaves or joins the cluster, all state is discarded and rebuilt from scratch. Such events should
+ *   not happen often.
  */
 class ShardingContainerPoolBalancer(config: WhiskConfig, controllerInstance: ControllerInstanceId)(
   implicit val actorSystem: ActorSystem,