Posted to dev@openwhisk.apache.org by Steffen Rost <sr...@de.ibm.com> on 2019/10/18 14:00:52 UTC

Make the formula to calculate the action time limit more configurable

Recently, we observed a considerable number of forced completion acks on 
our systems. While forced completion acks make sense in some scenarios, 
they cause trouble in ours; please see below for details. As a band-aid 
for our scenario, we want to make the forced completion ack timeout more 
configurable.

The completion ack timeout is the time within which a completion ack must 
be received for an activation. It is calculated from the action time 
limit. The current formula is: 
(actionTimeLimit.max(TimeLimit.STD_DURATION) * lbConfig.timeoutFactor) + 
1.minute  (for implementation details, please follow the link under [1])

The default timeout factor is 2, which is based on the invoker behavior 
that a cold invocation's init duration may be as long as its run duration. 
Based on this formula, the calculated completion ack timeout for an action 
with a time limit of 60 seconds is 180 seconds (60 s * 2 + 60 s).
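
As a worked example of the formula above, the following Scala sketch 
recomputes the 180 seconds. It is only an illustration and not the 
OpenWhisk code: StdDuration stands in for TimeLimit.STD_DURATION and is 
assumed to be one minute here, and the timeoutFactor parameter stands in 
for lbConfig.timeoutFactor.

```scala
import scala.concurrent.duration._

object CompletionAckTimeoutSketch {
  // Assumption: stand-in for TimeLimit.STD_DURATION, taken to be one minute.
  val StdDuration: FiniteDuration = 1.minute

  // Same shape as the formula quoted above; timeoutFactor stands in for
  // lbConfig.timeoutFactor (default 2).
  def completionAckTimeout(actionTimeLimit: FiniteDuration, timeoutFactor: Int = 2): FiniteDuration =
    (actionTimeLimit.max(StdDuration) * timeoutFactor) + 1.minute

  def main(args: Array[String]): Unit = {
    // 60 s action limit: (60 s * 2) + 60 s = 180 s
    println(completionAckTimeout(60.seconds).toSeconds)
    // Limits below the assumed standard duration are raised to it first,
    // so a 10 s limit also yields 180 s.
    println(completionAckTimeout(10.seconds).toSeconds)
  }
}
```

Note that the constant 1.minute is the only term that cannot be changed 
without touching the code.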

The motivation behind the completion ack timeout and discarding 
activations from the system that do not complete within that time is to 
not wait "forever" for activations that get lost. This could happen if 
activations were already read and committed from the Kafka topic by the 
message feed but their processing is still in flight while at the same 
time the invoker is restarted for whatever reason.

While invoker restarts will remain the exception, in our environment image 
pulls for cold black box invocations often take a long time and exceed the 
calculated completion ack timeout for these invocations. By discarding 
activations that are still being processed by an invoker, the controller's 
bookkeeping is invalidated step by step, because the controller assumes 
that each discarded invocation frees up one invoker slot when in fact it 
does not. As a consequence, the controller makes wrong decisions and, even 
worse, its out-of-sync bookkeeping does not repair itself but stays in this 
state as long as the workload remains high. Activations have to wait for 
processing on the chosen invoker because no free slots are available, so 
they may exceed their completion ack timeout as well and end up being 
discarded by the controller.
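
To illustrate how the two views drift apart, here is a heavily simplified 
Scala sketch. It is a toy model under my own assumptions and does not 
reflect the controller's data structures: Slots, controllerView and 
invokerActual are hypothetical names, and the model only counts busy slots 
on a single invoker.

```scala
object BookkeepingDriftSketch {
  // Hypothetical model: busy slots as seen by the controller and as they
  // really are on the invoker.
  final case class Slots(controllerView: Int, invokerActual: Int) {
    // An activation is scheduled: both sides count one more busy slot.
    def scheduled: Slots = Slots(controllerView + 1, invokerActual + 1)
    // A regular completion ack: both sides release the slot.
    def completed: Slots = Slots(controllerView - 1, invokerActual - 1)
    // A forced completion ack: only the controller releases the slot,
    // the invoker is still working on the activation.
    def forceCompleted: Slots = copy(controllerView = controllerView - 1)
    // Busy slots that the controller wrongly believes to be free.
    def drift: Int = invokerActual - controllerView
  }

  def main(args: Array[String]): Unit = {
    val afterThreeScheduled = Slots(0, 0).scheduled.scheduled.scheduled
    val afterTwoForced = afterThreeScheduled.forceCompleted.forceCompleted
    // The controller counts 1 busy slot while the invoker still has 3.
    println(afterTwoForced)
    println(afterTwoForced.drift)
  }
}
```

In this model every forced completion of an activation that is still 
running adds one to the drift, and only regular completions keep both 
views aligned.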

To make a long story short, we would like to make the constant duration of 
1 minute configurable. By increasing this duration to an appropriate value, 
and with it the calculated completion ack timeout, we think we can avoid 
the forced completion of activations in our system in many of the 
situations we observed in the past.
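
A minimal Scala sketch of what this could look like, assuming the constant 
moves into the load balancer configuration next to the timeout factor. 
LbTimeoutConfig and the field name timeoutAddon are hypothetical and chosen 
for illustration only, and StdDuration is again an assumed stand-in for 
TimeLimit.STD_DURATION.

```scala
import scala.concurrent.duration._

object ConfigurableAckTimeoutSketch {
  // Hypothetical config holder; both field names are illustrative only.
  // The defaults reproduce today's behavior (factor 2, constant 1 minute).
  final case class LbTimeoutConfig(timeoutFactor: Int = 2, timeoutAddon: FiniteDuration = 1.minute)

  // Assumption: stand-in for TimeLimit.STD_DURATION.
  val StdDuration: FiniteDuration = 1.minute

  // Same formula as before, with the constant read from the config.
  def completionAckTimeout(actionTimeLimit: FiniteDuration, config: LbTimeoutConfig): FiniteDuration =
    (actionTimeLimit.max(StdDuration) * config.timeoutFactor) + config.timeoutAddon

  def main(args: Array[String]): Unit = {
    val current = completionAckTimeout(60.seconds, LbTimeoutConfig())
    val relaxed = completionAckTimeout(60.seconds, LbTimeoutConfig(timeoutAddon = 5.minutes))
    println(s"default: ${current.toSeconds} s, with a 5-minute constant: ${relaxed.toSeconds} s")
  }
}
```

Keeping 1 minute as the default would leave existing deployments unchanged, 
while operators with slow image pulls could raise the constant without 
changing the factor.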

Please let me know what you think.


[1] 
https://github.com/apache/openwhisk/blob/81ac503f7efc8ee99ea1a37ef9ec3d6163d96c85/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/CommonLoadBalancer.scala#L86-L104


Kind regards
Steffen Rost
------------------------------------------------------------------------------------------------------------------------------------------
IBM Cloud Functions Development
Phone +49-7031-16-4841 (Fax: -3545)
E-Mail: srost@de.ibm.com
------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Chairman of the Supervisory Board: Matthias Hartmann -- Management: 
Dirk Wittkopp
Registered office: Böblingen / Court of registration: Amtsgericht Stuttgart, 
HRB 243294

Re: Make the formula to calculate the action time limit more configurable

Posted by David P Grove <gr...@us.ibm.com>.

"Steffen Rost" <sr...@de.ibm.com> wrote on 10/18/2019 10:00:52 AM:
>
> To make a long story short we would like to have the possibility to have
> the constant duration of 1 minute configurable. By increasing the duration
> to an appropriate number and by this the calculated completion ack timeout
> we think we can avoid the forced completion of activations in our system
> for many of the situations we observed in the past.
>

This makes sense to me.  Thanks for the detailed explanation of why this is
an important value to be able to configure.

--dave

RE: Make the formula to calculate the action time limit more configurable

Posted by Steffen Rost <sr...@de.ibm.com>.

Thanks for the discussion. 

The problem with doing the image pull beforehand is that it is hard to 
predict if and when a certain blackbox action is really invoked. To shift 
the pull to create time, the controller could execute a newly created 
blackbox action as a kind of NOP on the home invoker, but we haven't looked 
into those strategies in detail. Besides this, we also encountered 
situations where we observed a high number of forced completion acks in our 
system that were not caused by image pulls.

For now I will prepare a PR to make the constant duration of 1 minute that 
is currently used in the code configurable.



> Makes sense to me also.
>
> Docker images are also not capped and always pulled on demand. Have you
> considered an alternate and more efficient management strategy that shifts
> the pulls to create time perhaps?




Re: Make the formula to calculate the action time limit more configurable

Posted by Rodric Rabbah <ro...@gmail.com>.

Makes sense to me also.

Docker images are also not capped and always pulled on demand. Have you
considered an alternate and more efficient management strategy that shifts
the pulls to create time perhaps?

-r