Posted to dev@openwhisk.apache.org by Steffen Rost <sr...@de.ibm.com> on 2019/10/18 14:00:52 UTC
Make the formula to calculate the action time limit more configurable
Recently, we observed a considerable number of forced completion acks on
our systems. While forced completion acks make sense in some scenarios,
they cause trouble in our scenario - please see below for details. As a
band-aid for our scenario, we want to make the forced completion ack
timeout more configurable.
The completion ack timeout is the period within which a completion ack must be
received for an activation. It is calculated based on the action time
limit. The current formula is:
(actionTimeLimit.max(TimeLimit.STD_DURATION) * lbConfig.timeoutFactor) +
1.minute (for implementation details please follow the link under [1])
The default timeout factor is 2, which is based on the invoker behavior that a
cold invocation's init duration may be as long as its run duration. Based
on this formula, the calculated completion ack timeout for an action with a
time limit of 60 seconds is 180 seconds.
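As a rough illustration of the formula above, here is a minimal Python sketch that mirrors the Scala expression. The value of STD_DURATION (60 seconds) is an assumption that is consistent with the 180-second example; the function and parameter names are hypothetical and not taken from the OpenWhisk code base.

```python
from datetime import timedelta

# Assumption: TimeLimit.STD_DURATION is 60 seconds. This is consistent with the
# 60s -> 180s example in the text, but it is not stated there explicitly.
STD_DURATION = timedelta(seconds=60)

# The constant that is currently hard-coded in the formula.
ONE_MINUTE = timedelta(minutes=1)


def completion_ack_timeout(action_time_limit: timedelta,
                           timeout_factor: int = 2) -> timedelta:
    """Mirror of (actionTimeLimit.max(STD_DURATION) * timeoutFactor) + 1.minute."""
    # Limits shorter than the standard duration are raised to it.
    effective_limit = max(action_time_limit, STD_DURATION)
    return effective_limit * timeout_factor + ONE_MINUTE


for seconds in (10, 60, 300):
    timeout = completion_ack_timeout(timedelta(seconds=seconds))
    print(f"limit={seconds}s -> completion ack timeout={timeout.total_seconds():.0f}s")
# limit=10s -> completion ack timeout=180s
# limit=60s -> completion ack timeout=180s
# limit=300s -> completion ack timeout=660s
```

Under these assumptions, no action gets a completion ack timeout below 180 seconds, because short limits are raised to the standard duration before the factor is applied.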
The motivation behind the completion ack timeout and discarding
activations from the system that do not complete within that time is to
not wait "forever" for activations that get lost. This could happen if
activations were already read and committed from the Kafka topic by the
message feed but their processing is still in flight while at the same
time the invoker is restarted for whatever reason.
While invoker restarts will remain the exception, we often see image pulls
for cold blackbox invocations that take a long time and exceed the
calculated completion ack timeout for these invocations in our environment.
By discarding activations that are still being processed by an invoker, the
controller's bookkeeping is invalidated step by step, because the controller
assumes that each discarded invocation frees up one invoker slot when in
fact it does not. As a consequence, the controller makes wrong scheduling
decisions. Even worse, its out-of-sync bookkeeping does not repair itself
but remains in this state as long as the workload remains high. Activations
have to wait for processing on the chosen invoker because no free slots are
available, so they may in turn exceed their completion ack timeout and end
up being discarded by the controller.
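The drift described above can be sketched with a toy model in Python. This is not how the OpenWhisk load balancer is implemented; the class, its fields, and the slot counts are all hypothetical and only illustrate how the controller's view and the invoker's real load diverge after forced completions.

```python
from dataclasses import dataclass


@dataclass
class SlotBookkeeping:
    """Toy model of one invoker's slots as seen by controller and invoker."""
    capacity: int
    controller_in_flight: int = 0  # what the controller believes is running
    invoker_in_flight: int = 0     # what the invoker has actually accepted

    def schedule(self) -> bool:
        """Controller sends an activation if it believes a slot is free."""
        if self.controller_in_flight >= self.capacity:
            return False
        self.controller_in_flight += 1
        self.invoker_in_flight += 1  # may exceed capacity: the activation waits
        return True

    def forced_completion(self) -> None:
        """Timeout fires: the controller frees its slot, the invoker keeps working."""
        self.controller_in_flight -= 1

    @property
    def waiting_on_invoker(self) -> int:
        """Activations the invoker accepted but has no free slot for."""
        return max(0, self.invoker_in_flight - self.capacity)


books = SlotBookkeeping(capacity=2)
books.schedule()
books.schedule()            # both slots busy, e.g. with slow image pulls
books.forced_completion()
books.forced_completion()   # controller now believes both slots are free
books.schedule()
books.schedule()            # controller sends two more activations
print(books.controller_in_flight, books.invoker_in_flight, books.waiting_on_invoker)
# 2 4 2
```

In this model, the two extra activations wait on the invoker while their own completion ack timers are already running, which is the self-reinforcing effect described above.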
To make a long story short, we would like to make the constant duration of
1 minute configurable. By increasing this duration to an appropriate value,
and with it the calculated completion ack timeout, we think we can avoid the
forced completion of activations in our system in many of the situations we
observed in the past.
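A minimal Python sketch of what the proposal could look like follows. The setting name timeout_addon, the config class, and the 60-second STD_DURATION are assumptions made for illustration; the actual configuration key would be decided in the PR.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumption: TimeLimit.STD_DURATION is 60 seconds.
STD_DURATION = timedelta(seconds=60)


@dataclass(frozen=True)
class LoadBalancerConfig:
    """Hypothetical stand-in for the load balancer configuration."""
    timeout_factor: int = 2
    # Hypothetical setting that replaces the hard-coded 1.minute; the default
    # keeps today's behavior.
    timeout_addon: timedelta = timedelta(minutes=1)


def completion_ack_timeout(action_time_limit: timedelta,
                           config: LoadBalancerConfig) -> timedelta:
    """Same formula as today, with the constant taken from the configuration."""
    effective_limit = max(action_time_limit, STD_DURATION)
    return effective_limit * config.timeout_factor + config.timeout_addon


limit = timedelta(seconds=60)
default_config = LoadBalancerConfig()
patient_config = LoadBalancerConfig(timeout_addon=timedelta(minutes=10))

print(completion_ack_timeout(limit, default_config).total_seconds())  # 180.0
print(completion_ack_timeout(limit, patient_config).total_seconds())  # 720.0
```

Keeping one minute as the default would leave existing deployments unchanged, while operators with slow image pulls could raise the value.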
Please let me know what you think.
[1]
https://github.com/apache/openwhisk/blob/81ac503f7efc8ee99ea1a37ef9ec3d6163d96c85/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/CommonLoadBalancer.scala#L86-L104
Mit freundlichen Gruessen / Kind regards
Steffen Rost
------------------------------------------------------------------------------------------------------------------------------------------
IBM Cloud Functions Development
Phone +49-7031-16-4841 (Fax: -3545)
E-Mail: srost@de.ibm.com
------------------------------------------------------------------------------------------------------------------------------------------
IBM Deutschland Research & Development GmbH
Vorsitzender des Aufsichtsrats: Matthias Hartmann -- Geschäftsführung:
Dirk Wittkopp
Sitz der Gesellschaft: Böblingen / Registergericht: Amtsgericht Stuttgart,
HRB 243294
Re: Make the formula to calculate the action time limit more configurable
Posted by David P Grove <gr...@us.ibm.com>.
"Steffen Rost" <sr...@de.ibm.com> wrote on 10/18/2019 10:00:52 AM:
>
> To make a long story short we would like to have the possibility to have
> the constant duration of 1 minute configurable. By increasing the duration
> to an appropriate number and by this the calculated completion ack timeout
> we think we can avoid the forced completion of activations in our system
> for many of the situations we observed in the past.
>
This makes sense to me. Thanks for the detailed explanation of why this is
an important value to be able to configure.
--dave
RE: Make the formula to calculate the action time limit more configurable
Posted by Steffen Rost <sr...@de.ibm.com>.
Thanks for the discussion.
The problem with doing the image pull beforehand is that it is hard to
predict if and when a certain blackbox action will actually be invoked. To
shift the pull to create time, the controller could execute a newly created
blackbox action as a kind of NOP on the home invoker, but we haven't looked
into those strategies in detail. Besides this, we also encountered
situations where we observed a high number of forced completion acks in our
system that were not caused by image pulls.
For now I will prepare a PR to make the constant duration of 1 minute that
is currently used in the code configurable.
> Makes sense to me also.
> Docker images are also not capped and always pulled on demand. Have you
> considered an alternate and more efficient management strategy that shifts
> the pulls to create time perhaps?
Mit freundlichen Gruessen / Kind regards
Steffen Rost
Re: Make the formula to calculate the action time limit more configurable
Posted by Rodric Rabbah <ro...@gmail.com>.
Makes sense to me also.
Docker images are also not capped and always pulled on demand. Have you
considered an alternate and more efficient management strategy that shifts
the pulls to create time perhaps?
-r