Posted to issues@openwhisk.apache.org by GitBox <gi...@apache.org> on 2017/12/04 09:42:21 UTC

[GitHub] mhenke1 commented on issue #3040: Adjust controller side action time-out to avoid invokers marked as unhealthy

URL: https://github.com/apache/incubator-openwhisk/pull/3040#issuecomment-348910022
 
 
   @rabbah Let's take the case of two controllers in HA mode. If state synchronization between the controllers is not perfect, more than 16 actions can be scheduled to one invoker. 16 of them get executed and the others are queued.
   
   When all 16 executing actions run to their maximal run time and expire, the next batch (the one waiting in the queue) can start after around one minute of elapsed time. If these next actions also run to their maximal run time, they return after around two minutes
   (1 minute waiting for the first set of actions to stop plus 1 minute run time plus some overhead).
   
   At the moment the controller only allows two minutes for an action to finish. So we might have the case that actions complete shortly after that two-minute limit and get regarded as failed by the controller. If we have too many of those cases, the given invoker will be regarded as unhealthy.
   
   Over the last few days we have seen a lot of these cases, where invokers were marked as unhealthy and recovered after a short time. In these cases the invokers were busy with batches of actions that were all timing out at nearly the same time.
   
   If we add more controllers to the HA game, the summed-up waiting time might go up even more. Let's take the most pathological case, in which the HA state is not synced at all*.
   In this case all n controllers might place actions on one and the same invoker. The resulting overall wait time is the time for the first batch to be executed plus the time for the (n-1) batches waiting in the queue. Therefore with this PR the wait time is calculated as (1 + (n-1)) minutes + some overhead, i.e. n minutes plus overhead.
   
   *Of course this last case is unlikely and hopefully a theoretical edge case. 
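
   To make the arithmetic above concrete, here is a minimal sketch of the calculation. It is not the code from this PR: the function name, its parameters, and the 30-second overhead used in the example are all hypothetical, and the comment above only says "some overhead" without giving a value.

```python
from datetime import timedelta


def controller_timeout(controllers: int,
                       max_action_duration: timedelta,
                       overhead: timedelta) -> timedelta:
    """Worst-case time the controller should wait for an action result.

    Hypothetical helper illustrating the formula from the comment:
    one batch running plus (n - 1) batches queued behind it, plus overhead.
    """
    if controllers < 1:
        raise ValueError("at least one controller is required")
    first_batch = max_action_duration
    queued_batches = (controllers - 1) * max_action_duration
    return first_batch + queued_batches + overhead


# Assumed example values: 1 minute maximal run time (from the comment),
# 30 seconds overhead (placeholder, not taken from the PR).
two_controllers = controller_timeout(2, timedelta(minutes=1), timedelta(seconds=30))
print(two_controllers)
```

   With two controllers and these assumed values the result is two and a half minutes, which is why a fixed two-minute limit can expire just before the queued actions return.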

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services