Posted to issues@openwhisk.apache.org by GitBox <gi...@apache.org> on 2022/07/19 01:37:05 UTC

[GitHub] [openwhisk] style95 commented on issue #5286: [New Scheduler] Brief etcd unavailability can result in specific action queue to get stuck if unlucky

style95 commented on issue #5286:
URL: https://github.com/apache/openwhisk/issues/5286#issuecomment-1188500211

   Let me look into it.
   But at first glance I see some issues and have some questions.
   
   I found the default lease timeout is 1 second.
   https://github.com/apache/openwhisk/blob/master/ansible/group_vars/all#L467
   Intermittent network disruptions can happen at any time, so this should be longer than 1s.
   With 1s, even a short network outage can easily break the system; we use 10s in our downstream deployment.
   I think it is better to update the default.
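A rough way to see why a 1-second lease is fragile: an etcd lease expires if no keep-alive reaches the server within its TTL. The sketch below is a simplified, hypothetical model, not OpenWhisk or etcd code. It assumes keep-alives are sent every TTL/3 (what etcd's Go client does; the client used by OpenWhisk may behave differently), ignores network latency, and conservatively checks whether an outage of a given length could let the lease expire.

```python
from typing import Optional


def lease_survives(ttl_s: float, outage_s: float,
                   keepalive_interval_s: Optional[float] = None) -> bool:
    """Conservative check of whether a lease outlives a network outage.

    Hypothetical model: keep-alives are sent every `keepalive_interval_s`
    (default TTL/3) and the last successful one went out just before the
    outage began, so the next successful one may arrive up to one interval
    after the outage ends. The lease survives if that gap stays under the TTL.
    """
    if keepalive_interval_s is None:
        keepalive_interval_s = ttl_s / 3
    worst_gap_s = outage_s + keepalive_interval_s  # upper bound on the gap
    return worst_gap_s < ttl_s


for ttl in (1, 10):
    for outage in (0.5, 2, 5):
        print(f"ttl={ttl}s outage={outage}s survives={lease_survives(ttl, outage)}")
```

Under these assumptions a 1s lease is already lost in a 2s outage, while a 10s lease tolerates a 5s one.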
   
   Regarding the error `The activation has not been processed`, the queue manager is supposed to retry until it fetches the endpoint: up to 13 times with exponential backoff starting at 1ms, for a total wait of around 8 seconds (`1ms + 2ms + 4ms + ... + 4096ms`).
   Did you see any logs of those retries?
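The 8-second figure follows from a doubling schedule: 13 delays of 1, 2, 4, ..., 4096 ms sum to 2^13 - 1 = 8191 ms. A minimal sketch of that arithmetic, assuming pure doubling with no jitter or cap (the actual queue manager code may add either):

```python
from typing import List


def backoff_delays_ms(attempts: int = 13, base_ms: int = 1) -> List[int]:
    """Delays for a pure doubling backoff: base, 2*base, 4*base, ..."""
    return [base_ms * 2**i for i in range(attempts)]


delays = backoff_delays_ms()
print(delays[0], delays[-1], sum(delays))  # 1 4096 8191
```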
   
   Also, when an endpoint is removed from etcd while the queue actually still exists, the system is supposed to keep restoring the etcd data until its deletion is explicitly requested.
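To illustrate the intended invariant, here is a toy, in-memory sketch. All names, including the key format, are hypothetical, and a plain dict stands in for etcd (no real etcd API is used). When a key disappears, e.g. because its lease expired, the owner writes it back as long as the queue is alive and no explicit deletion was requested.

```python
from typing import Dict, Set


class QueueEndpointKeeper:
    """Toy model: re-registers a queue endpoint that vanished from the store,
    unless its removal was explicitly requested."""

    def __init__(self, store: Dict[str, str]) -> None:
        self.store = store                      # stands in for etcd
        self.live_queues: Dict[str, str] = {}   # key -> endpoint this node owns
        self.delete_requested: Set[str] = set()

    def register(self, key: str, endpoint: str) -> None:
        self.live_queues[key] = endpoint
        self.store[key] = endpoint

    def request_delete(self, key: str) -> None:
        # Explicit deletion: forget the queue and remove its data for good.
        self.delete_requested.add(key)
        self.live_queues.pop(key, None)
        self.store.pop(key, None)

    def on_key_removed(self, key: str) -> None:
        # Would be called when a watch reports the key is gone (e.g. lease expiry).
        if key in self.live_queues and key not in self.delete_requested:
            self.store[key] = self.live_queues[key]


store: Dict[str, str] = {}
keeper = QueueEndpointKeeper(store)
keeper.register("queue/ns/action", "scheduler0:8080")
del store["queue/ns/action"]            # simulate lease expiry
keeper.on_key_removed("queue/ns/action")
print(store)  # {'queue/ns/action': 'scheduler0:8080'}
```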
   
   Regarding the error `No scheduler endpoint available`, it follows the existing behavior of the ShardingPoolBalancer, which does not retry when there is an issue with Kafka.
   https://github.com/apache/openwhisk/blob/master/core/controller/src/main/scala/org/apache/openwhisk/core/loadBalancer/CommonLoadBalancer.scala#L210
   But I feel it would be better to add a retry mechanism here too.
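A minimal sketch of what such a retry could look like, assuming a hypothetical `lookup` callable that returns a scheduler endpoint or `None` when none is registered yet. It is not the CommonLoadBalancer API; it only shows the shape: re-query a few times with a growing pause, then fail with the same error as today.

```python
import time
from typing import Callable, Optional


def find_scheduler_endpoint(
    lookup: Callable[[], Optional[str]],
    attempts: int = 5,
    first_delay_s: float = 0.05,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Retry a (hypothetical) endpoint lookup before giving up."""
    delay = first_delay_s
    for attempt in range(attempts):
        endpoint = lookup()
        if endpoint is not None:
            return endpoint
        if attempt < attempts - 1:  # no pointless sleep after the last try
            sleep(delay)
            delay *= 2
    raise LookupError("No scheduler endpoint available")


# Example: the endpoint shows up on the third lookup (sleep stubbed out).
answers = iter([None, None, "scheduler1"])
print(find_scheduler_endpoint(lambda: next(answers), sleep=lambda s: None))
```

Injecting `sleep` keeps the helper testable without real delays; the attempt count and delays are placeholders, not values taken from OpenWhisk.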
   
   In general, there was a bug in the code, and we recently fixed it with the following PR.
   https://github.com/apache/openwhisk/pull/5251
   I need to check whether it could have caused a regression.
   
   
   @ningyougang @jiangpengcheng 
   Do you have any ideas?
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
