
Posted to issues@openwhisk.apache.org by GitBox <gi...@apache.org> on 2022/09/01 16:54:01 UTC

[GitHub] [openwhisk] bdoyle0182 opened a new issue, #5322: [New Scheduler] Container Unpausing Path is Suboptimal Due to Potential Function Cache Miss

bdoyle0182 opened a new issue, #5322:
URL: https://github.com/apache/openwhisk/issues/5322

   The new scheduler sends a container creation message from the scheduler to the invoker over Kafka. When the invoker consumes this message, it immediately does a get from the artifact db to retrieve the function. However, container creation messages can be sent even when there is already a warm container available to execute the function; in the code this is referred to as a `warmedCreation`. This can occur when a container is in the paused state, or when one of the FSMs has gone into a paused / idle state and needs to be woken up.
   
   The problem is that every container creation message triggers an attempt to load the function from the db into the invoker, so if the cache entry has been invalidated the function has to be downloaded again. This is suboptimal: if we know the action name and revision to execute and there is a matching container, that should be all the information we need to deliver the warm execution. The download can be very expensive if the function attachment is tens of MB.
   
   I believe the problem is exacerbated on the new scheduler for two reasons: 1. there is no longer a concept of a home invoker, which used to make it likely that the function was constantly refreshed in the cache, and 2. the cache is now only refreshed on cold starts / container creation messages. With the old scheduling algorithm, the db hit attempt for the function occurs on every activation received, and the cache invalidation time is access based, so the timeout clock is refreshed on every activation within an invoker.
   
   Take for example the configuration of:
   `pauseGrace=5 minutes`
   `idleTimeout=10 minutes`
   
    The cache timeout is hardcoded to 5 minutes and is not currently configurable.
    
    So a cold start occurs and the function is downloaded for the first time. No other execution comes in for five minutes, so the container is paused. Right around this time, at minute five, the cache invalidates the entry since it hasn't been refreshed. At minute 8 a new activation comes in and the scheduler decides to send a warmed container creation message to wake up the container, which is paused but still exists. The invoker consumes the message and attempts to get the function from the cache, but the entry has been invalidated and the function has to be re-downloaded.
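
To make the timeline above concrete, here is a minimal Python sketch of an access-based expiring cache driven by a simulated clock in minutes. The class `AccessExpiringCache`, the `download_function` stand-in, and the cache key are hypothetical and are not OpenWhisk code; only the 5 minute timeout and the lookups at minute 0 and minute 8 come from the example.

```python
class AccessExpiringCache:
    """Toy cache whose entries expire ttl_minutes after their last access."""

    def __init__(self, ttl_minutes):
        self.ttl = ttl_minutes
        self.entries = {}  # key -> (value, minute of last access)

    def get(self, key, now, loader):
        """Return (value, hit). On a miss, call loader() and store the result."""
        entry = self.entries.get(key)
        if entry is not None and now - entry[1] < self.ttl:
            self.entries[key] = (entry[0], now)  # a hit resets the expiry clock
            return entry[0], True
        value = loader()
        self.entries[key] = (value, now)
        return value, False


downloads = []


def download_function():
    # Stand-in for the expensive artifact db fetch, attachment included.
    downloads.append("download")
    return "function-code"


cache = AccessExpiringCache(ttl_minutes=5)

# Minute 0: cold start, the function is downloaded and cached.
_, cold_hit = cache.get("ns/action@rev", now=0, loader=download_function)

# Minute 5: the container is paused and the untouched entry expires.
# Minute 8: the warmed creation message arrives and looks the function up again.
_, warm_hit = cache.get("ns/action@rev", now=8, loader=download_function)

print(cold_hit, warm_hit, len(downloads))
```

Both lookups miss in this toy model, so the function is downloaded twice even though the paused container already holds the code.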
   
   A simple way to reduce the likelihood of this is to increase the cache invalidation time to something like 1 hour. However, that doesn't solve the problem, because the cache entry's timer is only restarted on cold starts.
   
   Let's look at the same example again, but instead of the function executing only once, let's say it takes 50ms to run and an execute request is sent every 75ms. The initial cold start creates the cache entry, and the scheduler then reuses this same container every 75ms for two hours. During this time the cache is never refreshed, due to the nature of the new scheduler. At the one hour mark the cache entry is invalidated. At the two hour mark there is a five minute gap in function execution and the container is paused. Execution then resumes after those five minutes, and the first execution, which wakes up the container, has to re-download the function attachment.
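
The second scenario can be checked with a similarly small simulation. This is only a sketch of the behaviour as this paragraph describes it: the function name `wakeup_is_cache_hit` is hypothetical, traffic is sampled once per minute rather than every 75ms, and the wake-up is placed at minute 125 (two hours of traffic plus the five minute gap).

```python
TTL_MINUTES = 60      # the "1 hour" cache invalidation time from the example
TRAFFIC_MINUTES = 120  # two hours of steady executions
WAKEUP_MINUTE = 125    # traffic stops at 120, container is paused, then woken up


def wakeup_is_cache_hit(refresh_on_activation):
    """Return True if the lookup done by the wake-up finds a live cache entry."""
    last_access = 0  # the cold start at minute 0 populates the cache
    for minute in range(1, TRAFFIC_MINUTES + 1):
        if refresh_on_activation:
            # Old behaviour: every activation touches the cache entry.
            last_access = minute
        # New behaviour as described here: warm activations never touch it.
    return WAKEUP_MINUTE - last_access < TTL_MINUTES


print("refresh on every activation:", wakeup_is_cache_hit(True))
print("refresh only on cold start: ", wakeup_is_cache_hit(False))
```

With a refresh on every activation the entry is only five minutes old at wake-up; without it the entry is 125 minutes old and has long expired, no matter how busy the container was.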
   
   So the real solution here, imo, is to take function loading off the critical path of a warm execution in all cases, which I think should be doable. Or, if that's not possible and the function metadata may need to be loaded, optimize to not load the attachment after getting the action document, because the db action document itself is guaranteed to be <1 MB and fetching it should take only milliseconds.
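
A rough sketch of what "off the critical path" could look like, in Python for brevity. Everything here is hypothetical (`CreationMessage`, `handle_creation`, `load_action`, the container map) and is not the invoker's actual code; it only illustrates skipping the load when a paused container already matches the action name and revision.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CreationMessage:
    action: str    # fully qualified action name
    revision: str  # revision the scheduler wants to run
    warmed: bool   # True for a warmedCreation


def handle_creation(msg, paused_containers, load_action):
    """Resume a matching paused container without loading the function."""
    key = (msg.action, msg.revision)
    container = paused_containers.get(key)
    if msg.warmed and container is not None:
        # Name and revision match: the container already holds the code.
        return "resumed", container
    # Real cold start: the function has to be loaded (cache or db).
    return "created", load_action(msg.action, msg.revision)


loads = []


def load_action(action, revision):
    # Stand-in for the cache lookup plus db/attachment download.
    loads.append((action, revision))
    return f"code for {action}@{revision}"


paused = {("ns/hello", "3-abc"): "container-42"}
warm = handle_creation(CreationMessage("ns/hello", "3-abc", True), paused, load_action)
cold = handle_creation(CreationMessage("ns/other", "1-def", False), paused, load_action)
print(warm, cold, loads)
```

Only the cold start reaches `load_action`; the warmed creation never touches the cache or the db.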
   
   This is for the most part a tp99 problem, so not too bad, but it can be really detrimental for large function packages.
   
   As a side issue that I noticed while investigating this: the `getDocument` metric stops recording once the document is retrieved, before the post-processing step that downloads the attachment. We need either a separate metric for the attachment download or to include it in the `getDocument` recording. Because the `getDocument` metric showed a 20ms tp99, it took me a long time to realize what was really happening here.
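
To illustrate the metric gap, here is a small Python sketch with a fake clock. The timer helper, the metric names `getAttachment` and `getActionTotal`, and the 900ms attachment time are all made up for illustration; only the 20ms `getDocument` figure comes from the observation above.

```python
from contextlib import contextmanager


class FakeClock:
    """Deterministic clock in milliseconds so the example needs no sleeping."""

    def __init__(self):
        self.now = 0

    def advance(self, ms):
        self.now += ms


timings = {}


@contextmanager
def timed(name, clock):
    """Record how much clock time elapsed inside the with-block under name."""
    start = clock.now
    try:
        yield
    finally:
        timings[name] = timings.get(name, 0) + (clock.now - start)


clock = FakeClock()

with timed("getActionTotal", clock):      # spans both steps
    with timed("getDocument", clock):     # what the existing metric covers
        clock.advance(20)                 # small action document
    with timed("getAttachment", clock):   # hypothetical separate metric
        clock.advance(900)                # large function attachment

print(timings)
```

Looking at `getDocument` alone suggests a 20ms operation, while the separate or enclosing timer exposes where the time actually goes.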
    
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@openwhisk.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org

[GitHub] [openwhisk] bdoyle0182 commented on issue #5322: [New Scheduler] Container Unpausing Path is Suboptimal Due to Potential Function Cache Miss

Posted by GitBox <gi...@apache.org>.

bdoyle0182 commented on issue #5322:
URL: https://github.com/apache/openwhisk/issues/5322#issuecomment-1234619671

   Actually, it seems the cache is still used in the function pulling container proxy when fetching every new activation. Maybe it is just a matter of increasing the cache timeout to be larger than the max container idle time. Will try that and report back.



[GitHub] [openwhisk] bdoyle0182 closed issue #5322: [New Scheduler] Container Unpausing Path is Suboptimal Due to Potential Function Cache Miss

Posted by GitBox <gi...@apache.org>.

bdoyle0182 closed issue #5322: [New Scheduler] Container Unpausing Path is Suboptimal Due to Potential Function Cache Miss
URL: https://github.com/apache/openwhisk/issues/5322



[GitHub] [openwhisk] bdoyle0182 commented on issue #5322: [New Scheduler] Container Unpausing Path is Suboptimal Due to Potential Function Cache Miss

Posted by GitBox <gi...@apache.org>.

bdoyle0182 commented on issue #5322:
URL: https://github.com/apache/openwhisk/issues/5322#issuecomment-1279394540

   Fixed with periodic cache pinging from the container proxy.
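
The actual change is not shown in this thread. As a rough illustration of the idea only, the following Python sketch assumes a live container touches its cache entry on a fixed interval, so that an access-based timeout never elapses while the container exists. The function `wakeup_hits_cache` and the 2 minute ping interval are hypothetical.

```python
def wakeup_hits_cache(ttl, wakeup_minute, ping_every=None):
    """Return True if the lookup at wakeup_minute finds a live cache entry.

    The entry is created at minute 0 (cold start) and expires ttl minutes
    after its last access. If ping_every is set, the container touches the
    entry every ping_every minutes until the wake-up.
    """
    last_access = 0
    if ping_every is not None:
        for minute in range(ping_every, wakeup_minute, ping_every):
            # A ping either refreshes a live entry or reloads an expired one
            # in the background; both leave a fresh entry behind.
            last_access = minute
    return wakeup_minute - last_access < ttl


# 5 minute cache timeout, wake-up at minute 8, as in the original example.
print("without pinging:", wakeup_hits_cache(ttl=5, wakeup_minute=8))
print("ping every 2 min:", wakeup_hits_cache(ttl=5, wakeup_minute=8, ping_every=2))
```

Without pinging the entry is 8 minutes old at wake-up and has expired; with pings at minutes 2, 4 and 6 it is only 2 minutes old and the wake-up is a cache hit.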

