Posted to issues@openwhisk.apache.org by GitBox <gi...@apache.org> on 2022/09/14 22:25:33 UTC

[GitHub] [openwhisk] bdoyle0182 opened a new issue, #5325: [New Scheduler] Container Unpausing results in key remaining in etcd for existing container

bdoyle0182 opened a new issue, #5325:
URL: https://github.com/apache/openwhisk/issues/5325

   This is consistently reproducible for me now. When a container is unpaused, its keys are first unwatched and possibly deleted before the unpause operation completes. At the same time as the unpause operation, a container remove can happen (I'm still trying to figure out what triggers this, but my theory was that unwatching the container keys leads whatever is watching them to request that the container be deleted). When the unpause operation happens, it seems that no matter the result of the operation, the future that requested the unpause always completes successfully. This results in the etcd key for the container being re-written as part of the `.map` after the unpause future completes. Since the container was deleted, this key is no longer tracked and will live indefinitely unless the component is restarted. This is a critical bug for me moving forward with the scheduler since it's so easily reproducible.
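
   To make the ordering concrete, here is a minimal plain-Scala sketch of the race (not OpenWhisk code; every name is a stand-in): the key removal fires immediately, the re-registration is deferred to a `.map` on the resume future, and a concurrent container removal slots in between, so a key is left behind for a container that no longer exists.

   ```
   import scala.concurrent.{Await, Future}
   import scala.concurrent.ExecutionContext.Implicits.global
   import scala.concurrent.duration._

   object UnpauseRaceSketch extends App {
     // stand-in for the etcd key space; both keys exist while the container is paused
     val etcd = scala.collection.concurrent.TrieMap(
       "warmed/container-1"   -> "",
       "existing/container-1" -> "")

     // stand-in for the slow docker unpause
     def resumeContainer(): Future[Unit] = Future { Thread.sleep(100) }

     // stand-in for the concurrent container removal, which also deletes the container's keys
     def removeContainer(): Unit = etcd.remove("existing/container-1")

     etcd.remove("warmed/container-1") // 1. unregister fired immediately, not chained to the resume

     // 2. the "existing" key is re-registered only after the resume future completes
     val resumed = resumeContainer().map(_ => etcd.put("existing/container-1", ""))

     removeContainer() // 3. the container (and its key) is removed before the resume finishes

     Await.ready(resumed, 5.seconds)
     println(s"keys left behind for a deleted container: ${etcd.keys.toList}") // the leaked key
   }
   ```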
   
   Relevant logs (a couple of them I added locally to follow the exact code path better):
   
   ```
   [2022-09-14T00:00:12.526Z] [INFO] [#tid_sid_unknown] [ContainerManager] Choose a warmed container whisk/warmed
   
   [2022-09-14T00:00:12.580Z] [INFO] [#tid_sid_unknown] [FunctionPullingContainerPool] Choose warmed container
   
   [2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_containerCreation] [DockerClientWithFileAccess] running /usr/bin/docker unpause ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d (timeout: 10 seconds)
   
   #these two unwatch logs are from above where I said the unregister data can happen before the register data when the unpause completes
   [2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(whisk/warmed
   
   [2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_unknown] [WatcherService] unwatch endpoint: UnwatchEndpoint(whisk/namespace
   
   [2022-09-14T00:00:12.581Z] [INFO] [#tid_sid_invokerNanny] [DockerClientWithFileAccess] running /usr/bin/docker rm -f ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d (timeout: 1 minute)
   
   #added this log locally indicating the etcd key is removed
   [2022-09-14T00:00:12.585Z] [INFO] [#tid_sid_unknown] [EtcdWorker] successfully removed whisk/warmed
   
   [2022-09-14T00:00:12.588Z] [INFO] [#tid_sid_unknown] [EtcdWorker] successfully removed whisk/namespace
   
   [2022-09-14T00:00:12.595Z] [INFO] [#tid_DgxZq5kjsHIwixkSsI5onK8zYSgpOkxM] [MemoryQueue] Remove containerId because ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d is not alive
   
   #unpause operation completes
   [2022-09-14T00:00:12.705Z] [INFO] [#tid_sid_containerCreation] [FunctionPullingContainerProxy] Resumed container ContainerId(ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d)
   
   [2022-09-14T00:00:12.706Z] [INFO] [#tid_sid_unknown] [WatcherService] watch endpoint: WatchEndpoint(whisk/namespace
   
   #key is added back to etcd for deleted container
   [2022-09-14T00:00:12.711Z] [INFO] [#tid_sid_unknown] [EtcdWorker] successfully stored whisk/namespace
   
   #added this log locally to denote when the memory queue sees an existing container and adds it to its existing set
   [2022-09-14T00:00:12.711Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [redacted:Running] start tracking container ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d
   
   [2022-09-14T00:11:09.557Z] [INFO] [#tid_sid_unknown] [MemoryQueue] [redacted:Running] The queue is timed out but there are still 0 activation messages or (running: 1 -> Set(ca688d9b746b1da6a992e8c43e03a6df44a56608ec20ce3cccff07dd0940a22d), in-progress: 0 -> Set()) containers
   ```
   
   Relevant code:
   
   ```
     when(Paused) {
       case Event(job: Initialize, data: WarmData) =>
         implicit val transId = job.transId
         val parent = context.parent
         cancelTimer(IdleTimeoutName)
         cancelTimer(KeepingTimeoutName)
         data.container
           .resume()
           .map { _ =>
             logging.info(this, s"Resumed container ${data.container.containerId}")
             // put existing key again
             dataManagementService ! RegisterData(
               s"${ContainerKeys.existingContainers(data.invocationNamespace, data.action.fullyQualifiedName(true), data.action.rev, Some(instance), Some(data.container.containerId))}",
               "")
             parent ! Resumed(data)
          // the new queue may be located on a different scheduler, so recreate the activation client when necessary
          // since the akka port will not be used, we can put any value except 0 here
             data.clientProxy ! RequestActivation(
               newScheduler = Some(SchedulerEndpoints(job.schedulerHost, job.rpcPort, 10)))
             startSingleTimer(UnusedTimeoutName, StateTimeout, unusedTimeout)
             timedOut = false
           }
           .recover {
             case t: Throwable =>
               logging.error(this, s"Failed to resume container ${data.container.containerId}, error: $t")
               parent ! ResumeFailed(data)
               self ! ResumeFailed(data)
           }
   
         // always clean data in etcd regardless of success and failure
         dataManagementService ! UnregisterData(
           ContainerKeys.warmedContainers(
             data.invocationNamespace,
             data.action.fullyQualifiedName(false),
             data.revision,
             instance,
             data.container.containerId))
         goto(Running)
   ```
   
   You can see here that `dataManagementService` sends the request to unregister the etcd keys without chaining it to the unpause operation, so the unregister can happen before the docker unpause occurs. The `.map` is always entered even if the docker unpause fails (verifiable through the unpause error metric emitted from `LoggingMarkers.INVOKER_DOCKER_CMD`). The only thing left for me to really figure out is why `docker rm` occurs at the same time as the unpause operation. Maybe it's as simple as the unpause failing for whatever reason and triggering the removal, in which case maybe the fix is as simple as making sure the etcd write doesn't occur if the unpause fails; but I'm not yet convinced that the `docker rm` doesn't occur because of some other side effect.
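
   For reference, the behaviour described above, where the helper's future completes successfully even though the underlying docker command failed, so the caller's `.map` still runs, looks like this in a minimal plain-Scala sketch (illustrative names only, not the actual `resume()` implementation):

   ```
   import scala.concurrent.{Await, Future}
   import scala.concurrent.ExecutionContext.Implicits.global
   import scala.concurrent.duration._

   object SwallowedFailureSketch extends App {
     // hypothetical resume helper that logs the underlying failure but still completes successfully
     def resume(): Future[Unit] =
       Future.failed[Unit](new RuntimeException("docker unpause failed"))
         .recover { case t => println(s"swallowed inside the helper: ${t.getMessage}") }

     val result = resume()
       .map(_ => println("caller's .map runs: etcd key gets rewritten"))    // fires despite the failed unpause
       .recover { case t => println(s"caller's .recover never fires: $t") } // never reached

     Await.ready(result, 1.second)
   }
   ```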


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@openwhisk.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [openwhisk] bdoyle0182 closed issue #5325: [New Scheduler] Container unpausing results in key remaining in etcd for deleted container

Posted by GitBox <gi...@apache.org>.
bdoyle0182 closed issue #5325: [New Scheduler] Container unpausing results in key remaining in etcd for deleted container
URL: https://github.com/apache/openwhisk/issues/5325




[GitHub] [openwhisk] bdoyle0182 commented on issue #5325: [New Scheduler] Container unpausing results in key remaining in etcd for deleted container

Posted by GitBox <gi...@apache.org>.
bdoyle0182 commented on issue #5325:
URL: https://github.com/apache/openwhisk/issues/5325#issuecomment-1247467187

   Actually it's a bit simpler: it's an unhandled failure case in the `Paused` state, as this is the only place that could return an etcd error in the `FunctionPullingContainerProxy`:
   
   ```
       case Event(StateTimeout, data: WarmData) =>
         (for {
           count <- getLiveContainerCount(data.invocationNamespace, data.action.fullyQualifiedName(false), data.revision)
           (warmedContainerKeepingCount, warmedContainerKeepingTimeout) <- getWarmedContainerLimit(
             data.invocationNamespace)
         } yield {
           logging.info(
             this,
             s"Live container count: ${count}, warmed container keeping count configuration: ${warmedContainerKeepingCount} in namespace: ${data.invocationNamespace}")
           if (count <= warmedContainerKeepingCount) {
             Keep(warmedContainerKeepingTimeout)
           } else {
             Remove
           }
         }).pipeTo(self)
         stay 
   ```
   
   The state times out and the query to etcd fails, which pipes the failure message back to the FSM. However, that message is unhandled in the `Paused` state, so it ends up getting stashed by this catch-all, and the container sits around waiting indefinitely until a new activation comes in:
   
   `case _ => delay`
   
   A new activation comes in and tries to wake up the warmed container, playing out all of the events I described in my previous messages, starting with the attempt to unpause the container. The proxy then transitions to `Running`, and `onTransition` it unstashes the etcd failure message. So we just need to put proper failure handling on the state timeout event for `Paused`. For additional confirmation that this is exactly what's happening, the gap in the logs between when the container was paused and when the new activation came in is much greater than the idleTimeout.
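
   One possible shape for that handler, mirroring the `FailureMessage` case the `Running` state already has (illustrative only, not a tested patch):

   ```
       // handle the piped etcd failure directly in Paused instead of letting the
       // catch-all `case _ => delay` stash it; Remove is the same message the
       // StateTimeout handler above already pipes back to self, so Paused knows how to act on it
       case Event(t: FailureMessage, _: WarmData) =>
         logging.error(this, s"failed to decide whether to keep the paused container: $t")
         self ! Remove
         stay
   ```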
   
   




[GitHub] [openwhisk] style95 commented on issue #5325: [New Scheduler] Container unpausing results in key remaining in etcd for deleted container

Posted by GitBox <gi...@apache.org>.
style95 commented on issue #5325:
URL: https://github.com/apache/openwhisk/issues/5325#issuecomment-1247996217

   Awesome analysis!




[GitHub] [openwhisk] bdoyle0182 commented on issue #5325: [New Scheduler] Container unpausing results in key remaining in etcd for deleted container

Posted by GitBox <gi...@apache.org>.
bdoyle0182 commented on issue #5325:
URL: https://github.com/apache/openwhisk/issues/5325#issuecomment-1247415358

   Figured it out.
   
   So when the container is unpaused by the `FunctionPullingContainerProxy`, the proxy changes state to `Running` while the async unpause code is still running; a failure message is then received in `Running`, triggering container cleanup while the unpause is in progress.
   
   So here is the exact order of events:
   
   1. The unpause starts and the FSM goes to `Running`:
   
   ```
         implicit val transId = job.transId
         val parent = context.parent
         cancelTimer(IdleTimeoutName)
         cancelTimer(KeepingTimeoutName)
         data.container
           .resume()
           .map { _ =>
             logging.info(this, s"Resumed container ${data.container.containerId}")
             // put existing key again
             dataManagementService ! RegisterData(
               s"${ContainerKeys.existingContainers(data.invocationNamespace, data.action.fullyQualifiedName(true), data.action.rev, Some(instance), Some(data.container.containerId))}",
               "")
             parent ! Resumed(data)
          // the new queue may be located on a different scheduler, so recreate the activation client when necessary
          // since the akka port will not be used, we can put any value except 0 here
             data.clientProxy ! RequestActivation(
               newScheduler = Some(SchedulerEndpoints(job.schedulerHost, job.rpcPort, 10)))
             startSingleTimer(UnusedTimeoutName, StateTimeout, unusedTimeout)
             timedOut = false
           }
           .recover {
             case t: Throwable =>
               logging.error(this, s"Failed to resume container ${data.container.containerId}, error: $t")
               parent ! ResumeFailed(data)
               self ! ResumeFailed(data)
           }
   
         // always clean data in etcd regardless of success and failure
         dataManagementService ! UnregisterData(
           ContainerKeys.warmedContainers(
             data.invocationNamespace,
             data.action.fullyQualifiedName(false),
             data.revision,
             instance,
             data.container.containerId))
         goto(Running)
   ```
   
   2. Some future triggered by the FSM to talk to etcd fails and pipes the failure message back to the FSM, which is now in the `Running` state:
   
   ```
       case Event(t: FailureMessage, data: WarmData) =>
         logging.error(this, s"failed to init or run in state: $stateName caused by: $t")
         // Stop containerProxy and ActivationClientProxy both immediately,
         // and don't send unhealthy state message to the health manager, it's already sent.
         cleanUp(
           data.container,
           data.invocationNamespace,
           data.action.fullyQualifiedName(withVersion = true),
           data.action.rev,
           Some(data.clientProxy))
   ```
   
   This is evidenced by an additional log I found, which lines up to the exact microsecond with the container removal in the initially shared logs:
   
   ```
   [2022-09-14T08:33:34.837Z] [ERROR] [#tid_sid_unknown] [FunctionPullingContainerProxy] failed to init or run in state: Running caused by: Failure(io.grpc.StatusRuntimeException: INTERNAL: HTTP/2 error code: PROTOCOL_ERRORReceived Goaway[p]req HEADERS: max concurrency reached)
   ```
   
   3. The original unpause and the cleanup are now occurring at the same time. Since the state has already transitioned to `Running`, the cleanup skips its own resume attempt and goes straight to `destroyContainer`, while the original unpause succeeds and its subsequent `.map` runs to rewrite the etcd keys:
   
   ```
     private def cleanUp(container: Container, clientProxy: Option[ActorRef], replacePrewarm: Boolean = true): State = {
   
       context.parent ! ContainerRemoved(replacePrewarm)
       val unpause = stateName match {
         case Paused => container.resume()(TransactionId.invokerNanny)
         case _      => Future.successful(())
       }
       unpause.andThen {
         case Success(_) => destroyContainer(container)
         case Failure(t) =>
           // docker may hang when try to remove a paused container, so we shouldn't remove it
           logging.error(this, s"Failed to resume container ${container.containerId}, error: $t")
       }
       clientProxy match {
         case Some(clientProxy) => clientProxy ! StopClientProxy
         case None              => self ! ClientClosed
       }
       gotoIfNotThere(Removing)
     }
   ```
   
   4. The unpause operation successfully completes and rewrites the key to etcd after the container is destroyed or while it's in the process of being destroyed.
   
   So what's the best way to go about adding extra protection here? I think the easiest thing to do would be to stay in the `Paused` state until the unpause operation completes, and then on completion send a message within the `Paused` state indicating success or failure so it can decide what to do next (on success, go to `Running`).
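
   A rough sketch of that shape (untested; `ContainerResumed` is a hypothetical internal message and `existingContainerKey` a placeholder helper, neither exists today):

   ```
       case Event(job: Initialize, data: WarmData) =>
         // stay Paused while the resume future is in flight and pipe its outcome back to the FSM
         data.container
           .resume()
           .map(_ => ContainerResumed(job))
           .pipeTo(self) // a failed resume arrives here as a FailureMessage
         stay

       case Event(ContainerResumed(job), data: WarmData) =>
         // only now is it safe to re-register the existing-container key and transition
         dataManagementService ! RegisterData(existingContainerKey(data), "")
         goto(Running)

       // a FailureMessage handler in Paused would then cover the failure path
   ```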

