Posted to issues@openwhisk.apache.org by GitBox <gi...@apache.org> on 2022/09/15 00:00:09 UTC

[GitHub] [openwhisk] bdoyle0182 commented on issue #5325: [New Scheduler] Container unpausing results in key remaining in etcd for deleted container

bdoyle0182 commented on issue #5325:
URL: https://github.com/apache/openwhisk/issues/5325#issuecomment-1247415358

   Figured it out.
   
   So when the container is unpaused by the `FunctionPullingContainerProxy`, the FSM transitions to `Running` while the async unpause code is still running, and a failure message received in that window triggers container cleanup while the unpause is still in progress.
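
   To make the timing concrete, here's a minimal standalone sketch (plain Scala, not OpenWhisk code; all names are made up) of how a `.map` callback on an async resume only fires after the state change that follows it has already taken effect:

   ```
   import scala.concurrent.{Await, Promise}
   import scala.concurrent.ExecutionContext.Implicits.global
   import scala.concurrent.duration._

   object ResumeRaceSketch {
     @volatile private var state = "Paused" // stands in for the FSM's current state

     def main(args: Array[String]): Unit = {
       val resume = Promise[Unit]() // stands in for the async container.resume()

       // corresponds to the .map in the unpause path: it only runs once resume()
       // completes, i.e. after the state below has already changed
       val callbackDone = resume.future.map { _ =>
         println(s"resume callback fires while state is already '$state'")
       }

       state = "Running" // corresponds to goto(Running): takes effect right away
       println(s"state is '$state' before resume has completed")

       resume.success(()) // the docker unpause finishes later
       Await.ready(callbackDone, 1.second)
     }
   }
   ```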
   
   So here is the exact order of events:
   
   1. Unpause starts and the FSM immediately goes to `Running`:
   
   ```
         implicit val transId = job.transId
         val parent = context.parent
         cancelTimer(IdleTimeoutName)
         cancelTimer(KeepingTimeoutName)
         data.container
           .resume()
           .map { _ =>
             logging.info(this, s"Resumed container ${data.container.containerId}")
             // put existing key again
             dataManagementService ! RegisterData(
               s"${ContainerKeys.existingContainers(data.invocationNamespace, data.action.fullyQualifiedName(true), data.action.rev, Some(instance), Some(data.container.containerId))}",
               "")
             parent ! Resumed(data)
             // the new queue may be located on a different scheduler, so recreate the activation client when necessary
             // since the akka port will not be used, we can put any value except 0 here
             data.clientProxy ! RequestActivation(
               newScheduler = Some(SchedulerEndpoints(job.schedulerHost, job.rpcPort, 10)))
             startSingleTimer(UnusedTimeoutName, StateTimeout, unusedTimeout)
             timedOut = false
           }
           .recover {
             case t: Throwable =>
               logging.error(this, s"Failed to resume container ${data.container.containerId}, error: $t")
               parent ! ResumeFailed(data)
               self ! ResumeFailed(data)
           }
   
         // always clean data in etcd regardless of success and failure
         dataManagementService ! UnregisterData(
           ContainerKeys.warmedContainers(
             data.invocationNamespace,
             data.action.fullyQualifiedName(false),
             data.revision,
             instance,
             data.container.containerId))
         goto(Running)
   ```
   
   2. Some Future-based code triggered by the FSM to talk to etcd fails and pipes the failure message back to the FSM, which is now in the `Running` state:
   
   ```
       case Event(t: FailureMessage, data: WarmData) =>
         logging.error(this, s"failed to init or run in state: $stateName caused by: $t")
         // Stop containerProxy and ActivationClientProxy both immediately,
         // and don't send unhealthy state message to the health manager, it's already sent.
         cleanUp(
           data.container,
           data.invocationNamespace,
           data.action.fullyQualifiedName(withVersion = true),
           data.action.rev,
           Some(data.clientProxy))
   ```
   
   This is evidenced by an additional log line I found, which lines up to the exact microsecond with the container removal in the originally shared logs:
   
   ```
   [2022-09-14T08:33:34.837Z] [ERROR] [#tid_sid_unknown] [FunctionPullingContainerProxy] failed to init or run in state: Running caused by: Failure(io.grpc.StatusRuntimeException: INTERNAL: HTTP/2 error code: PROTOCOL_ERRORReceived Goaway[p]req HEADERS: max concurrency reached)
   ```
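
   For context on why the failure gets handled by the `Running` state's handler at all: a failed Future that is piped to an actor arrives as an ordinary `akka.actor.Status.Failure` message, so it is matched against whatever state the FSM is in when the message is processed, not the state it was in when the Future was started. A standalone sketch of that general Akka behaviour (not the OpenWhisk call site; `PipeSketch` is made up, and I'm assuming the `FailureMessage` in the handler above corresponds to this `Status.Failure`):

   ```
   import akka.actor.{Actor, ActorSystem, Props, Status}
   import akka.pattern.pipe
   import scala.concurrent.Future

   class PipeSketch extends Actor {
     import context.dispatcher

     // stands in for the async etcd/gRPC call that fails after the state change
     Future.failed(new RuntimeException("etcd call failed")).pipeTo(self)

     def receive: Receive = {
       case Status.Failure(t) =>
         // delivered as a normal message, matched against the actor's current state
         println(s"received failure as a message: ${t.getMessage}")
         context.system.terminate()
     }
   }

   object PipeSketch {
     def main(args: Array[String]): Unit = {
       ActorSystem("pipe-sketch").actorOf(Props(new PipeSketch))
     }
   }
   ```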
   
   3. The original unpause and the cleanup are now running concurrently. The unpause succeeds, and its subsequent `.map` runs to rewrite the etcd keys. Meanwhile, since the state has already transitioned to `Running`, the cleanup skips its own unpause attempt and goes straight to `destroyContainer`:
   
   ```
     private def cleanUp(container: Container, clientProxy: Option[ActorRef], replacePrewarm: Boolean = true): State = {
   
       context.parent ! ContainerRemoved(replacePrewarm)
       val unpause = stateName match {
         case Paused => container.resume()(TransactionId.invokerNanny)
         case _      => Future.successful(())
       }
       unpause.andThen {
         case Success(_) => destroyContainer(container)
         case Failure(t) =>
           // docker may hang when trying to remove a paused container, so we shouldn't remove it
           logging.error(this, s"Failed to resume container ${container.containerId}, error: $t")
       }
       clientProxy match {
         case Some(clientProxy) => clientProxy ! StopClientProxy
         case None              => self ! ClientClosed
       }
       gotoIfNotThere(Removing)
     }
   ```
   
   4. The unpause operation successfully completes and rewrites the key to etcd after the container is destroyed or while it's in the process of being destroyed.
   
   So what's the best way to go about adding extra protection here? I think the easiest thing to do would be to stay in the `Paused` state until the unpause operation completes, and then, on completion, send a message within the `Paused` state indicating success or failure to decide what to do next (on success, go to `Running`).
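
   A rough sketch of what I mean (heavily simplified, not the actual proxy code; `SketchProxy`, `ResumeOk` and `ResumeNok` are made-up names): the FSM stays in `Paused` while `resume()` is in flight and only re-registers keys and transitions once it knows the outcome, so anything arriving in the meantime is still matched against the `Paused` state:

   ```
   import akka.actor.{ActorSystem, FSM, Props}
   import scala.concurrent.Future
   import scala.util.{Failure, Success}

   sealed trait ProxyState
   case object Paused extends ProxyState
   case object Running extends ProxyState
   case object Removing extends ProxyState

   case object Unpause               // stands in for Initialize(job)
   case object ResumeOk              // hypothetical internal messages
   case class ResumeNok(t: Throwable)

   class SketchProxy(resume: () => Future[Unit]) extends FSM[ProxyState, Unit] {
     import context.dispatcher

     startWith(Paused, ())

     when(Paused) {
       case Event(Unpause, _) =>
         // kick off the async unpause but do NOT goto(Running) yet
         resume().onComplete {
           case Success(_) => self ! ResumeOk
           case Failure(t) => self ! ResumeNok(t)
         }
         stay() // messages arriving now are matched against Paused, not Running

       case Event(ResumeOk, _) =>
         // only now is it safe to re-register the etcd key, notify the parent,
         // request a new activation, etc., and then move on
         goto(Running)

       case Event(ResumeNok(_), _) =>
         // resume failed: clean up; the container is still treated as paused
         goto(Removing)
     }

     when(Running)  { case Event(_, _) => stay() } // elided
     when(Removing) { case Event(_, _) => stay() } // elided

     initialize()
   }

   object SketchProxy {
     def main(args: Array[String]): Unit = {
       val system = ActorSystem("sketch")
       val proxy = system.actorOf(Props(new SketchProxy(() => Future.successful(()))))
       proxy ! Unpause
       Thread.sleep(500) // give the FSM time to process, then shut down
       system.terminate()
     }
   }
   ```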

