Posted to issues@openwhisk.apache.org by GitBox <gi...@apache.org> on 2019/12/04 18:03:43 UTC

[GitHub] [openwhisk] tysonnorris opened a new issue #4762: restarting a single controller instance results in extra errors in concurrency handling

URL: https://github.com/apache/openwhisk/issues/4762
 
 
   
   ## Environment details:
   
   * Mesos (but this likely happens in any multi-controller environment)
   * actions that support concurrency > 1
   
   ## Steps to reproduce the issue:
   
   1.   Run a load test. It is unclear how much load is required, but this test ran ~6000 users at ~1k rps.
   2.   While the test is running, restart a single controller.
   
   
   ## Provide the expected results and outputs:
   
   Some errors are expected, related to in-flight requests at the controller that was killed.
   
   
   ## Provide the actual results and outputs:
   
   There is possibly an issue with graceful shutdown of `NestedSemaphore` or related code; see the logged errors below.
   
   Separately, when a replacement controller comes back online, it begins serving traffic immediately, even though its invoker states are all still set to Offline or Unhealthy, resulting in extra errors.
   
   
   ```
   [2019-12-04T17:40:55.689Z] [INFO] [#tid_oXbBU0OYYp1bB5CyEhGe7jUb0KrRpivc] [BasicHttpService] [marker:http_get.200_counter:5010:5010]
   [2019-12-04T17:40:55.694Z] [INFO] [#tid_YaUx7tDZOJY5aubZq5lC4GfZQo7TvQzw] [ShardingContainerPoolBalancer] received completion ack for '836225a9509e42e1a225a9509e02e136', system error=false
   [2019-12-04T17:40:55.694Z] [INFO] [#tid_YaUx7tDZOJY5aubZq5lC4GfZQo7TvQzw] [ShardingContainerPoolBalancer] received completion ack for '836225a9509e42e1a225a9509e02e136'
   [2019-12-04T17:40:55.694Z] [INFO] [#tid_YaUx7tDZOJY5aubZq5lC4GfZQo7TvQzw] [ShardingContainerPoolBalancer] received result ack for '836225a9509e42e1a225a9509e02e136'
   [2019-12-04T17:40:55.694Z] [INFO] [#tid_YaUx7tDZOJY5aubZq5lC4GfZQo7TvQzw] [WebActionsApi]  [marker:controller_blockingActivation_finish:5009:5009]
   [2019-12-04T17:40:55.694Z] [INFO] [#tid_YaUx7tDZOJY5aubZq5lC4GfZQo7TvQzw] [BasicHttpService] [marker:http_get.200_counter:5009:5009]
   [2019-12-04T17:40:55.704Z] [INFO] [#tid_XKy3TepK8GP7tggjfBgpzdjex9yahxwE] [ShardingContainerPoolBalancer] received completion ack for '539e835782fb4da49e835782fb6da4c9', system error=false
   [2019-12-04T17:40:55.704Z] [INFO] [#tid_XKy3TepK8GP7tggjfBgpzdjex9yahxwE] [ShardingContainerPoolBalancer] received completion ack for '539e835782fb4da49e835782fb6da4c9'
   [2019-12-04T17:40:55.704Z] [INFO] [#tid_XKy3TepK8GP7tggjfBgpzdjex9yahxwE] [ShardingContainerPoolBalancer] received result ack for '539e835782fb4da49e835782fb6da4c9'
   [2019-12-04T17:40:55.704Z] [INFO] [#tid_XKy3TepK8GP7tggjfBgpzdjex9yahxwE] [WebActionsApi]  [marker:controller_blockingActivation_finish:5012:5012]
   [2019-12-04T17:40:55.704Z] [INFO] [#tid_XKy3TepK8GP7tggjfBgpzdjex9yahxwE] [BasicHttpService] [marker:http_get.200_counter:5012:5012]
   [2019-12-04T17:40:55.704Z] [INFO] Cluster Node [akka.tcp://controller-actor-system@10.66.22.81:1105] - Exiting completed
   [2019-12-04T17:40:55.708Z] [INFO] Cluster Node [akka.tcp://controller-actor-system@10.66.22.81:1105] - Shutting down...
   [2019-12-04T17:40:55.709Z] [INFO] Cluster Node [akka.tcp://controller-actor-system@10.66.22.81:1105] - Successfully shut down
   [2019-12-04T17:40:55.710Z] [INFO] [#tid_sid_loadbalancer] [ShardingContainerPoolBalancerState] loadbalancer cluster size changed from 6 to 5 active nodes. 12 invokers with 30000000 MB average memory size - total invoker memory 360000000 MB.
   [2019-12-04T17:40:55.710Z] [INFO] [#tid_sid_loadbalancer] [ShardingContainerPoolBalancerState] loadbalancer cluster size changed from 5 to 4 active nodes. 12 invokers with 37500000 MB average memory size - total invoker memory 450000000 MB.
   [2019-12-04T17:40:55.710Z] [INFO] [#tid_sid_loadbalancer] [ShardingContainerPoolBalancerState] loadbalancer cluster size changed from 4 to 3 active nodes. 12 invokers with 50000000 MB average memory size - total invoker memory 600000000 MB.
   [2019-12-04T17:40:55.710Z] [INFO] [#tid_sid_loadbalancer] [ShardingContainerPoolBalancerState] loadbalancer cluster size changed from 3 to 2 active nodes. 12 invokers with 75000000 MB average memory size - total invoker memory 900000000 MB.
   [2019-12-04T17:40:55.710Z] [INFO] Message [akka.cluster.GossipEnvelope] from Actor[akka.tcp://controller-actor-system@10.66.22.93:30204/system/cluster/core/daemon#-960490949] to Actor[akka://controller-actor-system/system/cluster/core/daemon#-221480326] was not delivered. [2] dead letters encountered. If this is not an expected behavior, then [Actor[akka://controller-actor-system/system/cluster/core/daemon#-221480326]] may have terminated unexpectedly, This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
   [2019-12-04T17:40:55.710Z] [INFO] [#tid_sid_loadbalancer] [ShardingContainerPoolBalancerState] loadbalancer cluster size changed from 2 to 1 active nodes. 12 invokers with 150000000 MB average memory size - total invoker memory 1800000000 MB.
   [2019-12-04T17:40:55.711Z] [INFO] Message [akka.cluster.GossipEnvelope] from Actor[akka.tcp://controller-actor-system@10.66.22.159:11643/system/cluster/core/daemon#-790933116] to Actor[akka://controller-actor-system/system/cluster/core/daemon#-221480326] was not delivered. [3] dead letters encountered. If this is not an expected behavior, then [Actor[akka://controller-actor-system/system/cluster/core/daemon#-221480326]] may have terminated unexpectedly, This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
   [2019-12-04T17:40:55.714Z] [INFO] [#tid_sid_unknown] [Controller] Shutting down Kamon with coordinated shutdown
   [2019-12-04T17:40:55.725Z] [INFO] Message [org.apache.openwhisk.core.connector.MessageFeed$FillCompleted] from Actor[akka://controller-actor-system/user/$e#-1558842808] to Actor[akka://controller-actor-system/user/$e#-1558842808] was not delivered. [4] dead letters encountered. If this is not an expected behavior, then [Actor[akka://controller-actor-system/user/$e#-1558842808]] may have terminated unexpectedly, This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
   [2019-12-04T17:40:55.727Z] [INFO] Shutting down remote daemon.
   [2019-12-04T17:40:55.728Z] [INFO] Remote daemon shut down; proceeding with flushing remote transports.
   [2019-12-04T17:40:55.745Z] [INFO] Remoting shut down
   [2019-12-04T17:40:55.745Z] [INFO] Remoting shut down.
   [ERROR] [12/04/2019 17:40:55.768] [controller-actor-system-akka.actor.default-dispatcher-96] [akka.actor.ActorSystemImpl(controller-actor-system)] exception while executing timer task
   java.util.NoSuchElementException
   	at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
   	at org.apache.openwhisk.common.NestedSemaphore.releaseConcurrent(NestedSemaphore.scala:103)
   	at org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer.$anonfun$releaseInvoker$1(ShardingContainerPoolBalancer.scala:329)
   	at org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer.$anonfun$releaseInvoker$1$adapted(ShardingContainerPoolBalancer.scala:329)
   	at scala.Option.foreach(Option.scala:407)
   	at org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer.releaseInvoker(ShardingContainerPoolBalancer.scala:329)
   	at org.apache.openwhisk.core.loadBalancer.CommonLoadBalancer.processCompletion(CommonLoadBalancer.scala:285)
   	at org.apache.openwhisk.core.loadBalancer.CommonLoadBalancer.$anonfun$setupActivation$4(CommonLoadBalancer.scala:150)
   	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:202)
   	at akka.actor.LightArrayRevolverScheduler$TaskHolder.run(LightArrayRevolverScheduler.scala:343)
   	at akka.actor.LightArrayRevolverScheduler.$anonfun$close$1(LightArrayRevolverScheduler.scala:145)
   	at akka.actor.LightArrayRevolverScheduler.$anonfun$close$1$adapted(LightArrayRevolverScheduler.scala:144)
   	at scala.collection.Iterator.foreach(Iterator.scala:941)
   	at scala.collection.Iterator.foreach$(Iterator.scala:941)
   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
   	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
   	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
   	at akka.actor.LightArrayRevolverScheduler.close(LightArrayRevolverScheduler.scala:144)
   	at akka.actor.ActorSystemImpl.stopScheduler(ActorSystem.scala:980)
   	at akka.actor.ActorSystemImpl.$anonfun$_start$1(ActorSystem.scala:910)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at akka.actor.ActorSystemImpl$$anon$2.run(ActorSystem.scala:931)
   	at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$addRec$1$1.applyOrElse(ActorSystem.scala:1118)
   	at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$addRec$1$1.applyOrElse(ActorSystem.scala:1118)
   	at scala.concurrent.Future.$anonfun$andThen$1(Future.scala:536)
   	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
   	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
   	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
   	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
   	at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:92)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
   	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:92)
   	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
   	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
   	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
   [ERROR] [12/04/2019 17:40:55.775] [controller-actor-system-akka.actor.default-dispatcher-96] [akka.actor.ActorSystemImpl(controller-actor-system)] exception while executing timer task
   java.util.NoSuchElementException
   	at scala.collection.concurrent.TrieMap.apply(TrieMap.scala:833)
   	at org.apache.openwhisk.common.NestedSemaphore.releaseConcurrent(NestedSemaphore.scala:103)
   	at org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer.$anonfun$releaseInvoker$1(ShardingContainerPoolBalancer.scala:329)
   	at org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer.$anonfun$releaseInvoker$1$adapted(ShardingContainerPoolBalancer.scala:329)
   	at scala.Option.foreach(Option.scala:407)
   	at org.apache.openwhisk.core.loadBalancer.ShardingContainerPoolBalancer.releaseInvoker(ShardingContainerPoolBalancer.scala:329)
   	at org.apache.openwhisk.core.loadBalancer.CommonLoadBalancer.processCompletion(CommonLoadBalancer.scala:285)
   	at org.apache.openwhisk.core.loadBalancer.CommonLoadBalancer.$anonfun$setupActivation$4(CommonLoadBalancer.scala:150)
   	at akka.actor.Scheduler$$anon$4.run(Scheduler.scala:202)
   	at akka.actor.LightArrayRevolverScheduler$TaskHolder.run(LightArrayRevolverScheduler.scala:343)
   	at akka.actor.LightArrayRevolverScheduler.$anonfun$close$1(LightArrayRevolverScheduler.scala:145)
   	at akka.actor.LightArrayRevolverScheduler.$anonfun$close$1$adapted(LightArrayRevolverScheduler.scala:144)
   	at scala.collection.Iterator.foreach(Iterator.scala:941)
   	at scala.collection.Iterator.foreach$(Iterator.scala:941)
   	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
   	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
   	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
   	at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
   	at akka.actor.LightArrayRevolverScheduler.close(LightArrayRevolverScheduler.scala:144)
   	at akka.actor.ActorSystemImpl.stopScheduler(ActorSystem.scala:980)
   	at akka.actor.ActorSystemImpl.$anonfun$_start$1(ActorSystem.scala:910)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at akka.actor.ActorSystemImpl$$anon$2.run(ActorSystem.scala:931)
   	at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$addRec$1$1.applyOrElse(ActorSystem.scala:1118)
   	at akka.actor.ActorSystemImpl$TerminationCallbacks$$anonfun$addRec$1$1.applyOrElse(ActorSystem.scala:1118)
   	at scala.concurrent.Future.$anonfun$andThen$1(Future.scala:536)
   	at scala.concurrent.impl.Promise.liftedTree1$1(Promise.scala:33)
   	at scala.concurrent.impl.Promise.$anonfun$transform$1(Promise.scala:33)
   	at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:64)
   	at akka.dispatch.BatchingExecutor$AbstractBatch.processBatch(BatchingExecutor.scala:55)
   	at akka.dispatch.BatchingExecutor$BlockableBatch.$anonfun$run$1(BatchingExecutor.scala:92)
   	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
   	at scala.concurrent.BlockContext$.withBlockContext(BlockContext.scala:85)
   	at akka.dispatch.BatchingExecutor$BlockableBatch.run(BatchingExecutor.scala:92)
   	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
   	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(ForkJoinExecutorConfigurator.scala:49)
   	at akka.dispatch.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
   	at akka.dispatch.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
   	at akka.dispatch.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
   	at akka.dispatch.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
   ```
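   The `NoSuchElementException` in the stack traces above originates in `TrieMap.apply` on a key that no longer exists, suggesting a completion ack racing with shutdown-time cleanup of per-action state. A minimal sketch of that failure mode (map and key names are illustrative, not the actual `NestedSemaphore` internals):
   
   ```scala
   import scala.collection.concurrent.TrieMap
   
   object TrieMapApplyDemo extends App {
     // Hypothetical per-action slot tracking, stand-in for NestedSemaphore state.
     val actionSlots = TrieMap[String, Int]("myAction" -> 3)
   
     // During shutdown, a release can race with cleanup that removes the entry:
     actionSlots.remove("myAction")
   
     // TrieMap.apply throws NoSuchElementException on a missing key,
     // which matches the stack trace above:
     try actionSlots("myAction")
     catch { case _: NoSuchElementException => println("apply threw") }
   
     // A defensive release would use get/foreach and treat a missing
     // entry as already cleaned up:
     actionSlots.get("myAction").foreach(slots => println(s"releasing $slots"))
   }
   ```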
   
   
   
   ## Additional information you deem important:
   * The primary issue here is that the controller is put into use prematurely, before some number of invokers are Healthy. This could be handled via deployment healthchecks, but ideally we could expose a configurable `/health` endpoint indicating whether enough invokers are healthy. I know invoker states are available at the `/invokers` endpoint, but in its current form it applies no logic for deciding healthiness.
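   The readiness logic such an endpoint would need can be sketched as follows (type and method names are hypothetical, not the actual OpenWhisk API):
   
   ```scala
   // Report healthy only once a minimum number of invokers are Healthy,
   // so an orchestrator (Mesos, Kubernetes) delays routing traffic to a
   // restarted controller until it has probed enough invokers.
   sealed trait InvokerState
   case object Healthy   extends InvokerState
   case object Unhealthy extends InvokerState
   case object Offline   extends InvokerState
   
   def ready(states: Seq[InvokerState], minHealthy: Int): Boolean =
     states.count(_ == Healthy) >= minHealthy
   
   // A /health endpoint would return 200 only when ready(...) is true,
   // e.g. ready(Seq(Healthy, Healthy, Offline), minHealthy = 2) is true,
   // while a freshly restarted controller with all invokers Offline is not.
   ```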
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services