Posted to issues@openwhisk.apache.org by GitBox <gi...@apache.org> on 2018/08/07 05:10:36 UTC

[GitHub] tysonnorris opened a new issue #3948: ActorSystem prematurely terminates; cluster nodes cannot leave gracefully
URL: https://github.com/apache/incubator-openwhisk/issues/3948
 
 
   
   ## Environment details:
   
   * local deployment, Mac OS
   
   ## Steps to reproduce the issue:
   
   1.   run controller locally
   2.   stop controller
   3.   check logs
   
   
   ## Provide the expected results and outputs:
   You should see the stages of a graceful cluster shutdown, which allow a terminating node (and the cluster leader) to remain in a usable state. The node should go through the states:
   * Leaving
   * Exiting
   * Exiting completed
   * Shutting down
   * Shut down
   
   ```
   [2018-08-07T04:46:23.981Z] [INFO] Cluster Node [akka.tcp://controller-actor-system@172.17.0.1:8000] - Marked address [akka.tcp://controller-actor-system@172.17.0.1:8000] as [Leaving]
   [2018-08-07T04:46:24.307Z] [INFO] Cluster Node [akka.tcp://controller-actor-system@172.17.0.1:8000] - Exiting (leader), starting coordinated shutdown
   [2018-08-07T04:46:24.309Z] [INFO] Cluster Node [akka.tcp://controller-actor-system@172.17.0.1:8000] - Leader is moving node [akka.tcp://controller-actor-system@172.17.0.1:8000] to [Exiting]
   [2018-08-07T04:46:24.310Z] [INFO] Cluster Node [akka.tcp://controller-actor-system@172.17.0.1:8000] - Exiting completed
   [2018-08-07T04:46:24.313Z] [INFO] Cluster Node [akka.tcp://controller-actor-system@172.17.0.1:8000] - Shutting down...
   [2018-08-07T04:46:24.318Z] [INFO] Cluster Node [akka.tcp://controller-actor-system@172.17.0.1:8000] - Successfully shut down
   [2018-08-07T04:46:24.328Z] [INFO] [#tid_sid_unknown] [Controller] Shutting down Kamon with coordinated shutdown
   [2018-08-07T04:46:24.340Z] [INFO] Shutting down remote daemon.
   [2018-08-07T04:46:24.345Z] [INFO] Remote daemon shut down; proceeding with flushing remote transports.
   [2018-08-07T04:46:24.505Z] [INFO] Remoting shut down
   [2018-08-07T04:46:24.506Z] [INFO] Remoting shut down.
   ```
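   For reference, the graceful sequence above is what Akka's `CoordinatedShutdown` produces when it is allowed to drive the termination itself. A minimal sketch (illustrative only, not the actual controller code; requires the `akka-actor` and `akka-cluster` dependencies):

   ```scala
   import akka.actor.{ActorSystem, CoordinatedShutdown}
   import akka.cluster.Cluster

   object GracefulStop {
     def main(args: Array[String]): Unit = {
       val system  = ActorSystem("controller-actor-system")
       val cluster = Cluster(system)

       // Triggering CoordinatedShutdown (rather than calling system.terminate()
       // directly) runs the cluster-exiting phases first: Leaving -> Exiting ->
       // Exiting completed -> Shutting down -> Shut down. Only then does it stop
       // remoting and the actor system, because `terminate-actor-system`
       // defaults to `on`.
       CoordinatedShutdown(system).run(CoordinatedShutdown.JvmExitReason)
     }
   }
   ```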
   
   
   ## Provide the actual results and outputs:
   
   Notice that you only get a `Marked address ... as [Leaving]` entry, followed by an abrupt shutdown:
   ```
   [2018-08-07T04:19:18.894Z] [INFO] Cluster Node [akka.tcp://controller-actor-system@172.17.0.1:8000] - Marked address [akka.tcp://controller-actor-system@172.17.0.1:8000] as [Leaving]
   [2018-08-07T04:19:18.911Z] [INFO] Shutting down remote daemon.
   [2018-08-07T04:19:18.917Z] [INFO] Remote daemon shut down; proceeding with flushing remote transports.
   [2018-08-07T04:19:18.997Z] [INFO] Remoting shut down
   [2018-08-07T04:19:18.997Z] [INFO] Remoting shut down.
   [WARN] [08/07/2018 04:19:19.028] [controller-actor-system-akka.actor.default-dispatcher-4] [CoordinatedShutdown(akka://controller-actor-system)] Task [exiting-completed] failed in phase [cluster-exiting-done]: Recipient[Actor[akka://controller-actor-system/system/cluster/core/daemon#-278779491]] had already been terminated. Sender[Actor[akka://controller-actor-system/system/cluster/core/daemon#-278779491]] sent the message of type "akka.cluster.InternalClusterAction$ExitingCompleted$".
   ```
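   The warning above is consistent with the `ActorSystem` being terminated directly while `CoordinatedShutdown` is still running its cluster-exiting phases. A minimal sketch of the suspected anti-pattern (illustrative only, not the actual controller code):

   ```scala
   import akka.actor.ActorSystem
   import akka.cluster.Cluster

   object AbruptStop {
     def main(args: Array[String]): Unit = {
       val system  = ActorSystem("controller-actor-system")
       val cluster = Cluster(system)

       sys.addShutdownHook {
         cluster.leave(cluster.selfAddress) // node is marked [Leaving]...
         // ...but terminating the system directly races with
         // CoordinatedShutdown, so the cluster daemon actor is already gone
         // when the exiting-completed task runs, producing the warning above.
         system.terminate()
       }
     }
   }
   ```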
   
   ## Additional information you deem important:
   * In cases where the termination is too abrupt, the remaining cluster nodes will keep trying to reach this node, with logs like:
   ```
   [2018-08-07T05:05:27.133Z] [WARN] Association with remote system [akka.tcp://controller-actor-system@10.66.20.90:18578] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://controller-actor-system@10.66.20.90:18578]] Caused by: [Connection refused: /10.66.20.90:18578]
   ```
   * This is how I noticed the issue, while testing controller cluster resizing, although the problem is more general than resizing.
   * In the clustering case, the terminated node will need to be downed automatically (or manually).
   * CoordinatedShutdown has a `terminate-actor-system` setting that [defaults](https://doc.akka.io/docs/akka/2.5/general/configuration.html) to `on`, which terminates the actor system itself - we should not do it explicitly, AFAIK.
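   For reference, the relevant setting looks like this in Akka 2.5's configuration (shown only to illustrate the default, not as a suggested override):

   ```
   akka.coordinated-shutdown {
     # When on (the default), the last shutdown phase terminates the
     # ActorSystem, so application code should not call system.terminate()
     # itself.
     terminate-actor-system = on
   }
   ```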
