Posted to commits@brooklyn.apache.org by he...@apache.org on 2021/08/31 12:58:13 UTC

[brooklyn-docs] branch master updated: add troubleshooting for startup and rebind issues

This is an automated email from the ASF dual-hosted git repository.

heneveld pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/brooklyn-docs.git


The following commit(s) were added to refs/heads/master by this push:
     new f4f52c1  add troubleshooting for startup and rebind issues
     new 634a88f  Merge branch 'master' of https://gitbox.apache.org/repos/asf/brooklyn-docs
f4f52c1 is described below

commit f4f52c12c6a9f4cdf788fc1ff87a35d84ac169ca
Author: Alex Heneveld <al...@cloudsoftcorp.com>
AuthorDate: Tue Aug 31 13:43:53 2021 +0100

    add troubleshooting for startup and rebind issues
---
 guide/ops/persistence/index.md                     | 34 +--------
 guide/ops/troubleshooting/fails-to-start.md        | 81 ++++++++++++++++++++++
 .../troubleshooting/going-deep-in-java-and-logs.md |  4 +-
 guide/ops/troubleshooting/index.md                 |  1 +
 4 files changed, 88 insertions(+), 32 deletions(-)

diff --git a/guide/ops/persistence/index.md b/guide/ops/persistence/index.md
index 5574a0c..231cfef 100644
--- a/guide/ops/persistence/index.md
+++ b/guide/ops/persistence/index.md
@@ -105,37 +105,9 @@ any registered policies.
 
 ## Handling Rebind Failures
 
-If rebind fails fail for any reason, details of the underlying failures will be reported 
-in the [`brooklyn.debug.log`](../paths.html). This will include the entities, locations or policies which caused an issue, and in what 
-way it failed. There are several approaches to resolving problems.
-
-1) Determine Underlying Cause
-
-Go through the log and identify the likely areas in the code from the error message.
-
-2) Seek Help
-
- Help can be found by contacting the Apache Brooklyn mailing list.
-
-3) Fix-up the State
-
-The state of each entity, location, policy and enricher is persisted in XML. 
-It is thus human readable and editable.
-
-After first taking a backup of the state, it is possible to modify the state. For example,
-an offending entity could be removed, or references to that entity removed, or its XML 
-could be fixed to remove the problem.
-
-
-4) Fixing with Groovy Scripts
-
-The final (powerful and dangerous!) tool is to execute Groovy code on the running Brooklyn 
-instance. If authorized, the REST API allows arbitrary Groovy scripts to be passed in and 
-executed. This allows the state of entities to be modified (and thus fixed) at runtime.
-
-If used, it is strongly recommended that Groovy scripts are run against a disconnected Brooklyn
-instance. After fixing the entities, locations and/or policies, the Brooklyn instance's 
-new persisted state can be copied and used to fix the production instance.
+It is possible to confuse Apache Brooklyn such that it is unable to rebind to previously persisted
+state after a restart or when running from a different instance.
+Detailed steps to troubleshoot and correct these situations can be found [here](../troubleshooting/fails-to-start.md).
 
 
 # Writing Persistable Code
diff --git a/guide/ops/troubleshooting/fails-to-start.md b/guide/ops/troubleshooting/fails-to-start.md
new file mode 100644
index 0000000..b4afa5c
--- /dev/null
+++ b/guide/ops/troubleshooting/fails-to-start.md
@@ -0,0 +1,81 @@
+---
+layout: website-normal
+title: "Brooklyn Fails to Start"
+toc: /guide/toc.json
+---
+
+If Apache Brooklyn does not start, or starts with errors, the problem is usually easy to resolve.
+The first place to look is the [logs](/guide/ops/logging.html):  `grep` for the first `ERROR`,
+and, if that is inconclusive, look backwards from it for the preceding `WARN` messages.
+
+There are a handful of common causes.
+
+## Memory
+
+If there is not enough memory available, either on the system or allocated to Brooklyn, it may fail to start or to load fully.
+This may manifest itself as the process being killed, e.g. if the OS does not have enough memory
+(there will usually be a message in the system log, e.g. `/var/log/syslog`),
+or as some modules failing to load with an `OutOfMemoryError` in the log.
+
+If either of these occurs, you can assign additional memory, if available on your system,
+by editing the files in `bin/`, for example by raising `JAVA_MAX_MEM` in `setenv` (or `setenv.bat` on Windows),
+or you can run Apache Brooklyn on a system with more memory.
+
+
+## Rebind Errors
+
+It is possible for the persisted state to become incompatible, such that Apache Brooklyn
+cannot load it. In this case Brooklyn fails fast so as not to corrupt the state further.
+In addition, a backup of the persistent state will be written to the `backups/` folder in
+the persistent state directory.
+
+The log files contain detailed information about what is unable to be loaded and why;
+some causes include:
+
+* A type that is deployed is no longer available, e.g. because a `SNAPSHOT` bundle containing
+  a type `X` was installed, `X` was used in an active deployment, and then the bundle
+  was either uninstalled or replaced by a new bundle with the same version (possible for `SNAPSHOT` or forced installs)
+  that did not contain the type in use (`X`)
+
+* A deployment did not correctly clean up and leaked resources;
+  this will happen only with Java entities or adjuncts that are incorrectly unmanaged
+
+* A dependency is unavailable, possibly because it was added via the `dropins/` folder or
+  is not installed in the Brooklyn instance being started
+
+There are some good practices which can help avoid these errors:
+
+* Avoid the use of `SNAPSHOT` bundles in production (and do not `force` install bundles)
+* If `SNAPSHOT` bundles are updated in an incompatible way in a dev environment (e.g. a blueprint name change),
+  take care to remove pre-existing incompatible deployments
+* When upgrading or restarting Brooklyn, it is recommended to start a second instance as hot-standby first:
+  this will flag any existing deployment which cannot be re-read on a clean start,
+  so that it can be removed from the primary Brooklyn
+
+If a rebind problem does occur, all is not lost.  There are several ways that recovery can be achieved:
+
+* Delete the incompatible persisted state item files indicated in the logs
+  (or simply delete all the persisted state in a dev environment)
+* Restore to a previous backup state (automatically written to the `backups/` folder with a datestamp)
+* Tell Brooklyn to ignore a certain number of rebind errors with settings in `brooklyn.cfg`:
+  * `rebind.failureMode.danglingRefs.minRequiredHealthy`: takes `QuorumCheck` syntax, consisting
+    of points on a line, e.g. `[[0,0],[10,5],[20,14]]` to allow up to 1 failure for every 2 items up to 10 items
+    (5 needed when 10 items are persisted, per the second point), then subsequently 1 failure for every additional 10 items deployed
+    (14 needed when 20 items are persisted, per the third point)
+  * `rebind.failureMode.rebind`: either `FAIL_FAST`, `FAIL_AT_END`, or `CONTINUE`, for how to treat serious rebind problems
+    (default `FAIL_AT_END`)
+  * Further options available as per the JavaDoc on `RebindManagerImpl` config keys
+* When Brooklyn is stopped, remove the persisted state; then restart in a pristine environment, install any missing bundles,
+  then import the offending persistent state via the UI (About) or REST API;
+  alternatively in some cases it may be possible to add additional/missing bundles via the `dropins/` folder of Karaf 
+  or using the `karaf` console (`bundle:install -s ...`)
+* If the broken persisted state is critical, it is possible to edit the persisted files:  they are simply an XML model of the items,
+  using a lot of unique identifiers designed so that references can be easily found using `grep`
+* Finally, if all else fails, open a support ticket:  there are a number of other advanced techniques available,
+  such as specifying that types should be automatically renamed or migrated by new bundles ([see the Persistence section here](../upgrades/)).
+
+It may also be useful to review the sections on [Persistence](../persistence/) and [HA](../high-availability/).
+
+
+
+
diff --git a/guide/ops/troubleshooting/going-deep-in-java-and-logs.md b/guide/ops/troubleshooting/going-deep-in-java-and-logs.md
index 581a27b..3c072b4 100644
--- a/guide/ops/troubleshooting/going-deep-in-java-and-logs.md
+++ b/guide/ops/troubleshooting/going-deep-in-java-and-logs.md
@@ -475,4 +475,6 @@ SEVERE: Cannot start server. Server instance is not configured.
 
 {% endhighlight %}
 
-As expected, we can see here that the `unmatched-element` element has not been terminated in the `server.xml` file
+As expected, we can see here that the `unmatched-element` element has not been terminated in the `server.xml` file.
+
+
diff --git a/guide/ops/troubleshooting/index.md b/guide/ops/troubleshooting/index.md
index 331e267..7909648 100644
--- a/guide/ops/troubleshooting/index.md
+++ b/guide/ops/troubleshooting/index.md
@@ -3,6 +3,7 @@ title: Troubleshooting
 layout: website-normal
 children:
 - { path: overview.md, title: Overview }
+- { path: fails-to-start.md }
 - { path: web-console-issues.md, title: Web Console Issues }
 - { path: deployment.md, title: Deployment }
 - { path: connectivity.md, title: Server Connectivity }
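
Note on the `QuorumCheck` syntax described in fails-to-start.md above: the "points on a line" wording suggests linear interpolation between the listed points. The following is a minimal Python sketch of that arithmetic only. It is not Brooklyn's implementation, and the function name `min_required_healthy` is hypothetical. The text does not say how fractional values are rounded or how item counts beyond the last point are treated, so the sketch returns the unrounded value and rejects counts outside the listed range.

```python
import json


def min_required_healthy(points_spec, total):
    """Interpolate linearly between the [size, required] points.

    points_spec: a string such as "[[0,0],[10,5],[20,14]]"
    total:       number of persisted items
    Returns the (unrounded) number of healthy items required.
    """
    points = sorted((float(x), float(y)) for x, y in json.loads(points_spec))
    # Assumption: counts outside the listed points are rejected rather than extrapolated.
    if not points or total < points[0][0] or total > points[-1][0]:
        raise ValueError("total is outside the range covered by the points")
    # Find the segment containing `total` and interpolate along it.
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x1 > x0 and x0 <= total <= x1:
            return y0 + (y1 - y0) * (total - x0) / (x1 - x0)
    # Only reached when no segment has positive width (e.g. a single point).
    return points[0][1]
```

With the example points from the text, `min_required_healthy("[[0,0],[10,5],[20,14]]", 10)` gives 5.0 and a total of 20 gives 14.0, matching the figures quoted for the second and third points.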