Posted to dev@brooklyn.apache.org by Aled Sage <al...@gmail.com> on 2014/11/20 00:54:09 UTC
Handling transient provisioning failures
Hi all,
I've spent today doing a lot of QA, deploying various apps repeatedly.
It went mostly very well, but there were a few failures.
One thing this highlights is the need to write our entities + blueprints
to handle transient failures.
Three areas spring to mind.
_*VM Provisioning Failures*_
Clouds can fail to give us an ssh'able VM.
Setting `machineCreateAttempts` will tell Brooklyn to retry if a VM
fails to be created or comes back dead-on-arrival (e.g. can't ssh).
This value currently defaults to 1 (i.e. if the first attempt fails,
Brooklyn aborts).
Perhaps we should change the default to 2?
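
To make the retry behaviour concrete, here is a rough sketch of what
"retry on failed or dead-on-arrival VM" could look like. This is not
Brooklyn's implementation: `obtain_machine`, `create_vm`, `is_sshable`,
`release_vm` and `ProvisioningError` are all hypothetical names, and
releasing a dead-on-arrival VM before retrying is an assumption.

```python
# Hypothetical sketch only: none of these names are Brooklyn APIs.
class ProvisioningError(Exception):
    """Raised when a VM cannot be created or comes back dead-on-arrival."""


def obtain_machine(create_vm, is_sshable, release_vm, machine_create_attempts=2):
    """Try up to machine_create_attempts times to obtain an ssh'able VM."""
    last_error = None
    for attempt in range(1, machine_create_attempts + 1):
        try:
            vm = create_vm()
        except ProvisioningError as e:
            # The cloud failed to create the VM at all; try again.
            last_error = e
            continue
        if is_sshable(vm):
            return vm
        # Dead-on-arrival: release it (assumed) so it is not leaked, then retry.
        release_vm(vm)
        last_error = ProvisioningError(f"attempt {attempt}: VM {vm!r} not ssh'able")
    raise ProvisioningError(
        f"gave up after {machine_create_attempts} attempts"
    ) from last_error
```

With `machine_create_attempts=1` this reproduces the current default
(abort on first failure); with 2, a single bad VM is tolerated.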
_*Cluster quorum size*_
When starting a cluster (e.g. 16 Cassandra nodes, or whatever), we can
get some failures.
With the default configuration, any failures result in the cluster
reporting itself as failed.
There is a configuration option, `cluster.initial.quorumSize`, which
sets the minimum number of initial nodes that must come up successfully
for the cluster to be considered healthy.
For example, a `cluster.initial.quorumSize` of 12 with a
`cluster.initial.size` of 16 means that we'll accept at most 4 failures
on initial deployment.
Should we have a more lenient default (e.g. two thirds of the
cluster.initial.size)?
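
As a worked example of what a two-thirds default would mean, here is a
small sketch. The function names are hypothetical, and rounding the
quorum up (rather than down) is my assumption; nothing above specifies
which way it should round.

```python
def two_thirds_quorum(initial_size):
    """Smallest integer >= 2/3 of initial_size (integer maths, rounds up)."""
    return (2 * initial_size + 2) // 3


def max_tolerated_failures(initial_size, quorum_size):
    """How many initial nodes may fail while the cluster still counts as healthy."""
    return initial_size - quorum_size


def cluster_is_healthy(nodes_started, quorum_size):
    """The cluster is healthy if at least quorum_size nodes came up."""
    return nodes_started >= quorum_size
```

For a 16-node cluster, `two_thirds_quorum(16)` gives 11, so up to 5
failures would be tolerated, compared with 4 for the explicit quorum of
12 in the example above.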
_*Command retries*_
Provisioning commands, e.g. ssh'ing to install software, sometimes fail.
For example, today I saw:
    Execution failed, invalid result -1 for installing CouchbaseNodeImpl
which most likely means there was an ssh connection failure while executing.
In situations like that, we should retry.
We should also retry by default on some other idempotent operations:
installing, customizing and stopping are good contenders. Launching is
harder: it should be up to the implementer to explicitly enable retry,
and only if launch is written to be idempotent (otherwise a
stop-then-start might be required in order to retry).
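
A minimal sketch of that policy follows. It is illustrative only:
`run_phase`, `RETRYABLE_BY_DEFAULT` and the `idempotent_launch` flag are
hypothetical names, and treating an exit code of -1 as "connection
failure, safe to retry" is an assumption based on the error above.

```python
import time

# Phases suggested above as safe to retry by default; launch is opt-in.
RETRYABLE_BY_DEFAULT = {"install", "customize", "stop"}


def run_phase(phase, command, max_attempts=3, delay_seconds=0.0,
              idempotent_launch=False):
    """Run command() (which returns an exit code), retrying on -1 if allowed.

    An exit code of -1 is assumed to mean the ssh connection failed, so the
    command may not have run. Any other code is returned without retrying.
    """
    retryable = phase in RETRYABLE_BY_DEFAULT or (
        phase == "launch" and idempotent_launch
    )
    attempts = max_attempts if retryable else 1
    for attempt in range(1, attempts + 1):
        exit_code = command()
        if exit_code != -1:
            # A real result (success or genuine failure): do not retry.
            return exit_code
        if attempt < attempts:
            time.sleep(delay_seconds)
    return -1
```

A non-idempotent launch still gets exactly one attempt here; the
stop-then-start fallback is deliberately left out of the sketch.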
Aled