Posted to dev@brooklyn.apache.org by Aled Sage <al...@gmail.com> on 2014/11/20 00:54:09 UTC

Handling transient provisioning failures

Hi all,

I've spent today doing a lot of QA, deploying various apps repeatedly.
It went mostly very well, but there were a few failures.

One thing this highlights is the need to write our entities + blueprints 
to handle transient failures.

Three areas spring to mind.

_*VM Provisioning Failures*_
Clouds can fail to give us an ssh'able VM.

Setting `machineCreateAttempts` will tell Brooklyn to retry if a VM 
fails to be created or comes back dead-on-arrival (e.g. can't ssh).
This value currently defaults to 1 (i.e. if the first attempt fails, 
Brooklyn aborts).

Perhaps we should change the default to 2?
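
To make the retry semantics concrete, here is a minimal sketch of the 
behaviour described above. It is not Brooklyn's implementation: the 
function `obtain_machine`, the callbacks `create_vm`, `is_sshable` and 
`release_vm`, and the `ProvisioningError` type are all hypothetical names 
for illustration, and releasing a dead-on-arrival VM before retrying is 
an assumption.

```python
class ProvisioningError(Exception):
    """Raised when a VM cannot be created or is not reachable over ssh."""


def obtain_machine(create_vm, is_sshable, release_vm, machine_create_attempts=1):
    """Try up to machine_create_attempts times to obtain a usable VM.

    All three callbacks are hypothetical stand-ins for cloud operations.
    """
    if machine_create_attempts < 1:
        raise ValueError("machine_create_attempts must be at least 1")

    last_error = None
    for attempt in range(1, machine_create_attempts + 1):
        vm = None
        try:
            vm = create_vm()
            if is_sshable(vm):
                return vm
            last_error = ProvisioningError(f"attempt {attempt}: VM not ssh'able")
        except ProvisioningError as e:
            last_error = e
        if vm is not None:
            # Assumption: a dead-on-arrival VM is released before retrying,
            # so that failed attempts do not leak machines.
            release_vm(vm)

    raise ProvisioningError(
        f"gave up after {machine_create_attempts} attempt(s)"
    ) from last_error
```

With `machine_create_attempts=1` a single dead-on-arrival VM fails the 
whole deployment; with 2, one such failure is absorbed at the cost of a 
longer worst-case provisioning time.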


_*Cluster quorum size*_
When starting a cluster (e.g. 16 Cassandra nodes, or whatever), we can 
get some failures.
With the default configuration, any failures result in the cluster 
reporting itself as failed.

There is a configuration option, `cluster.initial.quorumSize`, which 
says the minimum number of initial nodes that must come up successfully 
for the cluster to be considered healthy.
E.g. a `cluster.initial.quorumSize` of 12 with a `cluster.initial.size` 
of 16 means that we'll accept a maximum of 4 failures on initial deployment.

Should we have a more lenient default (e.g. two thirds of the 
`cluster.initial.size`)?
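
The arithmetic behind the quorum setting can be sketched as follows. The 
function names are hypothetical, and rounding the two-thirds default up 
to the next whole node is an assumption, since the rounding rule is not 
specified above.

```python
def max_tolerated_failures(initial_size, quorum_size):
    """Number of initial nodes that may fail while still meeting the quorum."""
    if not 0 <= quorum_size <= initial_size:
        raise ValueError("quorum_size must be between 0 and initial_size")
    return initial_size - quorum_size


def two_thirds_quorum(initial_size):
    """Two thirds of initial_size, rounded up (the rounding is an assumption).

    Uses integer arithmetic to avoid floating-point rounding surprises.
    """
    return (2 * initial_size + 2) // 3


def cluster_is_healthy(nodes_up, quorum_size):
    """The cluster counts as healthy once at least quorum_size nodes are up."""
    return nodes_up >= quorum_size
```

For the 16-node example, a quorum of 12 tolerates 4 failures, whereas a 
two-thirds default rounded up gives a quorum of 11 and so tolerates 5.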


_*Command retries*_
Provisioning commands, e.g. ssh'ing to install software, sometimes fail.

For example, today I saw:
     Execution failed, invalid result -1 for installing CouchbaseNodeImpl
which most likely means there was an ssh connection failure while executing.

In situations like that, we should retry.

We should also retry by default on some other idempotent operations: 
installing, customizing and stopping are good contenders. Launching is 
harder: it's up to the implementer to explicitly enable retry, and only 
if the launch is written to be idempotent (otherwise a stop-then-start 
might be required in order to retry).
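
A minimal sketch of that policy, under stated assumptions: the names 
`run_phase`, `run_command` and `RETRYABLE_BY_DEFAULT` are hypothetical, 
and treating an exit code of -1 as "connection failure, safe to retry" 
is a guess based on the failure quoted above rather than a documented 
contract.

```python
# Phases assumed to be idempotent, and therefore retried by default.
RETRYABLE_BY_DEFAULT = {"install", "customize", "stop"}


def run_phase(phase, run_command, max_attempts=3, idempotent=None):
    """Run one provisioning phase, retrying on exit code -1 when that is safe.

    run_command is a hypothetical callable that executes the phase over ssh
    and returns its exit code. Pass idempotent=True to opt a phase such as
    "launch" into retries.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    if idempotent is None:
        idempotent = phase in RETRYABLE_BY_DEFAULT

    attempts = max_attempts if idempotent else 1
    exit_code = None
    for _ in range(attempts):
        exit_code = run_command()
        # Assumption: -1 signals an ssh connection failure. Any other result,
        # including a genuine non-zero exit, is final and is not retried.
        if exit_code != -1:
            break

    if exit_code != 0:
        raise RuntimeError(
            f"Execution failed, invalid result {exit_code} for {phase}"
        )
    return exit_code
```

A launch that is not idempotent would keep `idempotent` unset, so it 
gets a single attempt and the caller decides whether a stop-then-start 
is needed.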

Aled