Posted to dev@whirr.apache.org by "Tibor Kiss (JIRA)" <ji...@apache.org> on 2011/01/09 22:58:45 UTC

[jira] Updated: (WHIRR-167) Improve bootstrapping and configuration to be able to isolate and repair or evict failing nodes on EC2

     [ https://issues.apache.org/jira/browse/WHIRR-167?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tibor Kiss updated WHIRR-167:
-----------------------------

    Attachment: whirr-167-1.patch

I attached whirr-167-1.patch. I'm sure it is not the final one, but I would like to hear your opinions too.

I changed ClusterSpec and InstanceTemplate so that a minimum percentage of successfully started nodes can be specified per template.
If nothing is specified, the minimum is 100%, so a value like
whirr.instance-templates=1 jt+nn,4 dn+tt%60
means that the "jt+nn" template passes only when 100% of its nodes start successfully and
the "dn+tt" template passes when at least 60% of its nodes start successfully.

If any of the templates does not meet its minimum requirement, a retry phase is initiated in which the failed nodes of each template are replaced with new ones. That means even a namenode startup problem would not leave us with a completely lost cluster.
Without any retries, a namenode failure would break an entire cluster even though many dn+tt nodes had started successfully. I think it is worth minimizing the chance of failing in this way, so I introduced a retry cycle.
If there are failures in dn+tt only, and the minimum limit is still met, the cluster starts up with just that number of nodes and no retry is performed.
A retry cycle gives both templates a chance to bring the number of nodes back up to the maximum value.
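Roughly, the retry round I have in mind looks like this (just a sketch with assumed helper names such as goodNodes() and startNodes(), not the actual patch code):

    // one retry round: request replacements for whatever is still missing per template
    for (InstanceTemplate template : clusterSpec.getInstanceTemplates()) {
      int missing = template.getNumberOfInstances() - goodNodes(template).size();
      if (missing > 0) {
        startNodes(template, missing);   // ask the cloud only for the replacement nodes
      }
    }
    // afterwards each template is checked against its minimum again;
    // a template still below its minimum fails the launch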

At the moment I don't think more than one retry is worth it. The target is just to work around a few sporadic service problems.
My question is: should we always do one retry in case of insufficient nodes, or should the default be no retry, with an extra parameter to enable it? Initially I don't like the idea of adding more parameters.

About failed nodes, there are 2 different cases:
1. If the minimum required number of nodes cannot be satisfied even by a retry cycle, all of the lost nodes are left as they are. A full cluster destroy will remove them.
2. If the required number of nodes is satisfied, either by the first round or by a retry, all of the failed nodes (from the first round and from the retry cycle) are destroyed automatically at the end of BootstrapClusterAction.doAction.
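In other words, the cleanup decision at the end of doAction is roughly the following (hypothetical helper names, shown only to illustrate the two cases):

    if (allTemplatesMeetMinimum(goodNodesByTemplate)) {
      // case 2: the cluster is usable, so terminate every node that failed to start
      destroyFailedNodes(failedNodesFirstRound, failedNodesRetry);
    } else {
      // case 1: leave the lost nodes alone; a full cluster destroy removes them later
      throw new IOException("Too many instance failures, cluster could not be started");
    }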

I experienced some difficulties in destroying the nodes. Initially I used the destroyNodesMatching(Predicate&lt;NodeMetadata&gt; filter) method, which terminates all of the enumerated nodes in parallel. But this method also tries to delete the security group and the placement group. So I had to use the simple destroyNode(String id) instead, which deletes the nodes sequentially, and I cannot control the KeyPair deletion. In my opinion the jclouds library is missing some convenient methods to revoke a set of nodes without optionally propagating the KeyPair, SecurityGroup and PlacementGroup cleanup. Effectively I got stuck here and could not find an elegant solution for the revoke process.
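To make the two options concrete, this is roughly how they are called (failedNodeIds is a hypothetical set with the ids of the nodes to revoke; imports shown for clarity):

    import com.google.common.base.Predicate;
    import org.jclouds.compute.domain.NodeMetadata;

    // option 1: destroys the matching nodes in parallel, but also tries to clean up
    // the security group and placement group shared with the surviving nodes
    computeService.destroyNodesMatching(new Predicate<NodeMetadata>() {
      public boolean apply(NodeMetadata node) {
        return failedNodeIds.contains(node.getId());
      }
    });

    // option 2: one call per node, sequential, and the KeyPair deletion
    // still cannot be controlled from here
    for (String nodeId : failedNodeIds) {
      computeService.destroyNode(nodeId);
    }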

> Improve bootstrapping and configuration to be able to isolate and repair or evict failing nodes on EC2
> ------------------------------------------------------------------------------------------------------
>
>                 Key: WHIRR-167
>                 URL: https://issues.apache.org/jira/browse/WHIRR-167
>             Project: Whirr
>          Issue Type: Improvement
>         Environment: Amazon EC2
>            Reporter: Tibor Kiss
>            Assignee: Tibor Kiss
>         Attachments: whirr-167-1.patch, whirr.log
>
>
> Currently the cluster startup process on Amazon EC2 instances is very unstable. As the number of nodes to be started increases, the startup process fails more often, but sometimes even a 2-3 node startup fails. We don't know how many instance startups are going on at the same time on the Amazon side when it fails or when it starts up successfully. The only thing I see is that when I start around 10 nodes, the proportion of failing nodes is higher than with a smaller number of nodes, and it is not directly proportional to the number of nodes; it looks like there is an exponentially higher probability that some nodes fail.
> Looking into BootstrapClusterAction.java, there is a note "// TODO: Check for RunNodesException and don't bail out if only a few " which indicates the currently unreliable startup process. So we should improve it.
> We could add a "max percent failure" property (per instance template), so that if the number of failures exceeds this value the whole cluster fails to launch and is shut down. For the master node the value would be 100%, but for datanodes it would be more like 75%. (Tom White also mentioned this in an email.)
> Let's discuss whether there are any other requirements for this improvement.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.