Posted to dev@stratos.apache.org by "Shaheedur Haque (shahhaqu)" <sh...@cisco.com> on 2015/07/07 13:09:45 UTC

Problem with Stratos 4.1 autohealing (was RE: Problems with Stratos 4.1 failover and autohealing)

I met with several members of the WSO2 team, and we agreed the following next steps:


1.      Treat the “Stratos failover” issue as separate from the “duplicate instances” issue, and use a new thread for the former, keeping this thread for the latter.

2.      The lack of synchronisation between the cleanup of the faulty VMs and the startup of the replacement VMs was noted.

a.      It was mentioned that the period of overlap should be seconds, perhaps up to 60s.

b.      I noted that for many of our scenarios, where the hardware is 1:1 mapped to VMs (i.e. neither over-subscribed nor under-subscribed), the overlap may be a problem. I will consult with others and confirm. Imesh ack’d this point.

c.      I will provide the wso2 logs for my test case (summarised below).

d.      I will check with the test team, where the problem was originally detected, whether this is transient (up to 60s) or long-lived, and gather logs as needed.

From: Imesh Gunaratne [mailto:imesh@apache.org]
Sent: 06 July 2015 17:11
To: dev
Subject: Re: Problems with Stratos 4.1 failover and autohealing

On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <ra...@wso2.com> wrote:

If it detects that cartridge agent has not published any health event within the given time window (60mins), it will publish the MemberFault event.

This should be 60 seconds.

On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <ra...@wso2.com> wrote:
Hi Shaheed,

Regarding the Stratos fail-over scenario: looking at the logs, it seems that there are multiple member fault cases detected by CEP. The FaultHandlingWindowProcessor [1] running inside CEP is implemented to check for health stats published by cartridge agents. It does not check whether the actual VM, or the processes running inside the VM, are active or not. If it detects that cartridge agent has not published any health event within the given time window (60mins), it will publish the MemberFault event.
Ideally this should not happen regardless of whether the deployment is HA or not, because FaultHandlingWindowProcessor counts the (60mins) interval from the point the Stratos server started. The only reason I can think of that may have caused these faulty member logs is that Stratos (or the external CEP) did not receive the health stats published by the cartridge agents on time.
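
To make the mechanism concrete, here is a minimal sketch (class and method names are invented for illustration, not the actual FaultHandlingWindowProcessor API) of a check that flags a member as faulty when no health event has arrived within the window:

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class HealthWindowChecker {
    private static final long TIME_WINDOW_MS = 60_000; // 60 seconds, per Imesh's correction above
    private final Map<String, Long> lastHealthEvent = new ConcurrentHashMap<>();

    // Called whenever a cartridge agent publishes a health event for a member.
    public void onHealthEvent(String memberId, long timestampMs) {
        lastHealthEvent.put(memberId, timestampMs);
    }

    // Called periodically; returns true if the member should be reported as faulty
    // because no health event was seen within the time window.
    public boolean isFaulty(String memberId, long nowMs) {
        Long last = lastHealthEvent.get(memberId);
        return last != null && (nowMs - last) > TIME_WINDOW_MS;
    }
}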

Are you running Stratos in single-JVM mode? If so, I'd highly recommend that you deploy the CEP profile separately. Please refer to [2] for the reference deployment architecture of a Stratos HA setup.

Regarding your second concern (kill multiple Cartridges): I believe your concern is that the first message relating to recovery of the CPS came 66s after faulty member detection, and that recovery of other VMs started before that. I'd like to summarize a few points regarding the Stratos recovery process which I think will explain that behavior.

 - Stratos detects faulty members only by checking the health stats published by the cartridge agent, not by checking the existence of the actual VM.
 - Upon detecting such a faulty member, CEP will publish a faulty member event; after receiving such an event, AS will move that instance to the obsolete list. AS will iterate over this obsolete list and try to terminate those instances via CC. When the termination is complete, CC will publish a member terminated event; upon receiving that, AS will remove that member from the obsolete list.
 - If CC fails to terminate the instance, it will keep retrying until the obsolete member timeout interval has passed. When the timeout expires, AS will forcefully remove that member from the obsolete list.
 - Stratos will not try to re-spawn the faulty instance right after detecting it. This recovery process is triggered by the Cluster monitor object. There is Drools logic which gets executed periodically (the monitor interval is configurable, default is 90s) and which will execute the min-check rule, scale-up rule and scale-down rule and decide whether to spawn new instances. So this recovery process could take some time depending on your monitor interval. (A rough sketch of this flow is given after this list.)
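
As a rough sketch of that flow (illustrative names only, not the Stratos classes; the 90s monitor interval and the obsolete timeout are just the defaults mentioned in this thread), the obsolete-member terminator and the min-check rule run on independent schedules, which is why a replacement can be requested while the old VM is still being cleaned up:

import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class RecoveryFlowSketch {
    private static final long OBSOLETE_TIMEOUT_MS = 86_400_000L; // default obsolete member timeout (1 day)
    private static final long MONITOR_INTERVAL_S = 90;           // default cluster monitor interval
    private final Set<String> activeMembers = ConcurrentHashMap.newKeySet();
    private final Map<String, Long> obsoleteMembers = new ConcurrentHashMap<>();
    private final int minInstances = 1;

    // CEP published a MemberFault event: move the member from active to obsolete.
    public void onMemberFault(String memberId) {
        activeMembers.remove(memberId);
        obsoleteMembers.put(memberId, System.currentTimeMillis());
    }

    public void start() {
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(2);

        // Obsolete-member watcher: keeps asking the cloud controller to terminate,
        // and forcefully stops tracking the member once the timeout has passed.
        scheduler.scheduleAtFixedRate(() -> {
            long now = System.currentTimeMillis();
            obsoleteMembers.forEach((memberId, obsoletedAt) -> {
                if (now - obsoletedAt > OBSOLETE_TIMEOUT_MS) {
                    obsoleteMembers.remove(memberId); // give up tracking it
                } else {
                    System.out.println("terminate via CC: " + memberId); // retried until 'member terminated' arrives
                }
            });
        }, 0, 15, TimeUnit.SECONDS);

        // Min-check rule: runs on the monitor interval and only looks at the active
        // count, so it can request a replacement while the old VM is still being cleaned up.
        scheduler.scheduleAtFixedRate(() -> {
            if (activeMembers.size() < minInstances) {
                String replacement = "member-" + System.nanoTime();
                System.out.println("spawn via CC: " + replacement);
                activeMembers.add(replacement); // simplified: a real member only becomes active once it joins the topology
            }
        }, 0, MONITOR_INTERVAL_S, TimeUnit.SECONDS);
    }
}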


So, in summary:
- There is no way to control which cluster monitor gets executed first, so it is possible that CPS recovery takes place after the other VMs.
- The reason for seeing a lot of termination logs is that Stratos continuously tries to terminate those faulty members until the obsolete member timeout has expired. This could occur if the actual VM was killed manually or by other means.

However, we need to figure out why you got these faulty member events. Could you send us the CEP trace log for further analysis? Also you can enable debug logs for FaultHandlingWindowProcessor class to get more insights as to what's going on.
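
For reference, on a log4j 1.x based pack (which the Stratos/CEP distributions use) enabling that debug logger would be a single line along these lines in the pack's log4j.properties; treat the exact file location as specific to your build:

log4j.logger.org.apache.stratos.cep.extension.FaultHandlingWindowProcessor=DEBUG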

I hope this would explain your concerns. We can have a Hangout session to discuss this further.

[1] https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
[2] https://cwiki.apache.org/confluence/display/STRATOS/4.1.0+Configuring+HA+in+Single+JVM+Mode+on+EC2

Thanks.

On Mon, Jul 6, 2015 at 5:00 PM, Shaheedur Haque (shahhaqu) <sh...@cisco.com> wrote:
Hi,

I am seeing some issues with Stratos 4.1, around failover on the one hand (i.e. stop one running instance of Stratos, start another as a replacement) and autohealing on the other (i.e. a Cartridge dies and needs to be replaced by Stratos), which feel as though they might have a common cause, so I am lumping them together for now. I’ll explain the two scenarios as best I can…and then attach some logs explaining one scenario in detail.

Failover of Stratos

The situation here is that I have 3 servers, all running MySQL in active-active mode, and Stratos running on at most one of them under Pacemaker control. The test case is that I kill the server (actually a VM) running Stratos, and Pacemaker responds by starting Stratos on one of the remaining servers. Around the time of the action, there is no ongoing REST activity in terms of defining new Cartridges, or Applications or anything like that.


•        Expected result: the replacement Stratos instance “acquires” the running Cartridges, and the system continues without killing/replacing any Cartridges.

•        Actual result: all Cartridges are killed and restarted.

Kill multiple Cartridges

I have a test where several Cartridges are killed at the same time, as follows. We have a Cartridge Proxy Server (CPS) that hosts the (Java) Cartridge Agent for VMs which cannot run one natively. The CPS therefore runs N+1 JCAs, one for itself plus N for the Cartridges it is proxying. Killing the CPS therefore causes all N+1 JCAs to disappear, and from a Stratos point of view, all N+1 VMs need to be restarted.


•        Expected result: each Cartridge is replaced by Stratos.

•        Actual result: looking at the Stratos logs, it seems that there is a race between the termination logic, which cleans up the old cluster members as they die, and the startup logic, which spins up replacement VMs for the terminated ones.

Both cases previously seemed to work fine, where we *think* “previously” means versions of Stratos 4.1 dating back ~3 weeks or so. I attach a series of log extracts for the second scenario; I suspect this also covers the first, and even if not, at least we can start here.


1.      Original spin-up of the CPS is at 2015-06-11 19:50:20,585. The CPS fault is seen around 2015-06-12 18:54:51; contemporaneously, all the other faults are seen:

TID: [0] [STRATOS] [2015-06-12 18:54:51,788]  INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty member detected [member-id] di-000-007.di-000-007.cisco-qvpc-sf-0.domain88ef85b5-5ec6-4d4e-b7b2-1ab9d3f7b160 with [last time-stamp] 1434135185318 [time-out] 60000 milliseconds
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,803]  INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty member detected [member-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8 with [last time-stamp] 1434135192556 [time-out] 60000 milliseconds
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,837]  INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty member detected [member-id] cartridge-proxy.cartridge-proxy.cartridge-proxy.domain95a58a3e-e7a9-4071-b42b-3551bad25e6a with [last time-stamp] 1434135189149 [time-out] 60000 milliseconds
...
etc
...
TID: [0] [STRATOS] [2015-06-12 18:54:51,862]  INFO {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty member detected [member-id] di-000-010.di-000-010.cisco-qvpc-sf-0.domain2ab24c1d-53cb-452e-8aa2-f02eecf2db78 with [last time-stamp] 1434135193054 [time-out] 60000 milliseconds
...
etc
...


2.      The last message relating to the CPS failure is at 18:54:52. The first message relating to recovery of the CPS is some 66 seconds later:

TID: [0] [STRATOS] [2015-06-12 18:55:58,516]  INFO {org.apache.stratos.autoscaler.rule.RuleLog} -  [min-check] Partition available, hence trying to spawn an instance to fulfil minimum count! [cluster] cartridge-proxy.cartridge-proxy.cartridge-proxy.domain


3.      However, recovery of the other VMs started from as much as 54 seconds before this point to as much as 26 seconds after:

TID: [0] [STRATOS] [2015-06-12 18:55:04,564]  INFO {org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to spawn an instance via cloud controller: [cluster] di-000-008.di-000-008.cisco-qvpc-sf-0.domain [partition] whole-region [network-partition-id] RegionOne
...
TID: [0] [STRATOS] [2015-06-12 18:56:20,603]  INFO {org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to spawn an instance via cloud controller: [cluster] di-000-007.di-000-007.cisco-qvpc-sf-0.domain [partition] whole-region [network-partition-id] RegionOne


4.      Between points 2 and 3 (with some fuzz for the races), I see Stratos cleaning up the old VMs. For example:

TID: [0] [STRATOS] [2015-06-12 18:55:04,580]  INFO {org.apache.stratos.cloud.controller.iaases.JcloudsIaas} -  Starting to terminate member: [cartridge-type] cisco-qvpc-sf-0 [member-id] di-000-008.di-000-008.cisco-qvpc-sf-0.domaina0f15eed-9599-4f3e-a70a-93ddd02ccf5f


5.      However, around that point, look at this:

TID: [0] [STRATOS] [2015-06-12 18:55:12,835]  INFO {org.apache.stratos.autoscaler.rule.RuleLog} -  [scale-up] Trying to scale up over max, hence not scaling up cluster itself and
                        notifying to parent for possible group scaling or app bursting.
                        [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1 [max] 1



6.      Is it possible that there is a race between the terminations and the restarts? It certainly seems so, though detailed knowledge of Stratos is needed to confirm it:

TID: [0] [STRATOS] [2015-06-12 18:55:12,798]  INFO {org.apache.stratos.autoscaler.rule.RuleLog} -  [min-check] Partition available, hence trying to spawn an instance to fulfil minimum count! [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
TID: [0] [STRATOS] [2015-06-12 18:55:12,798]  INFO {org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to spawn an instance via cloud controller: [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [partition] whole-region [network-partition-id] RegionOne
TID: [0] [STRATOS] [2015-06-12 18:55:12,799]  INFO {org.apache.stratos.common.client.CloudControllerServiceClient} -  Terminating instance via cloud controller: [member] di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
TID: [0] [STRATOS] [2015-06-12 18:55:12,812]  INFO {org.apache.stratos.cloud.controller.messaging.publisher.TopologyEventPublisher} -  Publishing member created event: [service-name] cisco-qvpc-sf-0 [cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [cluster-instance-id] di-000-005-1 [member-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d [instance-id] null [network-partition-id] RegionOne [partition-id] whole-region [lb-cluster-id] null
TID: [0] [STRATOS] [2015-06-12 18:55:12,817]  INFO {org.apache.stratos.messaging.message.processor.topology.MemberCreatedMessageProcessor} -  Member created: [service-name] cisco-qvpc-sf-0 [cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [member-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d [cluster-instance-id] di-000-005-1
TID: [0] [STRATOS] [2015-06-12 18:55:12,834] ERROR {org.apache.stratos.autoscaler.rule.RuleTasksDelegator} -  Request in flight threshold is Zero
TID: [0] [STRATOS] [2015-06-12 18:55:12,835]  INFO {org.apache.stratos.autoscaler.rule.RuleLog} -  [scale-up] Trying to scale up over max, hence not scaling up cluster itself and
                        notifying to parent for possible group scaling or app bursting.
                        [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1 [max] 1

(I tried to file a JIRA on this, but I’m having a bit of trouble, hence posting here to get the ball rolling).

Thanks, Shaheed





--
Akila Ravihansa Perera
Software Engineer, WSO2

Blog: http://ravihansa3000.blogspot.com



--
Imesh Gunaratne

Senior Technical Lead, WSO2
Committer & PMC Member, Apache Stratos

Re: Problem with Stratos 4.1 autohealing (was RE: Problems with Stratos 4.1 failover and autohealing)

Posted by Reka Thirunavukkarasu <re...@wso2.com>.
Hi Shaheed,

Shall we have a call to discuss the issue further?

Thanks,
Reka


-- 
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007

RE: Problem with Stratos 4.1 autohealing (was RE: Problems with Stratos 4.1 failover and autohealing)

Posted by "Shaheedur Haque (shahhaqu)" <sh...@cisco.com>.
Hi Reka,

First, this test was likely done with Stratos from about a month ago.

Second, as discussed, the test scenario here was that on 12-Jun, around 18:53, the “cartridge-proxy” server was killed. Since the cartridge-proxy is a Cartridge (with its own Java Cartridge Agent) which also hosts the Cartridge Agents for other VMs, when it was killed Stratos detected *all* the VMs as down (i.e. the cartridge-proxy as well as the ones associated with the extra Cartridge Agents). Thus, from a Stratos point of view, we are expecting lots of member termination activity followed by recovery activity.

As per the call, the problem here is that there is a race between the two, and sometimes we end up with some subscriptions (each is a single Cartridge Application, with max=1) having multiple VMs instead of the max=1 setting being honoured.

(I’m not sure I understand the significance of the logs you quoted from 15-Jun, but I am assuming that is not directly related to these events).

Hope that clears it up, if not, let’s do a call to sync up.

Thanks, Shaheed


Re: Problem with Stratos 4.1 autohealing (was RE: Problems with Stratos 4.1 failover and autohealing)

Posted by Reka Thirunavukkarasu <re...@wso2.com>.
Hi Shaheed,

One more observation, as below. This could be the reason why there were lots
of faulty member detections on the CEP side: the member did not actually get
terminated from the IaaS or the topology, but rather was just removed from
the obsolete list once the obsolete timeout was exceeded. The Autoscaler
should keep the member in the obsolete list for 1 day, as the obsolete
timeout is set to 1 day; it is unlikely that the member got removed within a
few seconds. As I have tested with the latest code base (4.1.0-rc4), I could
see that the autoscaler keeps the obsolete member for 1 day. Can you let us
know whether you are testing with the latest Stratos build? We have already
fixed an issue of terminating the instance once before removing it from the
obsolete list.


Fault member detection
-------------------------------
TID: [0] [STRATOS] [2015-06-15 19:47:20,960]  INFO
{org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty
member detected [member-id]
di-000-009.di-000-009.cisco-qvpc-sf-0.domainc8ade134-ff57-4c02-a546-856e7197d78c
with [last time-stamp] 1434397528340 [time-out] 60000 milliseconds


Obsolete member removal
-----------------------------------

TID: [0] [STRATOS] [2015-06-15 19:47:35,525]  INFO
{org.apache.stratos.autoscaler.context.partition.ClusterLevelPartitionContext$ObsoletedMemberWatcher}
-  Obsolete state of member is expired, member will be disposed and will
not be tracked anymore [obsolete member]
di-000-009.di-000-009.cisco-qvpc-sf-0.domainc8ade134-ff57-4c02-a546-856e7197d78c
[expiry time] 86400000 [cluster]
di-000-009.di-000-009.cisco-qvpc-sf-0.domain [cluster instance] di-000-009-1
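
The two log lines above restate the point: the member was disposed roughly 15 seconds after the fault was detected, nowhere near the 86400000 ms expiry. A throwaway check (not Stratos code; the timestamps are copied from the log lines above) just to make the arithmetic explicit:

import java.time.Duration;
import java.time.LocalDateTime;

public class ObsoleteExpiryCheck {
    public static void main(String[] args) {
        LocalDateTime faultDetected = LocalDateTime.parse("2015-06-15T19:47:20.960");
        LocalDateTime obsoleteRemoved = LocalDateTime.parse("2015-06-15T19:47:35.525");
        long expiryMs = 86_400_000L; // [expiry time] from the log above, i.e. 1 day

        long gapMs = Duration.between(faultDetected, obsoleteRemoved).toMillis();
        System.out.println("gap between fault and disposal = " + gapMs + " ms");      // 14565 ms
        System.out.println("removed before expiry? " + (gapMs < expiryMs));           // true, by almost a full day
    }
}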

If you encounter this again, can you enable debug logs for the autoscaler and
share the logs with us?

Thanks,
Reka


On Mon, Jul 13, 2015 at 4:45 PM, Reka Thirunavukkarasu <re...@wso2.com>
wrote:

> Hi Shaheed,
>
> As I went through the logs for the "*duplicated cartridges*" case, the
> observation is actually correct, and it is what is expected from the
> Stratos side as well. The expected flow is currently as below, as Akila
> explained. I have added a point where the termination and spawning can
> happen at the same time:
>
> 1. Faulty member is detected and received by the autoscaler.
> 2. Autoscaler immediately moves this member from the active list to the
> obsolete list.
> 3. The cluster monitor periodic task executes the minimum drool, then the
> scaling drool, with the obsolete drool running in parallel to the other
> drools.
> 4. There is a chance that, while the obsolete drool is terminating the
> instance, the minimum drool can start a new instance, as both are executed
> in two different threads.
> 5. After the minimum drool has executed, the scaling drool will get
> executed. In that, according to the cluster-based stats, it can take any
> decision to scale up/down.
>
> Please let us know whether this behavior has any impact on your scenarios.
>
> Thanks,
> Reka
>
>
>
>
>
> On Tue, Jul 7, 2015 at 4:39 PM, Shaheedur Haque (shahhaqu) <
> shahhaqu@cisco.com> wrote:
>
>>  I met with several of the wso2 team, and we agreed the following next
>> steps:
>>
>>
>>
>> 1.      Treat the “Stratos failover” issue as separate from the
>> “duplicate instances” issue, and use a new thread for the former, keeping
>> this thread for the latter.
>>
>> 2.      The lack of synchronisation between the cleanup of the faulty
>> VMs and the startup of the replacement VMs was noted.
>>
>> a.      It was mentioned that the period of overlap should be seconds,
>> perhaps upto 60s.
>>
>> b.      I noted that for many of our scenarios, where the hardware is
>> 1:1 mapped to VMs (i.e. neither over-subscribed nor under-subscribed), the
>> overlap may be a problem. I will consult with others and confirm. Imesh
>> ack’d this point.
>>
>> c.      I will provide the wso2 logs for my test case (summarised below).
>>
>> d.      I will check with the test team where the problem was originally
>> detected if this is a transient  (upto 60s) or long-lived, and gather logs
>> as needed.
>>
>>
>>
>> *From:* Imesh Gunaratne [mailto:imesh@apache.org]
>> *Sent:* 06 July 2015 17:11
>> *To:* dev
>> *Subject:* Re: Problems with Stratos 4.1 failover and autohealing
>>
>>
>>
>> On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <
>> ravihansa@wso2.com> wrote:
>>
>>
>>
>>  If it detects that cartridge agent has not published any health event
>> within the given time window (60mins), it will publish the MemberFault
>> event.
>>
>>
>>
>>  This should be 60 seconds.
>>
>>
>>
>> On Mon, Jul 6, 2015 at 8:13 PM, Akila Ravihansa Perera <
>> ravihansa@wso2.com> wrote:
>>
>> Hi Shaheed,
>>
>>
>> Regarding Stratos fail-over scenario, by looking at the logs it seems
>> that there are multiple member fault cases detected by CEP. This
>> FaultHandlingWindowProcessor [1] running inside CEP is implemented to check
>> for health stats published by cartridge agents. It does not check whether
>> actual VM or processes running inside the VM is active or not. If it
>> detects that cartridge agent has not published any health event within the
>> given time window (60mins), it will publish the MemberFault event.
>>
>> Ideally this should not happen regardless of whether the deployment is HA
>> or not. Because FaultHandlingWindowProcessor will count the 60mins interval
>> from the point Stratos server started. Only reason I could think of which
>> may have caused these faulty member logs is Stratos (or external CEP) did
>> not receive health stats published by cartridge agents on time.
>>
>>
>>
>> Are you running Stratos in single-JVM mode? If so, I'd highly recommend
>> that you deploy the CEP profile separately. Please refer to [2] for
>> reference deployment architecture of Stratos HA setup.
>>
>>
>>
>> Regarding your second concern; Kill multiple Cartridges, I believe your
>> concern is that the first message relating to recovery of the CPS came
>> 66s after faulty member detection and recovery of other VMs started before
>> that. I'd like to summarize few points regarding Stratos recovery process
>> which I think will explain that behavior.
>>
>>
>>
>>  - Stratos detects faulty members only by checking the health stats
>> published by cartridge agent and not by checking the existence of the
>> actual VM.
>>
>>  - Upon detecting such faulty member, CEP will publish a faulty member
>> event and after receiving such event AS will move that instance to obsolete
>> list. AS will iterate over this obsolete list and try to terminate those
>> instances via CC. When the termination is complete, CC will publish member
>> terminated event, upon receiving that AS will remove that member from
>> obsolete list.
>>
>>  - If CC fails to terminate the instance, it will keep retrying until an
>> obsolete member timeout interval has passed. When the timeout expires, AS
>> will forcefully remove that member from the obsolete list.
>>
>>  - Stratos will not try to re-spawn the faulty instance right after
>> detecting it. This recovery process is triggered by the Cluster monitor
>> object. There is Drools logic which gets executed periodically (the monitor
>> interval is configurable; the default is 90s) and which will execute the
>> min-check rule, scale-up rule and scale-down rule and decide whether to
>> spawn new instances. So this recovery process could take some time
>> depending on your monitor interval.
>>
>>
>>
>>
>>
>> So, in summary:
>>
>> - There is no way to control which cluster monitor gets executed first,
>> so it is possible that CPS recovery takes place after other VMs.
>>
>> - The reason for seeing a lot of termination logs is that Stratos
>> continuously tries to terminate those faulty members until the obsolete
>> member timeout has expired. This could occur if the actual VM was killed
>> manually or by other means.
>>
>>
>>
>> However, we need to figure out why you got these faulty member events.
>> Could you send us the CEP trace log for further analysis? Also you can
>> enable debug logs for FaultHandlingWindowProcessor class to get more
>> insights as to what's going on.
>>
>>
>>
>> I hope this would explain your concerns. We can have a Hangout session to
>> discuss this further.
>>
>>
>>
>> [1]
>> https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
>>
>> [2]
>> https://cwiki.apache.org/confluence/display/STRATOS/4.1.0+Configuring+HA+in+Single+JVM+Mode+on+EC2
>>
>>
>>
>> Thanks.
>>
>>
>>
>> On Mon, Jul 6, 2015 at 5:00 PM, Shaheedur Haque (shahhaqu) <
>> shahhaqu@cisco.com> wrote:
>>
>> Hi,
>>
>>
>>
>> I am seeing some issues with Stratos 4.1 around failover on the one
>> hand (i.e. stop one running instance of Stratos, start another as a
>> replacement) and autohealing (i.e. a Cartridge dies and needs to be
>> replaced by Stratos), which feel as though they might have a common cause,
>> so I am lumping them together for now. I’ll explain the two scenarios as
>> best I can… and then attach some logs explaining one scenario in detail.
>>
>>
>>
>> *Failover of Stratos*
>>
>>
>>
>> The situation here is that I have 3 servers, all running MySQL in
>> active-active mode, and Stratos running on at most one of them under
>> Pacemaker control. The test case is that I kill the server (actually a VM)
>> running Stratos, and Pacemaker responds by starting Stratos on one of the
>> remaining servers. Around the time of the action, there is no ongoing REST
>> activity in terms of defining new Cartridges, or Applications or anything
>> like that.
>>
>>
>>
>> ·        Expected result: the replacement Stratos instance “acquires”
>> the running Cartridges, and the system continues without killing/replacing
>> any Cartridges.
>>
>> ·        Actual result: all Cartridges are killed and restarted.
>>
>>
>>
>> *Kill multiple Cartridges*
>>
>>
>>
>> I have a test where several Cartridges are killed at the same time as
>> follows. We have a Cartridge Proxy Server (CPS) that hosts the (Java)
>> Cartridge agent for VMs which cannot run one natively. The CPS therefore
>> runs N+1 JCAs, one for itself plus N for the Cartridges it is proxying.
>> Killing the CPS therefore causes all N+1 JCAs to disappear, and from a
>> Stratos point of view, all N+1 VMs need to be restarted.
>>
>>
>>
>> ·        Expected result: each Cartridge is replaced by Stratos.
>>
>> ·        Actual result: looking at the Stratos logs, it seems that there
>> is a race between the termination logic, which cleans up old cluster
>> members as they die, and the startup logic, which spins up replacement VMs
>> for the terminated ones.
>>
>>
>>
>> Both cases previously seemed to work fine, where we **think** previously
>> means versions of Stratos 4.1 dating back ~3 weeks or so. I attach a series
>> of log extracts for the second scenario; I suspect this also covers the
>> first, but even if not, at least we can start here.
>>
>>
>>
>> 1.      Original spin-up of the CPS is at 2015-06-11 19:50:20,585. The CPS
>> fault is seen around 2015-06-12 18:54:51; contemporaneously, all the other
>> faults are seen:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:54:51,788]  INFO
>> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty
>> member detected [member-id]
>> di-000-007.di-000-007.cisco-qvpc-sf-0.domain88ef85b5-5ec6-4d4e-b7b2-1ab9d3f7b160
>> with [last time-stamp] 1434135185318 [time-out] 60000 milliseconds
>>
>> ...
>>
>> TID: [0] [STRATOS] [2015-06-12 18:54:51,803]  INFO
>> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty
>> member detected [member-id]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
>> with [last time-stamp] 1434135192556 [time-out] 60000 milliseconds
>>
>> ...
>>
>> TID: [0] [STRATOS] [2015-06-12 18:54:51,837]  INFO
>> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty
>> member detected [member-id]
>> cartridge-proxy.cartridge-proxy.cartridge-proxy.domain95a58a3e-e7a9-4071-b42b-3551bad25e6a
>> with [last time-stamp] 1434135189149 [time-out] 60000 milliseconds
>>
>> ...
>>
>> etc
>>
>> ...
>>
>> TID: [0] [STRATOS] [2015-06-12 18:54:51,862]  INFO
>> {org.apache.stratos.cep.extension.FaultHandlingWindowProcessor} -  Faulty
>> member detected [member-id]
>> di-000-010.di-000-010.cisco-qvpc-sf-0.domain2ab24c1d-53cb-452e-8aa2-f02eecf2db78
>> with [last time-stamp] 1434135193054 [time-out] 60000 milliseconds
>>
>> ...
>>
>> etc
>>
>> ...
>>
>>
>>
>> 2.      The last message relating to the CPS failure is at 18:54:52. The
>> first message relating to recovery of the CPS is some 66 seconds later:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:58,516]  INFO
>> {org.apache.stratos.autoscaler.rule.RuleLog} -  [min-check] Partition
>> available, hence trying to spawn an instance to fulfil minimum count!
>> [cluster] cartridge-proxy.cartridge-proxy.cartridge-proxy.domain
>>
>>
>>
>> 3.      However, recovery of other VMs started anywhere from 54 seconds
>> before this point to 26 seconds after it:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:04,564]  INFO
>> {org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to
>> spawn an instance via cloud controller: [cluster]
>> di-000-008.di-000-008.cisco-qvpc-sf-0.domain [partition] whole-region
>> [network-partition-id] RegionOne
>>
>> ...
>>
>> TID: [0] [STRATOS] [2015-06-12 18:56:20,603]  INFO
>> {org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to
>> spawn an instance via cloud controller: [cluster]
>> di-000-007.di-000-007.cisco-qvpc-sf-0.domain [partition] whole-region
>> [network-partition-id] RegionOne
>>
>>
>>
>> 4.      Between points 2 and 3 (with fuzz for the races), I see Stratos
>> cleaning up the old VMs. For example:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:04,580]  INFO
>> {org.apache.stratos.cloud.controller.iaases.JcloudsIaas} -  Starting to
>> terminate member: [cartridge-type] cisco-qvpc-sf-0 [member-id]
>> di-000-008.di-000-008.cisco-qvpc-sf-0.domaina0f15eed-9599-4f3e-a70a-93ddd02ccf5f
>>
>>
>>
>> 5.      However, around that point, look at this:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,835]  INFO
>> {org.apache.stratos.autoscaler.rule.RuleLog} -  [scale-up] Trying to scale
>> up over max, hence not scaling up cluster itself and
>>
>>                         notifying to parent for possible group scaling or
>> app bursting.
>>
>>                         [cluster]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1
>> [max] 1
>>
>>
>>
>>
>>
>> 6.      Is it possible that there is a race between the terminations and
>> restarts? It certainly seems so, though a detailed knowledge of Stratos is
>> needed to confirm it:
>>
>>
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,798]  INFO
>> {org.apache.stratos.autoscaler.rule.RuleLog} -  [min-check] Partition
>> available, hence trying to spawn an instance to fulfil minimum count!
>> [cluster] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,798]  INFO
>> {org.apache.stratos.autoscaler.client.CloudControllerClient} -  Trying to
>> spawn an instance via cloud controller: [cluster]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [partition] whole-region
>> [network-partition-id] RegionOne
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,799]  INFO
>> {org.apache.stratos.common.client.CloudControllerServiceClient} -
>> Terminating instance via cloud controller: [member]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain4bcdd49a-985f-4247-a70b-31c8d65153d8
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,812]  INFO
>> {org.apache.stratos.cloud.controller.messaging.publisher.TopologyEventPublisher}
>> -  Publishing member created event: [service-name] cisco-qvpc-sf-0
>> [cluster-id] di-000-005.di-000-005.cisco-qvpc-sf-0.domain
>> [cluster-instance-id] di-000-005-1 [member-id]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
>> [instance-id] null [network-partition-id] RegionOne [partition-id]
>> whole-region [lb-cluster-id] null
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,817]  INFO
>> {org.apache.stratos.messaging.message.processor.topology.MemberCreatedMessageProcessor}
>> -  Member created: [service-name] cisco-qvpc-sf-0 [cluster-id]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [member-id]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domaine9e4111c-abba-4979-af07-d13d90bbd84d
>> [cluster-instance-id] di-000-005-1
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,834] ERROR
>> {org.apache.stratos.autoscaler.rule.RuleTasksDelegator} -  Request in
>> flight threshold is Zero
>>
>> TID: [0] [STRATOS] [2015-06-12 18:55:12,835]  INFO
>> {org.apache.stratos.autoscaler.rule.RuleLog} -  [scale-up] Trying to scale
>> up over max, hence not scaling up cluster itself and
>>
>>                         notifying to parent for possible group scaling or
>> app bursting.
>>
>>                         [cluster]
>> di-000-005.di-000-005.cisco-qvpc-sf-0.domain [instance id]di-000-005-1
>> [max] 1
>>
>>
>>
>> (I tried to file a JIRA on this, but I’m having a bit of trouble, hence
>> posting here to get the ball rolling).
>>
>>
>>
>> Thanks, Shaheed
>>
>>
>>
>>
>>
>>
>>
>>
>>
>> --
>>
>> Akila Ravihansa Perera
>> Software Engineer, WSO2
>>
>> Blog: http://ravihansa3000.blogspot.com
>>
>>
>>
>>
>>
>> --
>>
>> Imesh Gunaratne
>>
>>
>>
>> Senior Technical Lead, WSO2
>>
>> Committer & PMC Member, Apache Stratos
>>
>
>
>
> --
> Reka Thirunavukkarasu
> Senior Software Engineer,
> WSO2, Inc.:http://wso2.com,
> Mobile: +94776442007
>
>
>


-- 
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007

Re: Problem with Stratos 4.1 autohealing (was RE: Problems with Stratos 4.1 failover and autohealing)

Posted by Reka Thirunavukkarasu <re...@wso2.com>.
Hi Shaheed,

As I went through the logs for the "*duplicated cartridges*" case, the
observation is correct, and it is what is expected from the Stratos side as
well. The expected flow is currently as below, as Akila explained. I have
added a point where termination and spawning can happen at the same time:

1. A faulty member event is detected and received by the autoscaler.
2. The autoscaler immediately moves that member from the active list to the
obsolete list.
3. The cluster monitor's periodic task executes the minimum (min-check)
Drools rule, then the scaling Drools rule, with the obsolete Drools rule
running in parallel to the other rules.
4. There is a chance that, while the obsolete rule is terminating the
instance, the min-check rule starts a new instance, because the two are
executed in different threads (see the sketch below this list).
5. After the min-check rule has executed, the scaling rule gets executed;
based on the cluster-level stats, it can then decide to scale up or down.
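
To make points 3 and 4 concrete, here is a minimal, self-contained Java
sketch (this is not Stratos code; the class, field and member names are
invented for the illustration) of two periodic tasks sharing cluster state:
one drains an obsolete list while the other enforces a minimum count from
the active list only, so a replacement can be spawned before termination of
the old member has completed:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Toy model only: one periodic task drains the obsolete list while another
// enforces the minimum count from the active list, mirroring steps 3-4 above.
public class MonitorRaceSketch {

    static final Set<String> active = ConcurrentHashMap.newKeySet();
    static final Set<String> obsolete = ConcurrentHashMap.newKeySet();
    static final AtomicInteger seq = new AtomicInteger();
    static final int MIN_COUNT = 1;

    public static void main(String[] args) throws InterruptedException {
        // A fault has already been detected: member-0 has been moved off the
        // active list and onto the obsolete list.
        obsolete.add("member-0");

        ScheduledExecutorService pool = Executors.newScheduledThreadPool(2);

        // "Obsolete rule" task: keeps trying to terminate obsolete members.
        // The sleep stands in for the time the IaaS takes to terminate the VM.
        pool.scheduleAtFixedRate(() -> {
            for (String m : obsolete) {
                System.out.println("terminating obsolete " + m);
                try {
                    Thread.sleep(300);
                } catch (InterruptedException e) {
                    return;
                }
                obsolete.remove(m); // member terminated event received
            }
        }, 0, 500, TimeUnit.MILLISECONDS);

        // "Min-check rule" task: looks only at the active list, so it can
        // spawn a replacement while the old member is still being terminated.
        pool.scheduleAtFixedRate(() -> {
            if (active.size() < MIN_COUNT) {
                String m = "member-" + seq.incrementAndGet();
                active.add(m);
                System.out.println("min-check spawning " + m
                        + " while obsolete=" + obsolete);
            }
        }, 0, 500, TimeUnit.MILLISECONDS);

        Thread.sleep(1500);
        pool.shutdownNow();
    }
}

In Stratos the two tasks correspond to the obsolete-member cleanup and the
min-check rule run by the cluster monitor, which is why a short overlap
between the old VM and its replacement is possible.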

Please let us know whether this behavior has any impact on your scenarios.

Thanks,
Reka





-- 
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007

Re: Problem with Stratos 4.1 autohealing (was RE: Problems with Stratos 4.1 failover and autohealing)

Posted by Imesh Gunaratne <im...@apache.org>.
Thanks for the update Shaheed!

-- 
Imesh Gunaratne

Senior Technical Lead, WSO2
Committer & PMC Member, Apache Stratos