Posted to dev@stratos.apache.org by "Michiel Blokzijl (mblokzij)" <mb...@cisco.com> on 2014/06/23 18:23:11 UTC

stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Hi all,

Basically, I was stopping and starting Stratos to see how it handled dying cartridges, and found that Stratos only detects cartridge deaths while it is running.

The problem
In steady state, I have some cartridges managed by Stratos:

./stratos.sh list-subscribed-cartridges | grep samp
| cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant | cisco-sample-vm | Active | 1                 | cisco-sample-vm.foo.cisco.com |

nova list | grep samp
| 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE | None       | Running     | core=172.16.2.17, 10.86.205.231  |

All good. Now I stop Stratos and ActiveMQ, 'nova delete' the sample cartridge, and then start ActiveMQ and Stratos again.

Now, at first things look good:

./stratos.sh list-subscribed-cartridges | grep samp
| cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant | cisco-sample-vm | Inactive | 0                 | cisco-sample-vm.foo.cisco.com |

But then,

root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh list-subscribed-cartridges | grep samp
| cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant | cisco-sample-vm | Active | 1                 | cisco-sample-vm.foo.cisco.com |

# nova list | grep samp
# 

How did the cartridge become active without it actually being there? As far as I can tell, Stratos never recovers from this.

I found this bug here: https://issues.apache.org/jira/browse/STRATOS-234 - is this describing the issue I’m seeing? I was a little bit confused by the usage of the word “obsolete”.

Where to go next?
Now, I’ve done a little bit of digging, but I don’t yet have a full mental model of how everything fits together in Stratos - please could someone help me put the pieces together? :)

What I’m seeing is the following:
- The cluster monitor appears to be active:

TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG {org.apache.stratos.autoscaler.monitor.ClusterMonitor} -  Cluster monitor is running.. ClusterMonitor [clusterId=cisco-sample-vm.cisco-sample-v, serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1 [partitions] [org.apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0], autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, description=null], lbReferenceType=null] {org.apache.stratos.autoscaler.monitor.ClusterMonitor}

- It looks like the CEP FaultHandlingWindowProcessor usually detects inactive members. However, since this member was never active, the timeStampMap doesn’t contain an element for this member, so it’s never checked.
- I think the fault handling is triggered by a fault_message, but I didn’t manage to figure out where it’s coming from. Does anyone know what triggers it? (is it the CEP extension?)
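
To make the timeStampMap gap concrete, here is a minimal, hypothetical sketch of the kind of check I believe the CEP does (this is NOT the real FaultHandlingWindowProcessor code; all names are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch (NOT the real FaultHandlingWindowProcessor): members
// are only checked against the fault timeout if they have reported in at
// least once, because the check iterates over the timestamp map itself.
public class FaultCheckSketch {
    static final long TIMEOUT_MS = 60_000; // 60 s fault window

    // memberId -> timestamp (ms) of the last health event received
    final Map<String, Long> timeStampMap = new ConcurrentHashMap<>();

    void onHealthEvent(String memberId, long nowMs) {
        timeStampMap.put(memberId, nowMs);
    }

    // A member that was never active has no entry in timeStampMap, so this
    // loop can never report it faulty -- matching the behaviour above.
    List<String> findFaultyMembers(long nowMs) {
        List<String> faulty = new ArrayList<>();
        for (Map.Entry<String, Long> e : timeStampMap.entrySet()) {
            if (nowMs - e.getValue() > TIMEOUT_MS) {
                faulty.add(e.getKey());
            }
        }
        return faulty;
    }
}
```

An instance that was 'nova delete'd while Stratos was down never calls onHealthEvent, so nothing here would ever flag it.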

Anyway.. 

Questions
- How should Stratos detect after some downtime which cartridges are still there and which ones aren’t? (what was the intended design?)
- Why did the missing cartridge go “active”? Is this a result from restoring persistent state? (If I look in the registry I can see stuff under subscriptions/active, but not sure if that’s where it comes from)
- Who should be responsible for detecting the absence of an instance - the ClusterMonitor? That seems to be fed incorrect data, since it clearly thinks there are enough instances running. Which component has the necessary data?
- It looks like it’s possible to snapshot CEP state to make it semi-persistent. However, if I restarted Stratos after 2 min of downtime, wouldn’t it try to kill all the nodes, since the last reply was more than 60s ago? Also, snapshots would be periodic, so there’s still a window in which cartridges might “disappear”.
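
To illustrate that last concern with a tiny sketch (hypothetical numbers and names, not Stratos code): if a snapshot restores timestamps from before the downtime, every member is already past the 60 s window at startup unless the timestamps are reset on restore:

```java
// Hypothetical numbers, not Stratos code: shows why restoring a CEP
// snapshot as-is after a 2-minute outage would put every member past the
// 60 s fault window the moment Stratos starts checking again.
public class RestoreSketch {
    static final long TIMEOUT_MS = 60_000;

    // Would a member whose last event arrived at lastSeenMs be flagged
    // faulty when checked at nowMs?
    static boolean wouldBeFlagged(long lastSeenMs, long nowMs) {
        return nowMs - lastSeenMs > TIMEOUT_MS;
    }

    public static void main(String[] args) {
        long shutdown = 1_000_000L;          // last health event before shutdown
        long restart  = shutdown + 120_000L; // Stratos back up 2 min later
        // Timestamps restored unchanged: every member looks faulty at once.
        assert wouldBeFlagged(shutdown, restart);
        // Timestamps reset to "now" on restore: the grace period starts over.
        assert !wouldBeFlagged(restart, restart);
    }
}
```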

Thanks a lot and best regards!

Michiel

Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Posted by Isuru Haththotuwa <is...@apache.org>.
Hi Chris,


On Sat, Jun 28, 2014 at 11:54 AM, chris snow <ch...@gmail.com> wrote:

> Hi Reka, will this fix also need to get applied to 4.0.0?
>
Yes, AFAIU this should be applied as a patch to 4.0.0. But this problem
will only occur if data publishing is done to BAM.

> Hi all,
>>
>> On Wed, Jun 25, 2014 at 11:44 PM, Nirmal Fernando <nirmal070125@gmail.com
>> > wrote:
>>
>>>
>>> On Wed, Jun 25, 2014 at 10:51 PM, Imesh Gunaratne <im...@apache.org>
>>> wrote:
>>>
>>>> Hi Michiel,
>>>>
>>>> As Reka has pointed out there is a potential issue
>>>> in CloudControllerServiceImpl class. It seems like cloud controller is
>>>> retrieving its state from registry
>>>> in the CloudControllerServiceImpl constructor, and it's being invoked in
>>>> two places other than where it's expected to be:
>>>>
>>>> <Screen Shot 2014-06-25 at 10.36.07 PM.png>
>>>>
>>>> <Screen Shot 2014-06-25 at 10.14.01 PM.png>
>>>>
>>> This was a bug we identified recently; someone made this commit
>>> without properly analyzing the way CC is implemented. :-(
>>>
>>> AFAIK Reka has already filed a jira and is on her way to removing that
>>> broken logic.
>>>
>> I have fixed this issue in master and updated the jira (STRATOS-685). I
>> have removed the CloudControllerServiceImpl initialization that was used in
>> the cloud controller when publishing events to BAM and in instance
>> termination on behalf of the MemberReadyToShutdownEvent.
>>
>> The fix I made was to get the relevant cartridge information from
>> FasterLookupDataHolder when publishing events to BAM, instead of getting it
>> in the buggy way as before. Instance termination on behalf of
>> MemberReadyToShutdownEvent is now handled via the Autoscaler instead of CC
>> terminating the member itself. I think this is a good way, as the
>> Autoscaler is the one who requests to start or terminate the member in all
>> scenarios.
>>
>> Thanks,
>> Reka
>>
>> However the above logic does not retrieve the topology from registry. It
>>>> is being retrieved by Topology Manager:
>>>>
>>>> <Screen Shot 2014-06-25 at 10.45.36 PM.png>
>>>> Therefore the above issue may have very little effect on the problem
>>>> you have noticed. However I wonder whether we have an issue in Autoscaler
>>>> in refreshing its state once restarted.
>>>>
>>>>  Just to narrow down the cause of this issue, will you be able to list
>>>> down the actions that you carried out from the very beginning please? Then
>>>> we could try to re-produce this problem by going through them.
>>>>
>>>>
>>>> Many Thanks
>>>> Imesh
>>>>
>>>>
>>>> --
>>>> Imesh Gunaratne
>>>>
>>>> Technical Lead, WSO2
>>>> Committer & PPMC Member, Apache Stratos
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Nirmal
>>>
>>> Nirmal Fernando.
>>> PPMC Member & Committer of Apache Stratos,
>>> Senior Software Engineer, WSO2 Inc.
>>>
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>
>>
>>
>> --
>> Reka Thirunavukkarasu
>> Senior Software Engineer,
>> WSO2, Inc.:http://wso2.com,
>> Mobile: +94776442007
>>
>>
>>

Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Posted by "Michiel Blokzijl (mblokzij)" <mb...@cisco.com>.
Hi all,

Apologies for the radio silence since my initial email, I’ve been very busy.. :(

Thank you Reka for your detailed explanations, I now have a much better understanding of how it’s supposed to work!

I’m not actually using the BAM (yet)*, so STRATOS-685 shouldn’t affect me, right? Even if it doesn’t affect me I think the fix would still be nice to have in the 4.0.0 branch.

> Just to narrow down the cause of this issue, will you be able to list down the actions that you carried out from the very beginning please? Then we could try to re-produce this problem by going through them.

I’ve attached an annotated log of the steps I’ve taken to reproduce the issue.

I think there’s still an issue in this area, since I’m hitting it without using the BAM. I could try Reka’s suggestion of enabling CEP persistence, but given that restarting Stratos takes more than 1 min, I suspect the fault handler would think that ALL cartridges are inactive and kill them all. Does anyone know if this is the right documentation for setting up CEP snapshotting?
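
In case it helps the discussion, here is a tiny sketch of one alternative to relying on CEP snapshots: reconcile restored state against the IaaS on startup. All class and method names below are made up for illustration; this is not existing Stratos code:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Pure sketch of the reconciliation idea -- all names are made up, this is
// not existing Stratos code: on startup, diff the members restored from the
// registry against what the IaaS actually reports, instead of waiting for
// health events that a deleted instance will never send.
public class StartupReconciler {

    // Member ids that persisted state claims exist but the IaaS listing
    // (e.g. the output of 'nova list') no longer contains.
    static List<String> findMissing(List<String> restoredMembers,
                                    List<String> iaasInstances) {
        Set<String> alive = new HashSet<>(iaasInstances);
        List<String> missing = new ArrayList<>();
        for (String member : restoredMembers) {
            if (!alive.contains(member)) {
                missing.add(member);
            }
        }
        return missing;
    }
}
```

The missing members could then be marked faulty immediately, rather than sitting "active" forever.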

*: The <BamServerURL> is commented out in <stratos>/repository/conf/carbon.xml.

Best regards,

Michiel




Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Posted by Reka Thirunavukkarasu <re...@wso2.com>.
Hi

On Sun, Jun 29, 2014 at 9:28 PM, Lakmal Warusawithana <la...@wso2.com>
wrote:

> Hi Reka,
>
> We can double commit these into the 4.0.0 branch and master, and will do a
> 4.0.1 minor release with these fixes. I would also like to suggest some UX
> improvements for the 4.0.1 release. I had some offline discussions with
> several folks, and will send some suggestions on UX improvements, with the
> user stories, in a separate thread.
>

+1 for the 4.0.1 release with all the minor fixes and UI improvements. Then
we will commit the fixes to the 4.0.0 branch as well.

Thanks,
Reka

>
> thanks
>
>
> On Sun, Jun 29, 2014 at 9:05 PM, Reka Thirunavukkarasu <re...@wso2.com>
> wrote:
>
>> Hi Chris,
>>
>>
>> On Sat, Jun 28, 2014 at 11:54 AM, chris snow <ch...@gmail.com> wrote:
>>
>>> Hi Reka, will this fix also need to get applied to 4.0.0?
>>>
>>  Yah. As Isuru mentioned, we can apply it as a patch to 4.0.0. The issue
>> will be there only when you publish events to BAM from the cloud controller
>> and when you unsubscribe from an instance. I will create a patch from the
>> 4.0.0 branch with the fix and update the jira with the patch.
>>
>> Thanks,
>> Reka
>
>
> --
> Lakmal Warusawithana
> Vice President, Apache Stratos
> Director - Cloud Architecture; WSO2 Inc.
> Mobile : +94714289692
> Blog : http://lakmalsview.blogspot.com/
>
>


-- 
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007

Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Posted by Lakmal Warusawithana <la...@wso2.com>.
Hi Reka,

We can double commit these into the 4.0.0 branch and master, and will do a
4.0.1 minor release with these fixes. I would also like to suggest some UX
improvements for the 4.0.1 release. I had some offline discussions with
several folks, and will send some suggestions on UX improvements, with the
user stories, in a separate thread.

thanks


>>>>>> cisco-sample-vm | Active | 1                 | cisco-samp
>>>>>> le-vm.foo.cisco.com |
>>>>>>
>>>>>> nova list | grep samp
>>>>>> | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE |
>>>>>> None       | Running     | core=172.16.2.17, 10.86.205.231  |
>>>>>>
>>>>>> All good. Now I stop Stratos and ActiveMQ, 'nova delete’ the sample
>>>>>> cartridge, and then start ActiveMQ and Stratos again.
>>>>>>
>>>>>> Now, at first things look good..:
>>>>>>
>>>>>> ./stratos.sh list-subscribed-cartridges | grep samp
>>>>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>>>>> cisco-sample-vm | Inactive | 0                 |
>>>>>> cisco-sample-vm.foo.cisco.com |
>>>>>>
>>>>>> But then,
>>>>>>
>>>>>> root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh
>>>>>> list-subscribed-cartridges | grep samp
>>>>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>>>>> cisco-sample-vm | Active | 1                 | cisco-samp
>>>>>> le-vm.foo.cisco.com |
>>>>>>
>>>>>> # nova list | grep samp
>>>>>> #
>>>>>>
>>>>>> How did the cartridge become active without it actually being there?
>>>>>> As far as I can tell, Stratos never recovers from this.
>>>>>>
>>>>>> I found this bug here:
>>>>>> https://issues.apache.org/jira/browse/STRATOS-234 - is this
>>>>>> describing the issue I’m seeing? I was a little bit confused by the usage
>>>>>> of the word “obsolete”.
>>>>>>
>>>>>> *Where to go next?*
>>>>>> Now, I’ve done a little bit of digging, but I don’t yet have a full
>>>>>> mental model of how everything fits together in Stratos - please could
>>>>>> someone help me put the pieces together? :)
>>>>>>
>>>>>> What I’m seeing is the following:
>>>>>> - The cluster monitor appears to be active:
>>>>>>
>>>>>> TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG
>>>>>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor} -  Cluster monitor
>>>>>> is running.. Cluste
>>>>>> rMonitor [clusterId=cisco-sample-vm.cisco-sample-v,
>>>>>> serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1
>>>>>> [partitions] [org
>>>>>>
>>>>>> .apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0],
>>>>>> autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, de
>>>>>> scription=null], lbReferenceType=null]
>>>>>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor}
>>>>>>
>>>>>> - It looks like the CEP FaultHandlingWindowProcessor usually detects
>>>>>> inactive members. However, since this member was never active, the
>>>>>> timeStampMap doesn’t contain an element for this member, so it’s
>>>>>> never checked.
>>>>>> - I think the fault handling is triggered by a fault_message, but I
>>>>>> didn’t manage to figure out where it’s coming from. Does anyone know what
>>>>>> triggers it? (is it the CEP extension?)
>>>>>>
>>>>>> Anyway..
>>>>>>
>>>>>> *Questions*
>>>>>> - How should Stratos detect after some downtime which cartridges are
>>>>>> still there and which ones aren’t? (what was the intended design?)
>>>>>> - Why did the missing cartridge go “active”? Is this a result from
>>>>>> restoring persistent state? (If I look in the registry I can see stuff
>>>>>> under subscriptions/active, but not sure if that’s where it comes from)
>>>>>> - Who should be responsible for detecting the absence of an instance
>>>>>> - the ClusterMonitor? That seems to be fed incorrect data, since it clearly
>>>>>> thinks there are enough instances running. Which component has the
>>>>>> necessary data?
>>>>>> - It looks like it’s possible to snapshot CEP state
>>>>>> <http://stackoverflow.com/questions/20348326/wso2-cep-siddhi-how-to-make-time-windows-persistent> to
>>>>>> make it semi-persistent. However, if I restarted Stratos after 2min
>>>>>> downtime, wouldn’t it try to kill all the nodes since the last reply was
>>>>>> more than 60s ago? Also, snapshots would be periodic, so there’s still a
>>>>>> window in which cartridges might “disappear".
>>>>>>
>>>>>> Thanks a lot and best regards!
>>>>>>
>>>>>> Michiel
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Imesh Gunaratne
>>>>>
>>>>> Technical Lead, WSO2
>>>>> Committer & PPMC Member, Apache Stratos
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Best Regards,
>>>> Nirmal
>>>>
>>>> Nirmal Fernando.
>>>> PPMC Member & Committer of Apache Stratos,
>>>> Senior Software Engineer, WSO2 Inc.
>>>>
>>>> Blog: http://nirmalfdo.blogspot.com/
>>>>
>>>
>>>
>>>
>>> --
>>> Reka Thirunavukkarasu
>>> Senior Software Engineer,
>>> WSO2, Inc.:http://wso2.com,
>>> Mobile: +94776442007
>>>
>>>
>>>
>
>
> --
> Reka Thirunavukkarasu
> Senior Software Engineer,
> WSO2, Inc.:http://wso2.com,
> Mobile: +94776442007
>
>
>


-- 
Lakmal Warusawithana
Vice President, Apache Stratos
Director - Cloud Architecture; WSO2 Inc.
Mobile : +94714289692
Blog : http://lakmalsview.blogspot.com/

Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Posted by Reka Thirunavukkarasu <re...@wso2.com>.
Hi Cris,


On Sat, Jun 28, 2014 at 11:54 AM, chris snow <ch...@gmail.com> wrote:

> Hi Reka, will this fix also need to get applied to 4.0.0?
>
Yes. As Isuru mentioned, we can apply it as a patch to 4.0.0. The issue
only occurs when you publish events to BAM from the cloud controller and
when you unsubscribe from an instance. I will create a patch from the 4.0.0
branch with the fix and update the jira with it.

Thanks,
Reka



>  On 26 Jun 2014 06:43, "Reka Thirunavukkarasu" <re...@wso2.com> wrote:
>
>> Hi all,
>>
>> On Wed, Jun 25, 2014 at 11:44 PM, Nirmal Fernando <nirmal070125@gmail.com
>> > wrote:
>>
>>>
>>> On Wed, Jun 25, 2014 at 10:51 PM, Imesh Gunaratne <im...@apache.org>
>>> wrote:
>>>
>>>> Hi Michiel,
>>>>
>>>> As Reka has pointed out there is a potential issue
>>>> in CloudControllerServiceImpl class. It seems like cloud controller is
>>>> retrieving its state from registry
>>>> in CloudControllerServiceImpl constructor and it's being invoked in two
>>>> other places than it's expected to:
>>>>
>>>>
>>>> ​
>>>>
>>>>
>>>>
>>>>
>>> This was a bug, we identified recently and someone has made this commit
>>> without properly analyzing the way CC has implemented. :-(
>>>
>>> AFAIK Reka has already filed a jira and on her way to remove that broken
>>> logic.
>>>
>> I have fixed this issue in master and updated the jira (STRATOS-685).  I
>> have removed CloudControllerServiceImpl initialization which used in cloud
>> controller when publishing events to BAM and in the instance termination on
>> behalf of the MemberReadyToShutdownEvent.
>>
>> The fix that i did was to get the relevant cartridge information from
>> FasterLookupDataHolder when publishing events to BAM instead of getting it
>> from buggy way as earlier. Handled the instance termination via Autoscaler
>> on behalf of MemberReadyToShutdownEvent instead of CC itself terminates the
>> member. I think that this would be good  way as autoscaler is the one who
>> requests to start or terminate the member in all scenarios.
>>
>> Thanks,
>> Reka
>>
>> However the above logic does not retrieve the topology from registry. It
>>>> is being retrieved by Topology Manager:
>>>>
>>>>
>>>> ​
>>>> Therefore the above issue may have very little affect on the problem
>>>> you have noticed. However I wonder whether we have an issue in Autoscaler
>>>> in refreshing its state once restarted.
>>>>
>>>>  Just to narrow down the cause of this issue, will you be able to list
>>>> down the actions that you carried out from the very beginning please? Then
>>>> we could try to re-produce this problem by going through them.
>>>>
>>>>
>>>> Many Thanks
>>>> Imesh
>>>>
>>>>
>>>> On Mon, Jun 23, 2014 at 9:53 PM, Michiel Blokzijl (mblokzij) <
>>>> mblokzij@cisco.com> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> Basically, I was stopping and starting Stratos and looking at how it
>>>>> handled dying cartridges, and found that Stratos only detected cartridge
>>>>> deaths while it was running..
>>>>>
>>>>> *The problem*
>>>>> In steady state, I have some cartridges managed by Stratos,
>>>>>
>>>>> ./stratos.sh list-subscribed-cartridges | grep samp
>>>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>>>> cisco-sample-vm | Active | 1                 | cisco-samp
>>>>> le-vm.foo.cisco.com |
>>>>>
>>>>> nova list | grep samp
>>>>> | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE |
>>>>> None       | Running     | core=172.16.2.17, 10.86.205.231  |
>>>>>
>>>>> All good. Now I stop Stratos and ActiveMQ, 'nova delete’ the sample
>>>>> cartridge, and then start ActiveMQ and Stratos again.
>>>>>
>>>>> Now, at first things look good..:
>>>>>
>>>>> ./stratos.sh list-subscribed-cartridges | grep samp
>>>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>>>> cisco-sample-vm | Inactive | 0                 |
>>>>> cisco-sample-vm.foo.cisco.com |
>>>>>
>>>>> But then,
>>>>>
>>>>> root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh
>>>>> list-subscribed-cartridges | grep samp
>>>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>>>> cisco-sample-vm | Active | 1                 | cisco-samp
>>>>> le-vm.foo.cisco.com |
>>>>>
>>>>> # nova list | grep samp
>>>>> #
>>>>>
>>>>> How did the cartridge become active without it actually being there?
>>>>> As far as I can tell, Stratos never recovers from this.
>>>>>
>>>>> I found this bug here:
>>>>> https://issues.apache.org/jira/browse/STRATOS-234 - is this
>>>>> describing the issue I’m seeing? I was a little bit confused by the usage
>>>>> of the word “obsolete”.
>>>>>
>>>>> *Where to go next?*
>>>>> Now, I’ve done a little bit of digging, but I don’t yet have a full
>>>>> mental model of how everything fits together in Stratos - please could
>>>>> someone help me put the pieces together? :)
>>>>>
>>>>> What I’m seeing is the following:
>>>>> - The cluster monitor appears to be active:
>>>>>
>>>>> TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG
>>>>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor} -  Cluster monitor
>>>>> is running.. Cluste
>>>>> rMonitor [clusterId=cisco-sample-vm.cisco-sample-v,
>>>>> serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1
>>>>> [partitions] [org
>>>>>
>>>>> .apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0],
>>>>> autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, de
>>>>> scription=null], lbReferenceType=null]
>>>>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor}
>>>>>
>>>>> - It looks like the CEP FaultHandlingWindowProcessor usually detects
>>>>> inactive members. However, since this member was never active, the
>>>>> timeStampMap doesn’t contain an element for this member, so it’s
>>>>> never checked.
>>>>> - I think the fault handling is triggered by a fault_message, but I
>>>>> didn’t manage to figure out where it’s coming from. Does anyone know what
>>>>> triggers it? (is it the CEP extension?)
>>>>>
>>>>> Anyway..
>>>>>
>>>>> *Questions*
>>>>> - How should Stratos detect after some downtime which cartridges are
>>>>> still there and which ones aren’t? (what was the intended design?)
>>>>> - Why did the missing cartridge go “active”? Is this a result from
>>>>> restoring persistent state? (If I look in the registry I can see stuff
>>>>> under subscriptions/active, but not sure if that’s where it comes from)
>>>>> - Who should be responsible for detecting the absence of an instance -
>>>>> the ClusterMonitor? That seems to be fed incorrect data, since it clearly
>>>>> thinks there are enough instances running. Which component has the
>>>>> necessary data?
>>>>> - It looks like it’s possible to snapshot CEP state
>>>>> <http://stackoverflow.com/questions/20348326/wso2-cep-siddhi-how-to-make-time-windows-persistent> to
>>>>> make it semi-persistent. However, if I restarted Stratos after 2min
>>>>> downtime, wouldn’t it try to kill all the nodes since the last reply was
>>>>> more than 60s ago? Also, snapshots would be periodic, so there’s still a
>>>>> window in which cartridges might “disappear".
>>>>>
>>>>> Thanks a lot and best regards!
>>>>>
>>>>> Michiel
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Imesh Gunaratne
>>>>
>>>> Technical Lead, WSO2
>>>> Committer & PPMC Member, Apache Stratos
>>>>
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Nirmal
>>>
>>> Nirmal Fernando.
>>> PPMC Member & Committer of Apache Stratos,
>>> Senior Software Engineer, WSO2 Inc.
>>>
>>> Blog: http://nirmalfdo.blogspot.com/
>>>
>>
>>
>>
>> --
>> Reka Thirunavukkarasu
>> Senior Software Engineer,
>> WSO2, Inc.:http://wso2.com,
>> Mobile: +94776442007
>>
>>
>>


-- 
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007

Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Posted by chris snow <ch...@gmail.com>.
Hi Reka, will this fix also need to be applied to 4.0.0?
On 26 Jun 2014 06:43, "Reka Thirunavukkarasu" <re...@wso2.com> wrote:

> Hi all,
>
> On Wed, Jun 25, 2014 at 11:44 PM, Nirmal Fernando <ni...@gmail.com>
> wrote:
>
>>
>> On Wed, Jun 25, 2014 at 10:51 PM, Imesh Gunaratne <im...@apache.org>
>> wrote:
>>
>>> Hi Michiel,
>>>
>>> As Reka has pointed out there is a potential issue
>>> in CloudControllerServiceImpl class. It seems like cloud controller is
>>> retrieving its state from registry
>>> in CloudControllerServiceImpl constructor and it's being invoked in two
>>> other places than it's expected to:
>>>
>>>
>>> ​
>>>
>>>
>>>
>>>
>> This was a bug, we identified recently and someone has made this commit
>> without properly analyzing the way CC has implemented. :-(
>>
>> AFAIK Reka has already filed a jira and on her way to remove that broken
>> logic.
>>
> I have fixed this issue in master and updated the jira (STRATOS-685).  I
> have removed CloudControllerServiceImpl initialization which used in cloud
> controller when publishing events to BAM and in the instance termination on
> behalf of the MemberReadyToShutdownEvent.
>
> The fix that i did was to get the relevant cartridge information from
> FasterLookupDataHolder when publishing events to BAM instead of getting it
> from buggy way as earlier. Handled the instance termination via Autoscaler
> on behalf of MemberReadyToShutdownEvent instead of CC itself terminates the
> member. I think that this would be good  way as autoscaler is the one who
> requests to start or terminate the member in all scenarios.
>
> Thanks,
> Reka
>
> However the above logic does not retrieve the topology from registry. It
>>> is being retrieved by Topology Manager:
>>>
>>>
>>> ​
>>> Therefore the above issue may have very little affect on the problem you
>>> have noticed. However I wonder whether we have an issue in Autoscaler in
>>> refreshing its state once restarted.
>>>
>>>  Just to narrow down the cause of this issue, will you be able to list
>>> down the actions that you carried out from the very beginning please? Then
>>> we could try to re-produce this problem by going through them.
>>>
>>>
>>> Many Thanks
>>> Imesh
>>>
>>>
>>> On Mon, Jun 23, 2014 at 9:53 PM, Michiel Blokzijl (mblokzij) <
>>> mblokzij@cisco.com> wrote:
>>>
>>>> Hi all,
>>>>
>>>> Basically, I was stopping and starting Stratos and looking at how it
>>>> handled dying cartridges, and found that Stratos only detected cartridge
>>>> deaths while it was running..
>>>>
>>>> *The problem*
>>>> In steady state, I have some cartridges managed by Stratos,
>>>>
>>>> ./stratos.sh list-subscribed-cartridges | grep samp
>>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>>> cisco-sample-vm | Active | 1                 | cisco-samp
>>>> le-vm.foo.cisco.com |
>>>>
>>>> nova list | grep samp
>>>> | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE |
>>>> None       | Running     | core=172.16.2.17, 10.86.205.231  |
>>>>
>>>> All good. Now I stop Stratos and ActiveMQ, 'nova delete’ the sample
>>>> cartridge, and then start ActiveMQ and Stratos again.
>>>>
>>>> Now, at first things look good..:
>>>>
>>>> ./stratos.sh list-subscribed-cartridges | grep samp
>>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>>> cisco-sample-vm | Inactive | 0                 |
>>>> cisco-sample-vm.foo.cisco.com |
>>>>
>>>> But then,
>>>>
>>>> root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh
>>>> list-subscribed-cartridges | grep samp
>>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>>> cisco-sample-vm | Active | 1                 | cisco-samp
>>>> le-vm.foo.cisco.com |
>>>>
>>>> # nova list | grep samp
>>>> #
>>>>
>>>> How did the cartridge become active without it actually being there? As
>>>> far as I can tell, Stratos never recovers from this.
>>>>
>>>> I found this bug here:
>>>> https://issues.apache.org/jira/browse/STRATOS-234 - is this describing
>>>> the issue I’m seeing? I was a little bit confused by the usage of the word
>>>> “obsolete”.
>>>>
>>>> *Where to go next?*
>>>> Now, I’ve done a little bit of digging, but I don’t yet have a full
>>>> mental model of how everything fits together in Stratos - please could
>>>> someone help me put the pieces together? :)
>>>>
>>>> What I’m seeing is the following:
>>>> - The cluster monitor appears to be active:
>>>>
>>>> TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG
>>>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor} -  Cluster monitor
>>>> is running.. Cluste
>>>> rMonitor [clusterId=cisco-sample-vm.cisco-sample-v,
>>>> serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1
>>>> [partitions] [org
>>>>
>>>> .apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0],
>>>> autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, de
>>>> scription=null], lbReferenceType=null]
>>>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor}
>>>>
>>>> - It looks like the CEP FaultHandlingWindowProcessor usually detects
>>>> inactive members. However, since this member was never active, the
>>>> timeStampMap doesn’t contain an element for this member, so it’s never
>>>> checked.
>>>> - I think the fault handling is triggered by a fault_message, but I
>>>> didn’t manage to figure out where it’s coming from. Does anyone know what
>>>> triggers it? (is it the CEP extension?)
>>>>
>>>> Anyway..
>>>>
>>>> *Questions*
>>>> - How should Stratos detect after some downtime which cartridges are
>>>> still there and which ones aren’t? (what was the intended design?)
>>>> - Why did the missing cartridge go “active”? Is this a result from
>>>> restoring persistent state? (If I look in the registry I can see stuff
>>>> under subscriptions/active, but not sure if that’s where it comes from)
>>>> - Who should be responsible for detecting the absence of an instance -
>>>> the ClusterMonitor? That seems to be fed incorrect data, since it clearly
>>>> thinks there are enough instances running. Which component has the
>>>> necessary data?
>>>> - It looks like it’s possible to snapshot CEP state
>>>> <http://stackoverflow.com/questions/20348326/wso2-cep-siddhi-how-to-make-time-windows-persistent> to
>>>> make it semi-persistent. However, if I restarted Stratos after 2min
>>>> downtime, wouldn’t it try to kill all the nodes since the last reply was
>>>> more than 60s ago? Also, snapshots would be periodic, so there’s still a
>>>> window in which cartridges might “disappear".
>>>>
>>>> Thanks a lot and best regards!
>>>>
>>>> Michiel
>>>>
>>>
>>>
>>>
>>> --
>>> Imesh Gunaratne
>>>
>>> Technical Lead, WSO2
>>> Committer & PPMC Member, Apache Stratos
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Nirmal
>>
>> Nirmal Fernando.
>> PPMC Member & Committer of Apache Stratos,
>> Senior Software Engineer, WSO2 Inc.
>>
>> Blog: http://nirmalfdo.blogspot.com/
>>
>
>
>
> --
> Reka Thirunavukkarasu
> Senior Software Engineer,
> WSO2, Inc.:http://wso2.com,
> Mobile: +94776442007
>
>
>

Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Posted by Reka Thirunavukkarasu <re...@wso2.com>.
Hi all,

On Wed, Jun 25, 2014 at 11:44 PM, Nirmal Fernando <ni...@gmail.com>
wrote:

>
> On Wed, Jun 25, 2014 at 10:51 PM, Imesh Gunaratne <im...@apache.org>
> wrote:
>
>> Hi Michiel,
>>
>> As Reka has pointed out there is a potential issue
>> in CloudControllerServiceImpl class. It seems like cloud controller is
>> retrieving its state from registry
>> in CloudControllerServiceImpl constructor and it's being invoked in two
>> other places than it's expected to:
>>
>>
>> ​
>>
>>
>>
>>
> This was a bug, we identified recently and someone has made this commit
> without properly analyzing the way CC has implemented. :-(
>
> AFAIK Reka has already filed a jira and on her way to remove that broken
> logic.
>
I have fixed this issue in master and updated the jira (STRATOS-685). I
have removed the CloudControllerServiceImpl initialization that was used in
the cloud controller when publishing events to BAM and when terminating
instances on behalf of the MemberReadyToShutdownEvent.

The fix gets the relevant cartridge information from FasterLookupDataHolder
when publishing events to BAM, instead of the earlier, buggy approach.
Instance termination on behalf of MemberReadyToShutdownEvent is now handled
via the Autoscaler instead of CC terminating the member itself. I think
this is the better way, as the Autoscaler is the component that requests
member start and termination in all scenarios.
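For anyone following along, here is a rough sketch of the pattern behind the fix. The class and method names below are simplified placeholders, not the actual Stratos API: the point is that event publishers read the shared in-memory data holder rather than constructing a new service instance (which would re-read persisted state from the registry).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified stand-in for FasterLookupDataHolder: a process-wide,
// in-memory view of cartridge metadata kept current by the cloud
// controller at runtime. Names here are illustrative only.
class InMemoryDataHolder {
    private static final InMemoryDataHolder INSTANCE = new InMemoryDataHolder();
    private final Map<String, String> cartridges = new ConcurrentHashMap<>();

    private InMemoryDataHolder() {}

    static InMemoryDataHolder getInstance() {
        return INSTANCE;
    }

    void addCartridge(String type, String info) {
        cartridges.put(type, info);
    }

    String getCartridge(String type) {
        return cartridges.get(type);
    }
}

public class BamPublisherSketch {
    // Fixed pattern: look up the live in-memory state instead of
    // re-initializing the service (and re-reading the registry).
    static String lookupCartridgeInfo(String cartridgeType) {
        return InMemoryDataHolder.getInstance().getCartridge(cartridgeType);
    }

    public static void main(String[] args) {
        InMemoryDataHolder.getInstance().addCartridge("cisco-sample-vm", "single-tenant");
        System.out.println(lookupCartridgeInfo("cisco-sample-vm"));
    }
}
```

The same reasoning applies to termination: routing it through the Autoscaler keeps a single component responsible for member start/terminate requests, so no second code path needs its own copy of the state.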

Thanks,
Reka

However the above logic does not retrieve the topology from registry. It is
>> being retrieved by Topology Manager:
>>
>>
>> ​
>> Therefore the above issue may have very little affect on the problem you
>> have noticed. However I wonder whether we have an issue in Autoscaler in
>> refreshing its state once restarted.
>>
>>  Just to narrow down the cause of this issue, will you be able to list
>> down the actions that you carried out from the very beginning please? Then
>> we could try to re-produce this problem by going through them.
>>
>>
>> Many Thanks
>> Imesh
>>
>>
>> On Mon, Jun 23, 2014 at 9:53 PM, Michiel Blokzijl (mblokzij) <
>> mblokzij@cisco.com> wrote:
>>
>>> Hi all,
>>>
>>> Basically, I was stopping and starting Stratos and looking at how it
>>> handled dying cartridges, and found that Stratos only detected cartridge
>>> deaths while it was running..
>>>
>>> *The problem*
>>> In steady state, I have some cartridges managed by Stratos,
>>>
>>> ./stratos.sh list-subscribed-cartridges | grep samp
>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>> cisco-sample-vm | Active | 1                 | cisco-samp
>>> le-vm.foo.cisco.com |
>>>
>>> nova list | grep samp
>>> | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE |
>>> None       | Running     | core=172.16.2.17, 10.86.205.231  |
>>>
>>> All good. Now I stop Stratos and ActiveMQ, 'nova delete’ the sample
>>> cartridge, and then start ActiveMQ and Stratos again.
>>>
>>> Now, at first things look good..:
>>>
>>> ./stratos.sh list-subscribed-cartridges | grep samp
>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>> cisco-sample-vm | Inactive | 0                 |
>>> cisco-sample-vm.foo.cisco.com |
>>>
>>> But then,
>>>
>>> root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh
>>> list-subscribed-cartridges | grep samp
>>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>>> cisco-sample-vm | Active | 1                 | cisco-samp
>>> le-vm.foo.cisco.com |
>>>
>>> # nova list | grep samp
>>> #
>>>
>>> How did the cartridge become active without it actually being there? As
>>> far as I can tell, Stratos never recovers from this.
>>>
>>> I found this bug here: https://issues.apache.org/jira/browse/STRATOS-234 -
>>> is this describing the issue I’m seeing? I was a little bit confused by the
>>> usage of the word “obsolete”.
>>>
>>> *Where to go next?*
>>> Now, I’ve done a little bit of digging, but I don’t yet have a full
>>> mental model of how everything fits together in Stratos - please could
>>> someone help me put the pieces together? :)
>>>
>>> What I’m seeing is the following:
>>> - The cluster monitor appears to be active:
>>>
>>> TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG
>>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor} -  Cluster monitor
>>> is running.. Cluste
>>> rMonitor [clusterId=cisco-sample-vm.cisco-sample-v,
>>> serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1
>>> [partitions] [org
>>>
>>> .apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0],
>>> autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, de
>>> scription=null], lbReferenceType=null]
>>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor}
>>>
>>> - It looks like the CEP FaultHandlingWindowProcessor usually detects
>>> inactive members. However, since this member was never active, the
>>> timeStampMap doesn’t contain an element for this member, so it’s never
>>> checked.
>>> - I think the fault handling is triggered by a fault_message, but I
>>> didn’t manage to figure out where it’s coming from. Does anyone know what
>>> triggers it? (is it the CEP extension?)
>>>
>>> Anyway..
>>>
>>> *Questions*
>>> - How should Stratos detect after some downtime which cartridges are
>>> still there and which ones aren’t? (what was the intended design?)
>>> - Why did the missing cartridge go “active”? Is this a result from
>>> restoring persistent state? (If I look in the registry I can see stuff
>>> under subscriptions/active, but not sure if that’s where it comes from)
>>> - Who should be responsible for detecting the absence of an instance -
>>> the ClusterMonitor? That seems to be fed incorrect data, since it clearly
>>> thinks there are enough instances running. Which component has the
>>> necessary data?
>>> - It looks like it’s possible to snapshot CEP state
>>> <http://stackoverflow.com/questions/20348326/wso2-cep-siddhi-how-to-make-time-windows-persistent> to
>>> make it semi-persistent. However, if I restarted Stratos after 2min
>>> downtime, wouldn’t it try to kill all the nodes since the last reply was
>>> more than 60s ago? Also, snapshots would be periodic, so there’s still a
>>> window in which cartridges might “disappear".
>>>
>>> Thanks a lot and best regards!
>>>
>>> Michiel
>>>
>>
>>
>>
>> --
>> Imesh Gunaratne
>>
>> Technical Lead, WSO2
>> Committer & PPMC Member, Apache Stratos
>>
>
>
>
> --
> Best Regards,
> Nirmal
>
> Nirmal Fernando.
> PPMC Member & Committer of Apache Stratos,
> Senior Software Engineer, WSO2 Inc.
>
> Blog: http://nirmalfdo.blogspot.com/
>



-- 
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007

Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Posted by Nirmal Fernando <ni...@gmail.com>.
On Wed, Jun 25, 2014 at 10:51 PM, Imesh Gunaratne <im...@apache.org> wrote:

> Hi Michiel,
>
> As Reka has pointed out there is a potential issue
> in CloudControllerServiceImpl class. It seems like cloud controller is
> retrieving its state from registry
> in CloudControllerServiceImpl constructor and it's being invoked in two
> other places than it's expected to:
>
>
> ​
>
>
>
>
This was a bug we identified recently; someone made this commit without
properly analyzing the way CC is implemented. :-(

AFAIK Reka has already filed a jira and is on her way to removing that
broken logic.
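To illustrate why that pattern is broken (with made-up names — this is not the actual CC code): if a service's constructor rehydrates state from a persisted snapshot, then every extra call site that constructs the service resets the in-memory view to whatever the registry last saved, silently discarding newer runtime state.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the bug: a service whose constructor reloads state
// from a persisted snapshot. All names are illustrative placeholders.
public class StaleStateSketch {
    // Stands in for the registry: a snapshot persisted at some earlier time.
    static final Map<String, String> persistedSnapshot = new HashMap<>();

    static class ControllerService {
        final Map<String, String> memberStates = new HashMap<>();

        // BUG pattern: every construction reloads the (possibly stale) snapshot.
        ControllerService() {
            memberStates.putAll(persistedSnapshot);
        }
    }

    public static void main(String[] args) {
        persistedSnapshot.put("member-1", "Active");

        ControllerService service = new ControllerService();
        // At runtime we learn the member is actually gone...
        service.memberStates.put("member-1", "Terminated");

        // ...but an unrelated code path constructs the service again
        // (e.g. to publish an event) and sees the stale "Active" state.
        ControllerService another = new ControllerService();
        System.out.println(another.memberStates.get("member-1")); // prints "Active"
    }
}
```

This is consistent with the symptom Michiel reported: a terminated member showing up as Active again, because a stale persisted view won over the live one.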



> However the above logic does not retrieve the topology from registry. It
> is being retrieved by Topology Manager:
>
>
> ​
> Therefore the above issue may have very little affect on the problem you
> have noticed. However I wonder whether we have an issue in Autoscaler in
> refreshing its state once restarted.
>
>  Just to narrow down the cause of this issue, will you be able to list
> down the actions that you carried out from the very beginning please? Then
> we could try to re-produce this problem by going through them.
>
>
> Many Thanks
> Imesh
>
>
> On Mon, Jun 23, 2014 at 9:53 PM, Michiel Blokzijl (mblokzij) <
> mblokzij@cisco.com> wrote:
>
>> Hi all,
>>
>> Basically, I was stopping and starting Stratos and looking at how it
>> handled dying cartridges, and found that Stratos only detected cartridge
>> deaths while it was running..
>>
>> *The problem*
>> In steady state, I have some cartridges managed by Stratos,
>>
>> ./stratos.sh list-subscribed-cartridges | grep samp
>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>> cisco-sample-vm | Active | 1                 | cisco-samp
>> le-vm.foo.cisco.com |
>>
>> nova list | grep samp
>> | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE | None
>>     | Running     | core=172.16.2.17, 10.86.205.231  |
>>
>> All good. Now I stop Stratos and ActiveMQ, 'nova delete’ the sample
>> cartridge, and then start ActiveMQ and Stratos again.
>>
>> Now, at first things look good..:
>>
>> ./stratos.sh list-subscribed-cartridges | grep samp
>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>> cisco-sample-vm | Inactive | 0                 |
>> cisco-sample-vm.foo.cisco.com |
>>
>> But then,
>>
>> root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh
>> list-subscribed-cartridges | grep samp
>> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
>> cisco-sample-vm | Active | 1                 | cisco-samp
>> le-vm.foo.cisco.com |
>>
>> # nova list | grep samp
>> #
>>
>> How did the cartridge become active without it actually being there? As
>> far as I can tell, Stratos never recovers from this.
>>
>> I found this bug here: https://issues.apache.org/jira/browse/STRATOS-234 -
>> is this describing the issue I’m seeing? I was a little bit confused by the
>> usage of the word “obsolete”.
>>
>> *Where to go next?*
>> Now, I’ve done a little bit of digging, but I don’t yet have a full
>> mental model of how everything fits together in Stratos - please could
>> someone help me put the pieces together? :)
>>
>> What I’m seeing is the following:
>> - The cluster monitor appears to be active:
>>
>> TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG
>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor} -  Cluster monitor
>> is running.. Cluste
>> rMonitor [clusterId=cisco-sample-vm.cisco-sample-v,
>> serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1
>> [partitions] [org
>>
>> .apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0],
>> autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, de
>> scription=null], lbReferenceType=null]
>> {org.apache.stratos.autoscaler.monitor.ClusterMonitor}
>>
>> - It looks like the CEP FaultHandlingWindowProcessor usually detects
>> inactive members. However, since this member was never active, the
>> timeStampMap doesn’t contain an element for this member, so it’s never
>> checked.
>> - I think the fault handling is triggered by a fault_message, but I
>> didn’t manage to figure out where it’s coming from. Does anyone know what
>> triggers it? (is it the CEP extension?)
>>
>> Anyway..
>>
>> *Questions*
>> - How should Stratos detect after some downtime which cartridges are
>> still there and which ones aren’t? (what was the intended design?)
>> - Why did the missing cartridge go “active”? Is this a result from
>> restoring persistent state? (If I look in the registry I can see stuff
>> under subscriptions/active, but not sure if that’s where it comes from)
>> - Who should be responsible for detecting the absence of an instance -
>> the ClusterMonitor? That seems to be fed incorrect data, since it clearly
>> thinks there are enough instances running. Which component has the
>> necessary data?
>> - It looks like it’s possible to snapshot CEP state
>> <http://stackoverflow.com/questions/20348326/wso2-cep-siddhi-how-to-make-time-windows-persistent> to
>> make it semi-persistent. However, if I restarted Stratos after 2min
>> downtime, wouldn’t it try to kill all the nodes since the last reply was
>> more than 60s ago? Also, snapshots would be periodic, so there’s still a
>> window in which cartridges might “disappear".
>>
>> Thanks a lot and best regards!
>>
>> Michiel
>>
>
>
>
> --
> Imesh Gunaratne
>
> Technical Lead, WSO2
> Committer & PPMC Member, Apache Stratos
>



-- 
Best Regards,
Nirmal

Nirmal Fernando.
PPMC Member & Committer of Apache Stratos,
Senior Software Engineer, WSO2 Inc.

Blog: http://nirmalfdo.blogspot.com/

Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Posted by Imesh Gunaratne <im...@apache.org>.
Hi Michiel,

As Reka has pointed out, there is a potential issue in the
CloudControllerServiceImpl class. It seems that the cloud controller
retrieves its state from the registry in the CloudControllerServiceImpl
constructor, and that the constructor is being invoked in two places other
than where it is expected:

[inline screenshot omitted in the plain-text archive]

However, the above logic does not retrieve the topology from the registry;
the topology is retrieved by the Topology Manager:

[inline screenshot omitted in the plain-text archive]

Therefore the above issue may have very little effect on the problem you
have noticed. However, I wonder whether we have an issue in the Autoscaler
in refreshing its state once restarted.

Just to narrow down the cause of this issue, would you be able to list the
actions that you carried out from the very beginning? Then we can try to
reproduce this problem by going through them.


Many Thanks
Imesh


-- 
Imesh Gunaratne

Technical Lead, WSO2
Committer & PPMC Member, Apache Stratos

Re: stop stratos, kill cartridge instance, start stratos -> cartridge still active?

Posted by Reka Thirunavukkarasu <re...@wso2.com>.
Hi Michiel,

Thanks for bringing this up. Please find my comments inline.

On Mon, Jun 23, 2014 at 9:53 PM, Michiel Blokzijl (mblokzij) <
mblokzij@cisco.com> wrote:

> Hi all,
>
> Basically, I was stopping and starting Stratos and looking at how it
> handled dying cartridges, and found that Stratos only detected cartridge
> deaths while it was running..
>
> *The problem*
> In steady state, I have some cartridges managed by Stratos,
>
> ./stratos.sh list-subscribed-cartridges | grep samp
> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
> cisco-sample-vm | Active | 1                 | cisco-samp
> le-vm.foo.cisco.com |
>
> nova list | grep samp
> | 2b50cc6c-37b1-42d4-9277-e7b624d8b957 | cisco-samp-4a8 | ACTIVE | None
>     | Running     | core=172.16.2.17, 10.86.205.231  |
>
> All good. Now I stop Stratos and ActiveMQ, 'nova delete’ the sample
> cartridge, and then start ActiveMQ and Stratos again.
>
> Now, at first things look good..:
>
> ./stratos.sh list-subscribed-cartridges | grep samp
> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
> cisco-sample-vm | Inactive | 0                 |
> cisco-sample-vm.foo.cisco.com |
>
> But then,
>
> root@octl-01:/opt/wso2/apache-stratos-cli-4.0.0# ./stratos.sh
> list-subscribed-cartridges | grep samp
> | cisco-sample-vm | cisco-sample-vm | 1       | Single-Tenant |
> cisco-sample-vm | Active | 1                 | cisco-samp
> le-vm.foo.cisco.com |
>
> # nova list | grep samp
> #
>
> How did the cartridge become active without it actually being there? As
> far as I can tell, Stratos never recovers from this.
>

Let me explain the behavior here. On restart, Stratos recovers its data
from the registry; until that recovery completes, you won't be able to see
the list of subscribed cartridges. AFAIK, after Stratos has loaded the data
from the registry, the in-memory data model is updated only by Topology
events. We periodically update the registry from the in-memory model in
order to persist the data, so this state change could have been written by
the cartridge agent. If it wasn't, we will have to debug this further. I'm
not sure whether https://issues.apache.org/jira/browse/STRATOS-685 has any
impact here. If you can reproduce this consistently, can you create a jira
for it so that we can work on it?
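The recovery flow described above can be sketched roughly as below. This is
an illustrative model only, not the actual CloudControllerServiceImpl or
Topology Manager code; all the names here (RecoverySketch,
restoreFromRegistry, onTopologyEvent) are assumptions made for the sake of
the example.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative model of the recovery flow described above (NOT the real
// Stratos code): state is restored from the registry once at startup,
// mutated only by topology events afterwards, and periodically persisted.
public class RecoverySketch {
    private final Map<String, String> memberStatus = new HashMap<>();

    // At startup: restore the last persisted snapshot from the registry.
    public void restoreFromRegistry(Map<String, String> persisted) {
        memberStatus.putAll(persisted);
    }

    // Afterwards: only topology events update the in-memory model. If a VM
    // was deleted while Stratos was down, no event ever arrives for it, so
    // its restored "Active" status survives -- the behavior observed here.
    public void onTopologyEvent(String memberId, String status) {
        memberStatus.put(memberId, status);
    }

    // Periodically: persist the in-memory model back to the registry.
    public Map<String, String> snapshotForRegistry() {
        return new HashMap<>(memberStatus);
    }

    public static void main(String[] args) {
        RecoverySketch cc = new RecoverySketch();
        cc.restoreFromRegistry(Map.of("cisco-samp-4a8", "Active"));
        // The VM was nova-deleted while Stratos was down: no topology event
        // arrives, so the stale "Active" entry is persisted again.
        System.out.println(cc.snapshotForRegistry().get("cisco-samp-4a8"));
    }
}
```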

>
> I found this bug here: https://issues.apache.org/jira/browse/STRATOS-234 -
> is this describing the issue I’m seeing? I was a little bit confused by the
> usage of the word “obsolete”.
>
I think Lahiru can add more here.

>
> *Where to go next?*
> Now, I’ve done a little bit of digging, but I don’t yet have a full mental
> model of how everything fits together in Stratos - please could someone
> help me put the pieces together? :)
>
> What I’m seeing is the following:
> - The cluster monitor appears to be active:
>
> TID: [0] [STRATOS] [2014-06-23 10:12:39,994] DEBUG
> {org.apache.stratos.autoscaler.monitor.ClusterMonitor} -  Cluster monitor
> is running.. Cluste
> rMonitor [clusterId=cisco-sample-vm.cisco-sample-v,
> serviceId=cisco-sample-vm, deploymentPolicy=Deployment Policy [id]static-1
> [partitions] [org
>
> .apache.stratos.cloud.controller.stub.deployment.partition.Partition@48cf06c0],
> autoscalePolicy=ASPolicy [id=economyPolicy, displayName=null, de
> scription=null], lbReferenceType=null]
> {org.apache.stratos.autoscaler.monitor.ClusterMonitor}
>
> - It looks like the CEP FaultHandlingWindowProcessor usually detects
> inactive members. However, since this member was never active, the
> timeStampMap doesn’t contain an element for this member, so it’s never
> checked.
>
For CEP's FaultHandlingWindowProcessor (a custom window used by the
GradientOfHealthRequest execution plan) to detect an unhealthy member, that
member must first become active. Only after that do the cartridge agent's
events start flowing to CEP, which keeps track of them and triggers the
execution plan (GradientOfHealthRequest). During execution, if
FaultHandlingWindowProcessor detects that no events have been received for
more than one minute, it identifies that member as inactive and puts it
onto the fault_message stream. The FaultMessageEventFormatter in CEP then
reads this stream and publishes a message to the message broker, from which
the autoscaler receives it and performs the necessary actions.
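
The one-minute timestamp check can be sketched as below. This is a
simplified, hypothetical model, not the actual FaultHandlingWindowProcessor
source; it mainly illustrates why a member that was never active, and hence
never entered timeStampMap, is never checked.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the timestamp-based fault detection described
// above. Names (onHealthEvent, isFaulty, FAULT_THRESHOLD_MS) are
// illustrative, not the real FaultHandlingWindowProcessor API.
public class FaultDetectorSketch {
    static final long FAULT_THRESHOLD_MS = 60_000; // one minute, per above

    // memberId -> timestamp of the last health event from its agent
    final Map<String, Long> timeStampMap = new ConcurrentHashMap<>();

    // Called whenever a health event arrives from a cartridge agent.
    public void onHealthEvent(String memberId, long eventTimeMs) {
        timeStampMap.put(memberId, eventTimeMs);
    }

    // True if the member should be flagged on the fault_message stream.
    // A member that never became active is absent from the map and is
    // therefore never checked -- exactly the gap observed in this thread.
    public boolean isFaulty(String memberId, long nowMs) {
        Long last = timeStampMap.get(memberId);
        if (last == null) {
            return false; // never active: no baseline, no fault raised
        }
        return nowMs - last > FAULT_THRESHOLD_MS;
    }

    public static void main(String[] args) {
        FaultDetectorSketch d = new FaultDetectorSketch();
        d.onHealthEvent("member-1", 0);
        System.out.println(d.isFaulty("member-1", 30_000));  // false: fresh
        System.out.println(d.isFaulty("member-1", 61_000));  // true: stale
        System.out.println(d.isFaulty("member-2", 61_000));  // false: unseen
    }
}
```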

> - I think the fault handling is triggered by a fault_message, but I didn’t
> manage to figure out where it’s coming from. Does anyone know what triggers
> it? (is it the CEP extension?)
>

Hope the above explains what you asked.

>
> Anyway..
>
> *Questions*
> - How should Stratos detect after some downtime which cartridges are still
> there and which ones aren’t? (what was the intended design?)
>
Once the cartridge has become active, it is CEP's responsibility to detect
failures. Otherwise, if the instance is unhealthy right from being spawned,
the autoscaler keeps track of it and will terminate it after some time.


> - Why did the missing cartridge go “active”? Is this a result from
> restoring persistent state? (If I look in the registry I can see stuff
> under subscriptions/active, but not sure if that’s where it comes from)
>
I hope the above explains this as well.

> - Who should be responsible for detecting the absence of an instance - the
> ClusterMonitor? That seems to be fed incorrect data, since it clearly
> thinks there are enough instances running. Which component has the
> necessary data?
>
The absence of an instance will be detected by CEP. The ClusterMonitor
always keeps track of whether the minimum number of instances is up and
running, and of whether scaling is required based on the stats received.
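
A rough sketch of that division of responsibility, with illustrative names
(ClusterMonitorSketch and instancesToSpawn are assumptions, not the real
API): the monitor only compares the member count its topology view reports
against the policy minimum, so a stale view means no corrective action.

```java
// Illustrative sketch, not the actual ClusterMonitor code: the monitor
// believes whatever member count the in-memory topology feeds it. If that
// view is stale (e.g. a dead VM still marked Active), it sees "enough"
// members and never asks the cloud controller for a replacement.
public class ClusterMonitorSketch {
    private final int minInstanceCount;   // from the deployment policy
    private int believedActiveCount;      // fed from the topology view

    public ClusterMonitorSketch(int minInstanceCount) {
        this.minInstanceCount = minInstanceCount;
    }

    public void updateActiveCount(int count) {
        this.believedActiveCount = count;
    }

    // How many instances the monitor would request to restore the minimum.
    public int instancesToSpawn() {
        return Math.max(0, minInstanceCount - believedActiveCount);
    }

    public static void main(String[] args) {
        ClusterMonitorSketch monitor = new ClusterMonitorSketch(1);
        monitor.updateActiveCount(1); // stale topology: member still Active
        System.out.println(monitor.instancesToSpawn()); // 0: no action
        monitor.updateActiveCount(0); // a fault_message would drive this
        System.out.println(monitor.instancesToSpawn()); // 1: spawn one
    }
}
```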


> - It looks like it’s possible to snapshot CEP state
> <http://stackoverflow.com/questions/20348326/wso2-cep-siddhi-how-to-make-time-windows-persistent> to
> make it semi-persistent. However, if I restarted Stratos after 2min
> downtime, wouldn’t it try to kill all the nodes since the last reply was
> more than 60s ago? Also, snapshots would be periodic, so there’s still a
> window in which cartridges might “disappear".
>
As you mentioned, we would have to configure CEP to persist its data across
restarts. By default, the Stratos configuration does not enable data
persistence for CEP.

Thanks,
Reka

>
> Thanks a lot and best regards!
>
> Michiel
>



-- 
Reka Thirunavukkarasu
Senior Software Engineer,
WSO2, Inc.:http://wso2.com,
Mobile: +94776442007