Posted to dev@stratos.apache.org by Akila Ravihansa Perera <ra...@wso2.com> on 2014/08/28 09:18:58 UTC

Re: MemberFault event is lost forever when MB is down

Hi,

Since we're using WSO2 CEP for monitoring faulty members, it would
make sense to enhance the Faulty Member window processor [1] so that it
can recover from a core component failure. I have made some improvements
to this window processor and committed them in [2].

CEP will now have an additional dependency on the Stratos messaging
component (applicable only when using a stand-alone CEP), so it
can now listen to the topology topic events published by CC. CEP will
now check the cartridge agent health stats published by instances
against the member list published by CC in the complete topology event.
Thus, even if a MemberFault event is lost because of an MB failure,
Stratos can recover, since CEP periodically checks against the
member list published by CC. The code has been rigorously tested on
EC2 and OpenStack.
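
To illustrate the idea of the periodic check, here is a minimal sketch.
It is not the code committed in [2]: the class name MemberReconciler, its
method names and the timeout value are hypothetical, and the real window
processor works on CEP events rather than plain method calls. The point
is only that a silent member is reported on every pass for as long as
the topology still lists it, so a single lost MemberFault message no
longer matters.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical sketch only; not the actual FaultHandlingWindowProcessor. */
class MemberReconciler {
    // Assumed setting: how long a member may stay silent before it is faulty.
    private final long timeoutMillis;
    // memberId -> time of the last sign of life (health stat, or first
    // appearance in a complete topology event).
    private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

    MemberReconciler(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Called for each health stat published by a cartridge agent. */
    void onHealthStat(String memberId, long timestampMillis) {
        lastSeen.merge(memberId, timestampMillis, Math::max);
    }

    /** Called for each complete topology event published by CC. */
    void onCompleteTopology(Set<String> topologyMemberIds, long nowMillis) {
        // Forget members that CC no longer lists.
        lastSeen.keySet().retainAll(topologyMemberIds);
        // Start the clock for members we have not heard from yet.
        for (String memberId : topologyMemberIds) {
            lastSeen.putIfAbsent(memberId, nowMillis);
        }
    }

    /** Run periodically; returns every member that has been silent too long. */
    List<String> findFaultyMembers(long nowMillis) {
        List<String> faulty = new ArrayList<>();
        for (Map.Entry<String, Long> entry : lastSeen.entrySet()) {
            if (nowMillis - entry.getValue() > timeoutMillis) {
                faulty.add(entry.getKey());
            }
        }
        Collections.sort(faulty); // stable order for logging and tests
        return faulty;
    }
}
```

A caller would publish a MemberFault event for each ID returned by
findFaultyMembers(). If that publish is lost, the next pass returns the
same ID again, until a later complete topology event drops the member.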

The other possible alternative (as opposed to a dependency on the
messaging component) would be to create a new JMS input adaptor in CEP
and listen to the topology topic. But with this approach we would have to
duplicate the messaging component model (topology structure) in the CEP
window processor. This is an unnecessary duplication IMHO.

However, with this dependency on the messaging component in CEP, a
user deploying Stratos with a stand-alone CEP will have to
manually copy the messaging component artifacts to the CEP plugins
directory.

Would appreciate your thoughts on this.

[1] https://github.com/apache/stratos/blob/master/extensions/cep/stratos-cep-extension/src/main/java/org/apache/stratos/cep/extension/FaultHandlingWindowProcessor.java
[2] https://github.com/apache/stratos/commit/05e1ddc20a871b73b721487a13a2547cf9b8768d

Thanks.

On Wed, Jul 30, 2014 at 7:32 PM, Udara Liyanage <ud...@wso2.com> wrote:
> Hi Imesh,
>
> Yes, no message will be communicated when the message broker is not
> available.
>
>
> On Wed, Jul 30, 2014 at 7:24 PM, Imesh Gunaratne <im...@apache.org> wrote:
>>
>> As I understand it, it's not just the Member Fault event that is affected in
>> this scenario; any event that CEP publishes to the message broker will encounter
>> the same problem.
>>
>>
>> On Wed, Jul 30, 2014 at 5:49 AM, Michiel Blokzijl (mblokzij)
>> <mb...@cisco.com> wrote:
>>>
>>> +1.
>>>
>>> If Stratos, or any component it relies on, fails, and eventually returns
>>> to service, Stratos should "orchestrate" the cloud back to the desired
>>> state. If any cartridges went missing and after some time T (post failure)
>>> Stratos hasn’t re-discovered them, they should be respawned.
>>>
>>> Best regards,
>>>
>>> Michiel
>>>
>>>
>>> On 30 Jul 2014, at 05:51, Isuru Haththotuwa <is...@apache.org> wrote:
>>>
>>>
>>>
>>>
>>> On Wed, Jul 30, 2014 at 9:45 AM, Akila Ravihansa Perera
>>> <ra...@wso2.com> wrote:
>>>>
>>>> Hi Devs,
>>>>
>>>> The current Stratos architecture relies heavily on high availability of
>>>> the message broker. We faced a situation where, when the MB is down, some of
>>>> the messages published are lost forever and the system state is
>>>> never recovered.
>>>>
>>>> One such example: when a cartridge instance goes down, the CEP
>>>> component identifies this and publishes a MemberFault event to
>>>> the MB's summarized-health-stat topic. The problem is that the CEP
>>>> component creates its own list of cartridge instance members by
>>>> looking at the health stats published to the MB; it does not consider the
>>>> topology. Hence, when a cartridge instance goes down, the MemberFault
>>>> event is fired only once. If the MB is down at that time, the
>>>> message is lost forever, resulting in an unstable
>>>> system state in which Stratos thinks a member exists when in reality it
>>>> does not.
>>>>
>>>> We can introduce a simple housekeeping task to check whether every
>>>> member is alive. Ideally this should be the auto-scaler's responsibility.
>>>> It will allow the system to recover from an unstable
>>>> situation. I think this is a critical bug and should be given high
>>>> priority.
>>>>
>>>> Please share your thoughts.
>>>
>>> +1. We would need to decide what the best method for this is, though. If
>>> we consider CEP the central point of decision making, another option is to
>>> make it listen to the topology and reach the correct decision. Or else, we can
>>> use a health check mechanism for the MB which can detect if the MB is down and
>>> replay any of the messages. This IMO can be very useful since the primary
>>> communication mechanism in Stratos is the MB.
>>>
>>> One other important thing is to have fail-over/HA for the MB. There can be
>>> many other occasions where, if the MB is down, the system goes into an
>>> undefined state due to loss of messages.
>>>>
>>>>
>>>> --
>>>> Akila Ravihansa Perera
>>>> Software Engineer
>>>> WSO2 Inc.
>>>> http://wso2.com
>>>>
>>>> Blog: http://ravihansa3000.blogspot.com
>>>>
>>>> --
>>>> Thanks and Regards,
>>>>
>>>> Isuru H.
>>>> +94 716 358 048
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>>
>> --
>> Imesh Gunaratne
>>
>> Technical Lead, WSO2
>> Committer & PPMC Member, Apache Stratos
>
>
>
>
> --
>
> Udara Liyanage
> Software Engineer
> WSO2, Inc.: http://wso2.com
> lean. enterprise. middleware
>
> web: http://udaraliyanage.wordpress.com
> phone: +94 71 443 6897



-- 
Akila Ravihansa Perera
WSO2 Inc

Blog: http://ravihansa3000.blogspot.com

Re: MemberFault event is lost forever when MB is down

Posted by "Michiel Blokzijl (mblokzij)" <mb...@cisco.com>.
Hi,

I’m guessing the fix for [1] will be in 4.1.0, right? I’m glad you managed to resolve both issues with one fix!

Thanks and best regards,

Michiel


[1] https://issues.apache.org/jira/browse/STRATOS-795

On 12 Sep 2014, at 15:57, Lahiru Sandaruwan <la...@wso2.com> wrote:

> Ok cool, Let's resolve the Jira. 


Re: MemberFault event is lost forever when MB is down

Posted by Lahiru Sandaruwan <la...@wso2.com>.
Ok cool, Let's resolve the Jira.

On Fri, Sep 12, 2014 at 5:51 PM, Akila Ravihansa Perera <ra...@wso2.com>
wrote:

> Hi Lahiru,
>
> Yes, this is resolved now. Stratos will now check health stats against
> the member list published by CC to the topology topic (CompleteTopology
> event). This will allow Stratos to recover from MB failures and also
> from server-unavailable situations.
>
> Thanks.



-- 
--
Lahiru Sandaruwan
Committer and PMC member, Apache Stratos,
Senior Software Engineer,
WSO2 Inc., http://wso2.com
lean.enterprise.middleware

email: lahirus@wso2.com cell: (+94) 773 325 954
blog: http://lahiruwrites.blogspot.com/
twitter: http://twitter.com/lahirus
linked-in: http://lk.linkedin.com/pub/lahiru-sandaruwan/16/153/146

Re: MemberFault event is lost forever when MB is down

Posted by Akila Ravihansa Perera <ra...@wso2.com>.
Hi Lahiru,

Yes, this is resolved now. Stratos will now check health stats against
the member list published by CC to the topology topic (CompleteTopology
event). This will allow Stratos to recover from MB failures and also
from server-unavailable situations.

Thanks.

On Fri, Sep 12, 2014 at 5:03 PM, Lahiru Sandaruwan <la...@wso2.com> wrote:
> Hi Akila,
>
> Would [1] also be solved by the solution we discussed here?
>
> Thanks.
> [1] https://issues.apache.org/jira/browse/STRATOS-795



-- 
Akila Ravihansa Perera
Software Engineer, WSO2

Blog: http://ravihansa3000.blogspot.com

Re: MemberFault event is lost forever when MB is down

Posted by Lahiru Sandaruwan <la...@wso2.com>.
Hi Akila,

Would [1] also be solved by the solution we discussed here?

Thanks.
[1] https://issues.apache.org/jira/browse/STRATOS-795




-- 
--
Lahiru Sandaruwan
Committer and PMC member, Apache Stratos,
Senior Software Engineer,
WSO2 Inc., http://wso2.com
lean.enterprise.middleware

email: lahirus@wso2.com cell: (+94) 773 325 954
blog: http://lahiruwrites.blogspot.com/
twitter: http://twitter.com/lahirus
linked-in: http://lk.linkedin.com/pub/lahiru-sandaruwan/16/153/146