Posted to dev@cloudstack.apache.org by Daan Hoogland <DH...@schubergphilis.com> on 2013/08/22 12:31:48 UTC

HA redundant virtual router

LS,

Schuberg Philis guarantees 100% functional uptime for its customers. Infrastructure is of course part of this promise, and it is the easier factor for which to provide strong levels of resiliency. For this reason we want to use redundant virtual routers together with HA functionality.

We see HA and redundant routers as two different methods to provide higher levels of uptime.


1.      The redundant router setup takes care of seamless failover, without lengthy hiccups, in the case of a single router failure.

2.      HA takes care of restarting a failed VM or router, restoring connectivity in the case of a single router, or restoring 2N resiliency in the case of a redundant router setup.

The combination of these two methods will help us to meet our 100% promise. We need to restore 2N redundancy ASAP in the case of a single component failure, e.g. a router. With these two methods combined, the system is more autonomous and doesn't need human intervention to restore redundancy.

In the current situation we need to page an on-call engineer to restore redundancy ASAP, because of the tight SLAs. If we could use HA in combination with redundant routers, the on-call guy could enjoy his sleep and would be a happier guy :)
The present code forces the HA offering to off on redundant routers, which seems odd.

So my question is: why is it forced to off? Is there a technical constraint, or is this a design choice we can discuss and maybe revise?

Cheers,


Re: HA redundant virtual router

Posted by Daan Hoogland <da...@gmail.com>.
Hi Sheng, thanks. I will read it soon and comment or propose
additions/alterations.

(mobile bilingual spell checker used)
On 6 Sep 2013 00:27, "Sheng Yang" <sh...@yasker.org> wrote:

> Here is the doc.
>
>
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Redundant+Virtual+Router+Functional+Spec
>
> It's not extremely detailed, but it describes today's design in general terms.
>
> --Sheng
>
>
> On Thu, Aug 29, 2013 at 8:17 AM, Daan Hoogland <da...@gmail.com>wrote:
>
>> ok,
>>
>> let's postpone the discussion till you are at least half done. We
>> will of course continue to deliberate on what we need internally.
>>
>> Daan
>>
>> On Thu, Aug 29, 2013 at 5:08 PM, Sheng Yang <sh...@yasker.org> wrote:
>> > Hi Daan,
>> >
>> > As I said, I am writing a design doc to describe the current redundant
>> > router policy, to help in understanding the redundant router. Currently
>> > it doesn't support VPC, so how to implement it in VPC is still open for
>> > discussion.
>> >
>> > --Sheng
>> >
>> >
>> > On Thu, Aug 29, 2013 at 4:26 AM, Daan Hoogland <daan.hoogland@gmail.com>
>> > wrote:
>> >>
>> >> Sheng,
>> >>
>> >> just to make sure; You are going to write this document? I see Roeland
>> >> understood your mail like this.
>> >>
>> >> When you do, I'd like you to keep in mind that we also want redundant
>> >> routers within a VPC to ensure ACS upgrades are more seamless for
>> >> customer application groups and - dtap streets. If you need any help
>> >> on writing such a doc, let me know.
>> >>
>> >> kind regards,
>> >> Daan
>> >>
>> >> On Thu, Aug 29, 2013 at 1:13 PM, Roeland Kuipers
>> >> <RK...@schubergphilis.com> wrote:
>> >> > Hi Sheng,
>> >> >
>> >> > Thanks for the info. Looking forward to the design doc; I trust this
>> >> > will make things clearer.
>> >> > In the meantime we will be doing some research and thinking too, to see
>> >> > how we can improve things to also have HA on the RvR in a safe way.
>> >> > We will share this once ready.
>> >> >
>> >> > Thanks,
>> >> > Roeland
>> >> >
>> >> >
>> >> > From: Sheng Yang [mailto:sheng@yasker.org]
>> >> > Sent: Thursday, 29 August 2013 0:19
>> >> > To: <de...@cloudstack.apache.org>
>> >> > Cc: int-cloud; Daan Hoogland
>> >> > Subject: Re: HA redundant virtual router
>> >> >
>> >> > Hi Roeland,
>> >> >
>> >> > I will write a design doc to explain how the redundant router works
>> >> > currently. For example, for point 2, we have to force BACKUP to become
>> >> > MASTER because:
>> >> >
>> >> > 1. CS cannot communicate with MASTER at the time.
>> >> > 2. CS can communicate with BACKUP.
>> >> > 3. The rule has to be programmed immediately.
>> >> > 4. In case the old MASTER comes back, it should yield to the VR with the
>> >> > updated rule, rather than preempt the updated VR.
>> >> >
>> >> > In this case, CS needs to communicate with the RvR to program the new
>> >> > rule, and thus it needs to intervene in the RvR to ensure that if only
>> >> > one VR got the rule, that VR becomes MASTER.
>> >> >
>> >> > Still, I will write a doc later to try to cover every concern of the RvR
>> >> > design.
>> >> >
>> >> > --Sheng
>> >> >
>> >> > On Tue, Aug 27, 2013 at 3:40 AM, Roeland Kuipers
>> >> > <RK...@schubergphilis.com> wrote:
>> >> > Hi Sheng,
>> >> >
>> >> > Thanks for your reply. I'll see if we can replay this scenario.
>> >> >
>> >> > With respect to point 1: a good principle IMHO.
>> >> >
>> >> > Point 2: Why do we force a keepalived node to become master instead of
>> >> > waiting for keepalived to make it master? That way there is less reason
>> >> > to intervene and less risk of multiple masters, a behavior we have seen
>> >> > with RvR without HA in the past. The downside is that updates to rules
>> >> > do not function until backup becomes master. But maybe this is wise
>> >> > anyway, since there is something wrong. This conflicts a bit with
>> >> > point 2, as we do intervene here.
>> >> >
>> >> > Point 3: In my opinion keepalived is solid enough to leave this
>> >> > responsibility with keepalived; CS should just check the state and
>> >> > not fiddle with priorities to force masters, because there is obviously
>> >> > a reason why BACKUP refuses to become master.
>> >> > I think we should let keepalived prevent multiple masters, as it is
>> >> > designed to prevent this. Or am I missing something here?
>> >> > Actually, in the scenario you described, with a functioning guest
>> >> > network, keepalived should be able to handle this situation if we make
>> >> > sure all routers have different priorities.
>> >> >
>> >> > I am still of the opinion that HA and RvR are different mechanisms.
>> >> >
>> >> > So what do you think is necessary to make HA in combination with RvR
>> >> > possible? We have a clear business requirement to have this implemented
>> >> > in CS, and we have developers willing to make the changes needed.
>> >> > We would also like to see RvR on VPCs and are willing to contribute
>> >> > this functionality as well.
>> >> >
>> >> > Thanks for your feedback!
>> >> >
>> >> > Cheers,
>> >> > Roeland
>> >> >
>> >> > -----Original Message-----
>> >> > From: Sheng Yang [mailto:sheng@yasker.org]
>> >> > Sent: Friday, 23 August 2013 23:25
>> >> > To: <de...@cloudstack.apache.org>
>> >> > Subject: Re: HA redundant virtual router
>> >> >
>> >> > Hi Roeland,
>> >> >
>> >> > Thank you for your testing!
>> >> >
>> >> > Power off is not a concern right now, because at that time the VM
>> >> > would disappear anyway.
>> >> >
>> >> > Our concern is more about the case where the VM is still alive but we
>> >> > cannot detect it for a while. For example, a network glitch happens and
>> >> > CS temporarily loses its connection to the host (control network), but
>> >> > the guest network is still working.
>> >> > HA would start another VR, which could result in 3 routers in the
>> >> > guest network (at least for a moment). Much of the policy focuses on
>> >> > dealing with these intermediate states. Also, if you unplug the network
>> >> > cable of one host, many things should happen...
>> >> >
>> >> >
>> >> > In RvR we want to make sure that:
>> >> > 1. The status is self-governed, with no need for CS to intervene.
>> >> > 2. MASTER always gets the latest rules. That means that if we cannot
>> >> > communicate with MASTER, we turn to BACKUP, program the rule on it
>> >> > and make it MASTER, even though we cannot communicate with MASTER at
>> >> > this time. BACKUP should be able to become MASTER if we request it.
>> >> > This is achieved by using a script to bump up the priority of BACKUP.
>> >> > 3. We try our best to prevent the dual-MASTER situation. So we program
>> >> > different priorities for the VRs, and the MASTER/BACKUP status depends
>> >> > completely on priority.
>> >> >
>> >> > And if you take RvR as an alternative to the VM HA mechanism, it's not
>> >> > that counterintuitive in fact.
>> >> >
>> >> > --Sheng
>> >> >
>> >> >
>> >> > On Fri, Aug 23, 2013 at 1:56 AM, Roeland Kuipers
>> >> > <RKuipers@schubergphilis.com> wrote:
>> >> >
>> >> >> Hi Sheng,
>> >> >>
>> >> >> So far our testing showed no big problems. I've marked a redundant set
>> >> >> of routers as HA-enabled by setting the ha_enabled bit in the
>> >> >> vm_instance table. (This is our workaround at the moment.) We tested
>> >> >> HA in combination with RvR in the scenarios shutdown and forced
>> >> >> power-off of the VM. In these scenarios HA worked a treat and restored
>> >> >> the redundant pair as it should, and keepalived nicely negotiated
>> >> >> MASTER & BACKUP.
>> >> >> These are obviously basic tests, but we are happy to do some more
>> >> >> testing.
>> >> >>
>> >> >> I understand your concerns and am totally in favour of the KISS
>> >> >> principle.
>> >> >> What could be the scenario in which we end up with 3 routers?
>> >> >> Why is the situation complex to deal with? These are separate
>> >> >> mechanisms: HA just makes sure the router is up and alive, and
>> >> >> keepalived negotiates the MASTER-BACKUP states according to the
>> >> >> keepalived configuration, unless there are 3 routers with conflicting
>> >> >> configs. But so far I do not understand the scenario where we could
>> >> >> end up with 3 routers, so I cannot judge and/or test this.
>> >> >>
>> >> >> We would like to see the hardcoded denial of HA in a redundant router
>> >> >> setup go, for several reasons:
>> >> >> 1. It's counterintuitive: we configured an HA service offering on
>> >> >> purpose for the RvRs, and found out by accident that it was not
>> >> >> enabled at all.
>> >> >> 2. CS could implement a default offering without HA for this setup (to
>> >> >> keep it simple by default and keep the currently forced behaviour), but
>> >> >> users who, like us, deliberately want HA could create a custom
>> >> >> offering with HA enabled.
>> >> >>
>> >> >> This way it's configurable, doesn't change default behavior and is
>> >> >> more intuitive.
>> >> >>
>> >> >> Thanks & Cheers,
>> >> >> Roeland
>> >> >>
>> >> >>
>> >> >>
>> >> >> -----Original Message-----
>> >> >> From: Sheng Yang [mailto:sheng@yasker.org]
>> >> >> Sent: Friday, 23 August 2013 3:03
>> >> >> To: <de...@cloudstack.apache.org>
>> >> >> Subject: Re: HA redundant virtual router
>> >> >>
>> >> >> It's a design choice; the only reason is that it would be a very
>> >> >> complex situation to deal with. In fact the redundant router's own
>> >> >> policy is already very complex...
>> >> >>
>> >> >> We didn't look into the details at the time of implementing the
>> >> >> redundant router, but there are lots of concerns, e.g. a network glitch
>> >> >> may result in 3 routers running in the network, potentially with two
>> >> >> of them in MASTER state.
>> >> >>
>> >> >> Of course discussion is welcome. We just wanted to keep it as simple
>> >> >> as possible at the time.
>> >> >>
>> >> >> --Sheng
>> >> >>
>> >> >>
>> >> >> On Thu, Aug 22, 2013 at 3:31 AM, Daan Hoogland <
>> >> >> DHoogland@schubergphilis.com<ma...@schubergphilis.com>
>> >> >> > wrote:
>> >> >>
>> >> >> > LS,
>> >> >> >
>> >> >> > Schuberg Philis guarantees 100% functional uptime for their
>> >> >> > customers.
>> >> >> > Infrastructure is of course part of this promise and the easier
>> >> >> > factor to provide strong levels of resiliency. For this reason we
>> >> >> > want to make use of redundant virtual routers together with HA
>> >> >> > functionality.
>> >> >> >
>> >> >> > We see HA and redundant routers as to different methods to provide
>> >> >> > higher levels of uptime.
>> >> >> >
>> >> >> >
>> >> >> > 1.      The redundant router setup takes care of seamless failover
>> >> >> without
>> >> >> > lengthy hick-ups in the case of a single router failure.
>> >> >> >
>> >> >> > 2.      HA takes care of restarting a failed VM or router.
>> Restoring
>> >> >> > connectivity in the case of single router or restoring 2n
>> resiliency
>> >> >> > in the case of a redundant router setup.
>> >> >> >
>> >> >> > The combination of these two methods will help us to meet our 100%
>> >> >> > promise; .We need to restore 2N redundancy ASAP in the case of
>> >> >> > single component failure e.g. a router. With these two methods
>> >> >> > combined the system is more autonomous and doesn't need human
>> >> >> > intervention to restore redundancy.
>> >> >> >
>> >> >> > In the current situation we need to send a page to an on call
>> >> >> > engineer to restore redundancy asap, because of the tight SLA's.
>> >> >> > While if we could use HA icw redundant routers. The on-call guy
>> can
>> >> >> > enjoy his sleep and will be a more happy guy :) The present code
>> >> >> > forces the HA offering to off on redundant routers which seems
>> odd.
>> >> >> >
>> >> >> > So my question is: Why is it forced to off; Is there a technical
>> >> >> > restraint or is this a design choice we can discuss and maybe
>> revise?
>> >> >> >
>> >> >> > Cheers,
>> >> >> >
>> >> >> >
>> >> >>
>> >> >
>> >
>> >
>>
>
>
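
The three-router concern raised in the quoted thread above can be made concrete with a toy model. The sketch below is only an illustration: the router names, priorities and addresses are made up, and it does not reproduce what keepalived or CloudStack actually do during a control-network glitch. It applies the VRRP election rule (highest priority wins, with an equal priority broken by the higher primary IP address) and flags the case where two routers share the top priority.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address


@dataclass(frozen=True)
class Router:
    name: str       # hypothetical label, e.g. "r-1"
    priority: int   # VRRP priority as configured for keepalived
    address: str    # primary IP on the guest network (made-up value)


def elect_master(routers):
    """Return the router that wins a VRRP-style election.

    Highest priority wins; equal priorities are broken by the
    higher primary IP address.
    """
    return max(routers, key=lambda r: (r.priority, IPv4Address(r.address)))


def has_priority_tie(routers):
    """True if two or more routers share the top priority."""
    top = max(r.priority for r in routers)
    return sum(1 for r in routers if r.priority == top) > 1


# Normal redundant pair: distinct priorities, unambiguous result.
pair = [Router("r-1", 100, "10.1.1.2"), Router("r-2", 99, "10.1.1.3")]

# Hypothetical glitch: HA starts r-3 with the priority r-1 already holds,
# while r-1 is still alive on the guest network.
trio = pair + [Router("r-3", 100, "10.1.1.4")]

print(elect_master(pair).name, has_priority_tie(pair))
print(elect_master(trio).name, has_priority_tie(trio))
```

In this model the tie is the only thing that changes between the two cases, which matches the argument that distinct priorities are what keeps the pair unambiguous.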

Re: HA redundant virtual router

Posted by Simon Weller <sw...@ena.com>.
I think monit is currently installed in the system VM image. It sounds like it might make more sense to manage haproxy via monit and let it recover the service should it fail.
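
As a rough illustration of the check-and-recover idea, here is a minimal sketch. It is not monit configuration and does not claim to show how monit works internally; `pid_alive` and `ensure_running` are hypothetical helper names, and the signal-0 probe assumes a POSIX system (it is the same kind of existence check that `killall -0` performs).

```python
import os


def pid_alive(pid):
    """Probe a process with signal 0, which checks existence only."""
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False
    except PermissionError:
        # The process exists but belongs to another user.
        return True
    return True


def ensure_running(is_alive, restart):
    """Call restart() only when is_alive() reports a failure.

    Returns True if a restart was attempted, False otherwise.
    """
    if is_alive():
        return False
    restart()
    return True


# Example with stand-in callables instead of a real haproxy service.
attempts = []
ensure_running(lambda: False, lambda: attempts.append("restart haproxy"))
print(attempts)
```

Recovering the service in place like this keeps the router in its current VRRP state, whereas a tracked keepalived script changes which router is MASTER; the two approaches answer different questions.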


Re: HA redundant virtual router

Posted by Sheng Yang <sh...@yasker.org>.
No, it's not intentional.

HAProxy is one of the services that the redundant router enables or
disables according to the MASTER/BACKUP status. All the services
related to the redundant router are controlled by services.sh.

What exactly is the HAProxy failure in your case? And what's the root
cause?

Also, I think just yielding on an haproxy failure won't help much, since
effort is still needed for CS to recover the situation; at the very least
it would need to notify the admin. It would be better to transition to the
FAULT state if it's a critical error.

--Sheng


On Tue, Sep 17, 2013 at 12:07 AM, Sten Spans <st...@blinkenlights.nl> wrote:

> On Mon, 16 Sep 2013, Sheng Yang wrote:
>
>  The reason for no HA as I said before, due to the complexity. E.g, if
>> there
>> can be 3 routers in the network(which control network is down but not the
>> guest network), and it would cause two of them with the same priority(at
>> certain time). The doc is mainly for describing the current policy and
>> reason, as the basic for possible improvement.
>>
>> I haven't thought much about redundant router with HA, but many times
>> we're
>> dealing with intermittent network issue, so you try to plug off then plug
>> in the network cable to see if HA works as expect.
>>
>> The priority cannot changed on the fly, it's a parameter of keepalived
>> process, which is running. So at least both router need to be stopped
>> before priority reset. And it's not reset to minimum, since the value can
>> go up or down based on the different cases.
>>
>
> Looking at the doc you wrote I see no mention of HAProxy.
> Is this intentional?
>
> A failure of HAProxy (which we've observed in practice) would
> result in a loss of service for loadbalanced ports.
>
> Currently I'm thinking of adding something like the following:
>
>  diff -Nru keepalived.conf.templ.orig keepalived.conf.templ
> --- keepalived.conf.templ.orig  2013-09-17 09:02:28.410646521 +0200
> +++ keepalived.conf.templ       2013-09-17 09:03:34.131434084 +0200
> @@ -19,6 +19,12 @@
>     router_id [ROUTER_ID]
>  }
>
> +vrrp_script check_haproxy {
> +    script "/usr/bin/killall -0 haproxy"
> +    interval 5
> +    weight 10
> +}
> +
>  vrrp_script check_bumpup {
>      script "[RROUTER_BIN_PATH]/check_bumpup.sh"
>      interval 5
> @@ -47,6 +53,7 @@
>      }
>
>      track_script {
> +        check_haproxy
>          check_bumpup
>          heartbeat
>      }
>
>
> This would boost vrrp priorities if haproxy is running, trigger
> a failover if it fails, and should be harmless on hosts not running
> haproxy.
>
> --
> Sten Spans
>
> "There is a crack in everything, that's how the light gets in."
> Leonard Cohen - Anthem
>

Re: HA redundant virtual router

Posted by Sten Spans <st...@blinkenlights.nl>.
On Mon, 16 Sep 2013, Sheng Yang wrote:

> The reason for no HA as I said before, due to the complexity. E.g, if there
> can be 3 routers in the network(which control network is down but not the
> guest network), and it would cause two of them with the same priority(at
> certain time). The doc is mainly for describing the current policy and
> reason, as the basic for possible improvement.
>
> I haven't thought much about redundant router with HA, but many times we're
> dealing with intermittent network issue, so you try to plug off then plug
> in the network cable to see if HA works as expect.
>
> The priority cannot changed on the fly, it's a parameter of keepalived
> process, which is running. So at least both router need to be stopped
> before priority reset. And it's not reset to minimum, since the value can
> go up or down based on the different cases.

Looking at the doc you wrote I see no mention of HAProxy.
Is this intentional?

A failure of HAProxy (which we've observed in practice) would
result in a loss of service for loadbalanced ports.

Currently I'm thinking of adding something like the following:

  diff -Nru keepalived.conf.templ.orig keepalived.conf.templ
--- keepalived.conf.templ.orig  2013-09-17 09:02:28.410646521 +0200
+++ keepalived.conf.templ       2013-09-17 09:03:34.131434084 +0200
@@ -19,6 +19,12 @@
     router_id [ROUTER_ID]
  }

+vrrp_script check_haproxy {
+    script "/usr/bin/killall -0 haproxy"
+    interval 5
+    weight 10
+}
+
  vrrp_script check_bumpup {
      script "[RROUTER_BIN_PATH]/check_bumpup.sh"
      interval 5
@@ -47,6 +53,7 @@
      }

      track_script {
+        check_haproxy
          check_bumpup
          heartbeat
      }


This would boost vrrp priorities if haproxy is running, trigger
a failover if it fails, and should be harmless on hosts not running 
haproxy.
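
The effect of the positive `weight` can be sketched with a little arithmetic. This is a simplified model under stated assumptions: the base priorities 100, 99 and 80 are made up, the other tracked scripts (check_bumpup, heartbeat) are ignored, and preemption is assumed to be enabled. It relies only on the documented keepalived rule that a positive weight is added to the priority while the script succeeds.

```python
def effective_priority(base, weight, script_ok):
    """Priority after applying one tracked script with a positive weight.

    The weight is added while the script succeeds; on failure the
    base priority is left unchanged.
    """
    return base + weight if script_ok else base


def fails_over(master_base, backup_base, weight):
    """True if losing haproxy on the master drops it below a healthy backup."""
    degraded_master = effective_priority(master_base, weight, False)
    healthy_backup = effective_priority(backup_base, weight, True)
    return degraded_master < healthy_backup


WEIGHT = 10  # value from the proposed check_haproxy block

# Base priorities one apart: the failed master (100) falls below the
# healthy backup (99 + 10), so the backup takes over.
print(fails_over(100, 99, WEIGHT))

# Base priorities 20 apart: the failed master (100) still beats the
# healthy backup (80 + 10), so nothing happens.
print(fails_over(100, 80, WEIGHT))

# Neither router runs haproxy: no boost on either side, order unchanged.
print(effective_priority(100, WEIGHT, False) > effective_priority(99, WEIGHT, False))
```

Under these assumptions the failover only triggers when the gap between the base priorities is smaller than the weight, so the weight has to be chosen with the priority scheme in mind.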

-- 
Sten Spans

"There is a crack in everything, that's how the light gets in."
Leonard Cohen - Anthem

Re: HA redundant virtual router

Posted by Sheng Yang <sh...@yasker.org>.
The reason for no HA, as I said before, is the complexity. E.g., there
can be 3 routers in the network (when the control network is down but not
the guest network), and that would cause two of them to have the same
priority (at a certain time). The doc is mainly for describing the current
policy and its reasons, as the basis for possible improvement.

I haven't thought much about the redundant router with HA, but many times
we're dealing with intermittent network issues, so you could try unplugging
and then replugging the network cable to see if HA works as expected.

The priority cannot be changed on the fly; it's a parameter of the
keepalived process, which is running. So at least both routers need to be
stopped before a priority reset. And it's not reset to the minimum, since
the value can go up or down depending on the case.

--Sheng


On Mon, Sep 16, 2013 at 2:25 AM, Daan Hoogland <da...@gmail.com>wrote:

> Hi Sheng,
>
> From your doc I don't read the reasons for disabling HA for RVR.
>
> I do see some possible glitches though. One is that a newly installed
> router should always have higher prio if the master cannot be reached.
> Something similar should happen on rule programming.
>
> To deal with the 255 limit on prios they could be reset to the minimum
> as soon as both routers are found again.
>
> Am I correct?
>
> thanks,
> Daan
>
> On Fri, Sep 6, 2013 at 12:27 AM, Sheng Yang <sh...@yasker.org> wrote:
> > Here is the doc.
> >
> >
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Redundant+Virtual+Router+Functional+Spec
> >
> > It's not extremely detail, but describe today's design generally.
> >
> > --Sheng
> >
> >
> > On Thu, Aug 29, 2013 at 8:17 AM, Daan Hoogland <da...@gmail.com>
> > wrote:
> >>
> >> ok,
> >>
> >> let's postpone the discussion till you are at least halve done. We
> >> will of course continue to deliberate on what we need internally.
> >>
> >> Daan
> >>
> >> On Thu, Aug 29, 2013 at 5:08 PM, Sheng Yang <sh...@yasker.org> wrote:
> >> > Hi Daan,
> >> >
> >> > As I said, I am writing a design doc to describe the current redundant
> >> > router policy, to help understanding redundant router. Current it
> >> > doesn't
> >> > support VPC, so how to implement it in VPC is still open to discuss.
> >> >
> >> > --Sheng
> >> >
> >> >
> >> > On Thu, Aug 29, 2013 at 4:26 AM, Daan Hoogland <
> daan.hoogland@gmail.com>
> >> > wrote:
> >> >>
> >> >> Sheng,
> >> >>
> >> >> just to make sure; You are going to write this document? I see
> Roeland
> >> >> understood your mail like this.
> >> >>
> >> >> When you do, I'd like you to keep in mind that we also want redundant
> >> >> routers within a VPC to ensure ACS upgrades are more seamless for
> >> >> customer application groups and - dtap streets. If you need any help
> >> >> on writing such a doc, let me know.
> >> >>
> >> >> kind regards,
> >> >> Daan
> >> >>
> >> >> On Thu, Aug 29, 2013 at 1:13 PM, Roeland Kuipers
> >> >> <RK...@schubergphilis.com> wrote:
> >> >> > Hi Sheng,
> >> >> >
> >> >> > Thanks for the info. Looking forward to the design doc, I trust
> this
> >> >> > will make things clearer.
> >> >> > In the meantime will be doing some research and thinking too, to
> see
> >> >> > how
> >> >> > we can improve things to also have HA on the RvR in a safe way.
> >> >> > We will share this once ready.
> >> >> >
> >> >> > Thanks,
> >> >> > Roeland
> >> >> >
> >> >> >
> >> >> > From: Sheng Yang [mailto:sheng@yasker.org]
> >> >> > Sent: donderdag 29 augustus 2013 0:19
> >> >> > To: <de...@cloudstack.apache.org>
> >> >> > Cc: int-cloud; Daan Hoogland
> >> >> > Subject: Re: HA redundant virtual router
> >> >> >
> >> >> > Hi Roeland,
> >> >> >
> >> >> > I would write a design doc to explain how redundant router works
> >> >> > currently. For example, for the point 2, we have to force BACKUP
> >> >> > become
> >> >> > MASTER because:
> >> >> >
> >> >> > 1. CS cannot communicate with MASTER at the time
> >> >> > 2. CS can communicate with BACKUP.
> >> >> > 3. Rule has to be programmed immediately.
> >> >> > 4. In case old MASTER come back, it should yield to the VR with
> >> >> > updated
> >> >> > rule, rather than preempt the updated VR.
> >> >> >
> >> >> > In this case, CS need to communicate with RvR to program the new
> >> >> > rule,
> >> >> > thus it need to intervene the RvR to ensure that if there is only
> one
> >> >> > VR got
> >> >> > the rule, it should become MASTER.
> >> >> >
> >> >> > Still, I would write a doc later to try to cover every concern of
> RvR
> >> >> > design.
> >> >> >
> >> >> > --Sheng
> >> >> >
> >> >> > On Tue, Aug 27, 2013 at 3:40 AM, Roeland Kuipers
> >> >> > <RK...@schubergphilis.com>>
> >> >> > wrote:
> >> >> > Hi Sheng,
> >> >> >
> >> >> > Thanks for your reply. I'll see if we can replay this scenario.
> >> >> >
> >> >> > With respect to point 1: a good principal IMHO.
> >> >> >
> >> >> > Point 2: Why do we force a keepalived node to become master and not
> >> >> > wait
> >> >> > for keepalived to become master? This way there is less reason to
> >> >> > intervene
> >> >> > and less risk of multiple masters? As we have seen this behavior
> with
> >> >> > RvR
> >> >> > without HA in the past. The downside that updates to rules do not
> >> >> > function
> >> >> > until backup becomes master. But maybe this is wise anyways since
> >> >> > there is
> >> >> > something wrong. This conflicts a bit with point 2 as we do
> intervene
> >> >> > here.
> >> >> >
> >> >> > Point 3: In my opinion keepalived is solid enough to leave this
> >> >> > responsibility with keepalived and that CS just should check the
> >> >> > state and
> >> >> > not fiddle with priorities to force masters. Because there is
> >> >> > obviously a
> >> >> > reason why BACKUP refuses to become master.
> >> >> > I think we should let keepalived prevent multiple master as is
> >> >> > designed
> >> >> > to prevent this. Or do I miss something here?
> >> >> > Actually in the scenario you described, with a functioning guest
> >> >> > network, keepalived should be able to handle this situation if we
> >> >> > make sure
> >> >> > all routers have different prios.
> >> >> >
> >> >> > I still have the opinion HA and RvR are different mechanisms.
> >> >> >
> >> >> > So what do you think is necessary to have the possibility of HA icw
> >> >> > RvR?
> >> >> > We have a clear business requirement to have this implement on CS.
> >> >> > And we
> >> >> > have Developers willing to create these changes to make this
> >> >> > possible.
> >> >> > We also like to see RvR on VPC's and are also willing to contribute
> >> >> > this
> >> >> > functionality.
> >> >> >
> >> >> > Thanks for your feedback!
> >> >> >
> >> >> > Cheers,
> >> >> > Roeland
> >> >> >
> >> >> > -----Original Message-----
> >> >> > From: Sheng Yang [mailto:sheng@yasker.org<mailto:sheng@yasker.org
> >]
> >> >> > Sent: vrijdag 23 augustus 2013 23:25
> >> >> > To: <de...@cloudstack.apache.org>>
> >> >> > Subject: Re: HA redundant virtual router
> >> >> >
> >> >> > Hi Roeland,
> >> >> >
> >> >> > Thank you for your testing!
> >> >> >
> >> >> > Power off is not an concern right now, because at that time the VM
> >> >> > would
> >> >> > disappear anyway.
> >> >> >
> >> >> > Our concern is more about if VM is still alive but we cannot detect
> >> >> > it
> >> >> > for a while. For example, a network glitch happened, CS lost
> >> >> > connection to
> >> >> > the host temporarily(control network), but the guest network is
> still
> >> >> > working.
> >> >> > HA would start another VR, which would possible result in 3 routers
> >> >> > in
> >> >> > the guest network(at least for a moment). Many of the policy focus
> on
> >> >> > dealing these intermediate status. Also if you plug off the network
> >> >> > cable of
> >> >> > one host many things should happen...
> >> >> >
> >> >> >
> >> >> > In RvR we want to make sure:
> >> >> > 1. The status are self-governed, no need for CS to intervene.
> >> >> > 2. MASTER would always get the latest rules. That means, if we
> cannot
> >> >> > communicate with MASTER, we would turn to BACKUP and program the
> rule
> >> >> > on it
> >> >> > and make it MASTER - even we cannot communicate with MASTER at this
> >> >> > time.
> >> >> > And BACKUP should able to become MASTER if we request. This is
> >> >> > achieved
> >> >> > by using a script to bump up the priority of BACKUP.
> >> >> > 3. Trying best to prevent the dual-MASTER situation. So we would
> >> >> > program
> >> >> > different priority for VRs and the MASTER/BACKUP status completely
> >> >> > depends
> >> >> > on priority.
> >> >> >
> >> >> > And if you take RvR as an alternative to VM's HA mechanism., it's
> not
> >> >> > that counter intuitive in fact.
> >> >> >
> >> >> > --Sheng
> >> >> >
> >> >> >
> >> >> > On Fri, Aug 23, 2013 at 1:56 AM, Roeland Kuipers <
> >> >> > RKuipers@schubergphilis.com<ma...@schubergphilis.com>>
> >> >> > wrote:
> >> >> >
> >> >> >> Hi Sheng,
> >> >> >>
> >> >> >> So far our testing showed no big problems. I've marked a redundant
> >> >> >> set
> >> >> >> of routers to be ha_enabled by setting ha_enabled bit in the
> >> >> >> vm_instance table. (This is our workaround ATM) We tested HA icw
> RvR
> >> >> >> in the scenarios ,shutdown / force power off VM. In these
> scenarios
> >> >> >> HA
> >> >> >> worked a treat and did restore the redundant pair as it should.
> And
> >> >> >> keepalived nicely negotiated MASTER & BACKUP.
> >> >> >> These are obviously basic tests, but we are happy to do some more
> >> >> >> testing.
> >> >> >>
> >> >> >> I understand your concerns and am totally in favour of the KISS
> >> >> >> principle.
> >> >> >> What could be the scenario to end up with 3 routers?
> >> >> >> Why is the situation complex to deal with? These are separate
> >> >> >> mechanisms.
> >> >> >> HA just making sure the router is up and alive. And keepalived
> >> >> >> negotatiating MASTER-BACUP states according to keepalived
> >> >> >> configuration, unless there a 3 routers with conflicting configs.
> >> >> >> But
> >> >> >> so far I do not understand the scenario where we could end up
> with 3
> >> >> >> routers, so I cannot judge end/or test this.
> >> >> >>
> >> >> >> We would like to see the hardcoded denial of HA in a redundant
> >> >> >> router setup go, for several reasons:
> >> >> >> 1. It's counterintuitive: we configured an HA service offering on
> >> >> >> purpose for the RvRs, and found out by accident that it was not
> >> >> >> enabled at all.
> >> >> >> 2. CS could implement a default offering without HA for this setup
> >> >> >> (to keep it simple by default and keep the currently forced
> >> >> >> behaviour), but if users, like us, deliberately want HA, they
> >> >> >> can create a custom offering with HA enabled.
> >> >> >>
> >> >> >> This way it's configurable, doesn't change default behavior and is
> >> >> >> more intuitive.
> >> >> >>
> >> >> >> Thanks & Cheers,
> >> >> >> Roeland
> >> >> >>
> >> >> >>
> >> >> >>
> >> >> >> -----Original Message-----
> >> >> >> From: Sheng Yang [mailto:sheng@yasker.org]
> >> >> >> Sent: vrijdag 23 augustus 2013 3:03
> >> >> >> To: <de...@cloudstack.apache.org>>
> >> >> >> Subject: Re: HA redundant virtual router
> >> >> >>
> >> >> >> It's a design choice; the only reason is that it would be a very
> >> >> >> complex situation to deal with. In fact, the redundant router's
> >> >> >> own policy is already very complex...
> >> >> >>
> >> >> >> We didn't look into the details at the time of implementing the
> >> >> >> redundant router, but there are lots of concerns, e.g. a network
> >> >> >> glitch may result in 3 routers running in the network, with
> >> >> >> potentially two of them in MASTER state.
> >> >> >>
> >> >> >> Of course discussion is welcome. We just wanted to keep it as
> >> >> >> simple as possible at the time.
> >> >> >>
> >> >> >> --Sheng
> >> >> >>
> >> >> >>
> >> >> >> On Thu, Aug 22, 2013 at 3:31 AM, Daan Hoogland <
> >> >> >> DHoogland@schubergphilis.com<ma...@schubergphilis.com>
> >> >> >> > wrote:
> >> >> >>
> >> >> >> > LS,
> >> >> >> >
> >> >> >> > Schuberg Philis guarantees 100% functional uptime for their
> >> >> >> > customers. Infrastructure is of course part of this promise, and
> >> >> >> > the easier factor for which to provide strong levels of
> >> >> >> > resiliency. For this reason we want to make use of redundant
> >> >> >> > virtual routers together with HA functionality.
> >> >> >> >
> >> >> >> > We see HA and redundant routers as two different methods to
> >> >> >> > provide higher levels of uptime.
> >> >> >> >
> >> >> >> >
> >> >> >> > 1.      The redundant router setup takes care of seamless
> >> >> >> > failover without lengthy hiccups in the case of a single router
> >> >> >> > failure.
> >> >> >> >
> >> >> >> > 2.      HA takes care of restarting a failed VM or router,
> >> >> >> > restoring connectivity in the case of a single router or
> >> >> >> > restoring 2N resiliency in the case of a redundant router setup.
> >> >> >> >
> >> >> >> > The combination of these two methods will help us to meet our
> >> >> >> > 100% promise. We need to restore 2N redundancy ASAP in the case
> >> >> >> > of a single component failure, e.g. a router. With these two
> >> >> >> > methods combined the system is more autonomous and doesn't need
> >> >> >> > human intervention to restore redundancy.
> >> >> >> >
> >> >> >> > In the current situation we need to page an on-call engineer to
> >> >> >> > restore redundancy ASAP, because of the tight SLAs, whereas if we
> >> >> >> > could use HA in combination with redundant routers, the on-call
> >> >> >> > guy could enjoy his sleep and be a happier guy :) The present
> >> >> >> > code forces the HA offering to off on redundant routers, which
> >> >> >> > seems odd.
> >> >> >> >
> >> >> >> > So my question is: why is it forced to off? Is there a technical
> >> >> >> > constraint, or is this a design choice we can discuss and maybe
> >> >> >> > revise?
> >> >> >> >
> >> >> >> > Cheers,
> >> >> >> >
> >> >> >> >

Re: HA redundant virtual router

Posted by Daan Hoogland <da...@gmail.com>.
Hi Sheng,

From your doc I can't tell the reasons for disabling HA for RvR.

I do see some possible glitches, though. One is that a newly installed
router should always get a higher priority if the master cannot be reached.
Something similar should happen when rules are programmed.

To deal with the limit of 255 on priorities, they could be reset to the
minimum as soon as both routers are reachable again.
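
To make the priority idea concrete, here is a minimal Python sketch. It is not CloudStack or keepalived code: the `Router` class, `bump_over`, `reset_if_both_reachable`, and the `BASE_PRIORITY` and `STEP` values are all hypothetical names and numbers chosen for illustration, and the ceiling of 255 is simply the limit mentioned above.

```python
from dataclasses import dataclass

# Assumptions for illustration only: 255 is the priority ceiling discussed
# in this thread; BASE_PRIORITY and STEP are arbitrary example values.
MAX_PRIORITY = 255
BASE_PRIORITY = 100
STEP = 2


@dataclass
class Router:
    name: str
    priority: int
    reachable: bool = True


def bump_over(router: Router, peer: Router) -> None:
    """Give `router` a priority strictly above `peer`, up to the ceiling."""
    wanted = peer.priority + STEP
    if wanted > MAX_PRIORITY:
        raise ValueError("priority ceiling reached; a reset is needed first")
    router.priority = wanted


def reset_if_both_reachable(master: Router, backup: Router) -> bool:
    """Drop both priorities back to the minimum pair, keeping their order."""
    if not (master.reachable and backup.reachable):
        return False
    master.priority = BASE_PRIORITY + STEP
    backup.priority = BASE_PRIORITY
    return True


if __name__ == "__main__":
    old_master = Router("r1", 100, reachable=False)
    new_router = Router("r2", 98)
    bump_over(new_router, old_master)        # new router now outranks r1
    old_master.reachable = True              # both routers are found again
    reset_if_both_reachable(new_router, old_master)
    print(new_router.priority, old_master.priority)
```

Without the reset step, every bump moves the pair closer to the ceiling, which is why the sketch refuses to bump past 255 instead of wrapping around.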

Am I correct?

thanks,
Daan

On Fri, Sep 6, 2013 at 12:27 AM, Sheng Yang <sh...@yasker.org> wrote:
> Here is the doc.
>
> https://cwiki.apache.org/confluence/display/CLOUDSTACK/Redundant+Virtual+Router+Functional+Spec
>
> It's not extremely detailed, but it describes today's design in general.
>
> --Sheng
>
>

Re: HA redundant virtual router

Posted by Sheng Yang <sh...@yasker.org>.
Here is the doc.

https://cwiki.apache.org/confluence/display/CLOUDSTACK/Redundant+Virtual+Router+Functional+Spec

It's not extremely detailed, but it describes today's design in general.

--Sheng


On Thu, Aug 29, 2013 at 8:17 AM, Daan Hoogland <da...@gmail.com> wrote:

> ok,
>
> let's postpone the discussion till you are at least half done. We
> will of course continue to deliberate on what we need internally.
>
> Daan
>

Re: HA redundant virtual router

Posted by Daan Hoogland <da...@gmail.com>.
ok,

let's postpone the discussion till you are at least half done. We
will of course continue to deliberate on what we need internally.

Daan

On Thu, Aug 29, 2013 at 5:08 PM, Sheng Yang <sh...@yasker.org> wrote:
> Hi Daan,
>
> As I said, I am writing a design doc to describe the current redundant
> router policy, to help in understanding the redundant router. Currently it
> doesn't support VPC, so how to implement it in a VPC is still open for
> discussion.
>
> --Sheng
>
>
> On Thu, Aug 29, 2013 at 4:26 AM, Daan Hoogland <da...@gmail.com>
> wrote:
>>
>> Sheng,
>>
>> Just to make sure: you are going to write this document? I see Roeland
>> understood your mail that way.
>>
>> When you do, I'd like you to keep in mind that we also want redundant
>> routers within a VPC, to make ACS upgrades more seamless for
>> customer application groups and DTAP streets. If you need any help
>> writing such a doc, let me know.
>>
>> kind regards,
>> Daan
>>
>> On Thu, Aug 29, 2013 at 1:13 PM, Roeland Kuipers
>> <RK...@schubergphilis.com> wrote:
>> > Hi Sheng,
>> >
>> > Thanks for the info. Looking forward to the design doc, I trust this
>> > will make things clearer.
>> > In the meantime will be doing some research and thinking too, to see how
>> > we can improve things to also have HA on the RvR in a safe way.
>> > We will share this once ready.
>> >
>> > Thanks,
>> > Roeland
>> >
>> >
>> > From: Sheng Yang [mailto:sheng@yasker.org]
>> > Sent: donderdag 29 augustus 2013 0:19
>> > To: <de...@cloudstack.apache.org>
>> > Cc: int-cloud; Daan Hoogland
>> > Subject: Re: HA redundant virtual router
>> >
>> > Hi Roeland,
>> >
>> > I will write a design doc to explain how the redundant router currently
>> > works. For example, for point 2, we have to force BACKUP to become
>> > MASTER because:
>> >
>> > 1. CS cannot communicate with MASTER at the time.
>> > 2. CS can communicate with BACKUP.
>> > 3. The rule has to be programmed immediately.
>> > 4. In case the old MASTER comes back, it should yield to the VR with the
>> > updated rule, rather than preempt the updated VR.
>> >
>> > In this case, CS needs to communicate with the RvR to program the new rule,
>> > so it needs to intervene in the RvR to ensure that if only one VR got
>> > the rule, that VR becomes MASTER.
>> >
>> > Still, I will write a doc later to try to cover every concern about the RvR
>> > design.
>> >
>> > --Sheng
>> >
>> > On Tue, Aug 27, 2013 at 3:40 AM, Roeland Kuipers
>> > <RK...@schubergphilis.com>> wrote:
>> > Hi Sheng,
>> >
>> > Thanks for your reply. I'll see if we can replay this scenario.
>> >
>> > With respect to point 1: a good principle IMHO.
>> >
>> > Point 2: Why do we force a keepalived node to become master instead of
>> > waiting for keepalived to make it master? That way there is less reason to
>> > intervene and less risk of multiple masters, a behavior we have seen with RvR
>> > without HA in the past. The downside is that updates to rules do not take
>> > effect until backup becomes master. But maybe this is wise anyway, since
>> > something is wrong. This conflicts a bit with point 2 as we do intervene here.
>> >
>> > Point 3: In my opinion keepalived is solid enough to leave this
>> > responsibility with keepalived, and CS should just check the state and
>> > not fiddle with priorities to force masters, because there is obviously a
>> > reason why BACKUP refuses to become master.
>> > I think we should let keepalived prevent multiple masters, as it is designed
>> > to do. Or am I missing something here?
>> > Actually, in the scenario you described, with a functioning guest
>> > network, keepalived should be able to handle this situation if we make sure
>> > all routers have different priorities.
>> >
>> > I am still of the opinion that HA and RvR are different mechanisms.
>> >
>> > So what do you think is necessary to make HA in combination with RvR possible?
>> > We have a clear business requirement to have this implemented in CS, and we
>> > have developers willing to create the changes to make this possible.
>> > We would also like to see RvR on VPCs and are willing to contribute this
>> > functionality as well.
>> >
>> > Thanks for your feedback!
>> >
>> > Cheers,
>> > Roeland
>> >
>> > -----Original Message-----
>> > From: Sheng Yang [mailto:sheng@yasker.org<ma...@yasker.org>]
>> > Sent: vrijdag 23 augustus 2013 23:25
>> > To: <de...@cloudstack.apache.org>>
>> > Subject: Re: HA redundant virtual router
>> >
>> > Hi Roeland,
>> >
>> > Thank you for your testing!
>> >
>> > Power off is not a concern right now, because at that time the VM would
>> > disappear anyway.
>> >
>> > Our concern is more about if VM is still alive but we cannot detect it
>> > for a while. For example, a network glitch happened, CS lost connection to
>> > the host temporarily(control network), but the guest network is still
>> > working.
>> > HA would start another VR, which would possible result in 3 routers in
>> > the guest network(at least for a moment). Many of the policy focus on
>> > dealing these intermediate status. Also if you plug off the network cable of
>> > one host many things should happen...
>> >
>> >
>> > In RvR we want to make sure:
>> > 1. The status are self-governed, no need for CS to intervene.
>> > 2. MASTER would always get the latest rules. That means, if we cannot
>> > communicate with MASTER, we would turn to BACKUP and program the rule on it
>> > and make it MASTER - even we cannot communicate with MASTER at this time.
>> > And BACKUP should able to become MASTER if we request. This is achieved
>> > by using a script to bump up the priority of BACKUP.
>> > 3. Trying best to prevent the dual-MASTER situation. So we would program
>> > different priority for VRs and the MASTER/BACKUP status completely depends
>> > on priority.
>> >
>> > And if you take RvR as an alternative to VM's HA mechanism., it's not
>> > that counter intuitive in fact.
>> >
>> > --Sheng
>> >
>> >
>> > On Fri, Aug 23, 2013 at 1:56 AM, Roeland Kuipers <
>> > RKuipers@schubergphilis.com<ma...@schubergphilis.com>> wrote:
>> >
>> >> Hi Sheng,
>> >>
>> >> So far our testing showed no big problems. I've marked a redundant set
>> >> of routers to be ha_enabled by setting ha_enabled bit in the
>> >> vm_instance table. (This is our workaround ATM) We tested HA icw RvR
>> >> in the scenarios ,shutdown / force power off VM. In these scenarios HA
>> >> worked a treat and did restore the redundant pair as it should. And
>> >> keepalived nicely negotiated MASTER & BACKUP.
>> >> These are obviously basic tests, but we are happy to do some more
>> >> testing.
>> >>
>> >> I understand your concerns and am totally in favour of the KISS
>> >> principle.
>> >> What could be the scenario to end up with 3 routers?
>> >> Why is the situation complex to deal with? These are separate
>> >> mechanisms.
>> >> HA just makes sure the router is up and alive, and keepalived
>> >> negotiates the MASTER-BACKUP states according to the keepalived
>> >> configuration, unless there are 3 routers with conflicting configs. But
>> >> so far I do not understand the scenario where we could end up with 3
>> >> routers, so I cannot judge and/or test this.
>> >>
>> >> We like to see the hardcoded denial of HA in a redundant router setup
>> >> go for several reasons:
>> >> 1. It's counter intuitive - we configured an HA service offering on
>> >> purpose for the RvR's. And found out by accident that it was not
>> >> enabled at all.
>> >> 2. CS could implement a default offering without HA for this setup (to
>> >> keep it simple by default and keep currently forced behaviour), but if
>> >> users, like us, deliberately like to have HA, users can create a
>> >> custom offering with HA enabled
>> >>
>> >> This way it's configurable, doesn't change default behavior and is
>> >> more intuitive.
>> >>
>> >> Thanks & Cheers,
>> >> Roeland
>> >>
>> >>
>> >>
>> >> -----Original Message-----
>> >> From: Sheng Yang [mailto:sheng@yasker.org<ma...@yasker.org>]
>> >> Sent: vrijdag 23 augustus 2013 3:03
>> >> To: <de...@cloudstack.apache.org>>
>> >> Subject: Re: HA redundant virtual router
>> >>
>> >> It's a design choice, the only reason is it would be a very complex
>> >> situation to deal with. In fact the redundant router itself's policy
>> >> has already been very complex...
>> >>
>> >> We didn't look into details at the time of implementing redundant
>> >> router, but there are lots of concerns e.g. a network glitch may
>> >> result in 3 routers running in the network and potentially two of them
>> >> are in MASTER state.
>> >>
>> >> Of course discussion is welcome. We just want to keep it as simple as
>> >> possible at the time.
>> >>
>> >> --Sheng
>> >>
>> >>
>> >> On Thu, Aug 22, 2013 at 3:31 AM, Daan Hoogland <
>> >> DHoogland@schubergphilis.com<ma...@schubergphilis.com>
>> >> > wrote:
>> >>
>> >> > LS,
>> >> >
>> >> > Schuberg Philis guarantees 100% functional uptime for their
>> >> > customers.
>> >> > Infrastructure is of course part of this promise and the easier
>> >> > factor to provide strong levels of resiliency. For this reason we
>> >> > want to make use of redundant virtual routers together with HA
>> >> > functionality.
>> >> >
>> >> > We see HA and redundant routers as two different methods to provide
>> >> > higher levels of uptime.
>> >> >
>> >> >
>> >> > 1.      The redundant router setup takes care of seamless failover
>> >> > without lengthy hiccups in the case of a single router failure.
>> >> >
>> >> > 2.      HA takes care of restarting a failed VM or router. Restoring
>> >> > connectivity in the case of single router or restoring 2n resiliency
>> >> > in the case of a redundant router setup.
>> >> >
>> >> > The combination of these two methods will help us to meet our 100%
>> >> > promise. We need to restore 2N redundancy ASAP in the case of
>> >> > single component failure e.g. a router. With these two methods
>> >> > combined the system is more autonomous and doesn't need human
>> >> > intervention to restore redundancy.
>> >> >
>> >> > In the current situation we need to send a page to an on call
>> >> > engineer to restore redundancy asap, because of the tight SLA's.
>> >> > While if we could use HA icw redundant routers. The on-call guy can
>> >> > enjoy his sleep and will be a more happy guy :) The present code
>> >> > forces the HA offering to off on redundant routers which seems odd.
>> >> >
>> >> > So my question is: Why is it forced to off; Is there a technical
>> >> > restraint or is this a design choice we can discuss and maybe revise?
>> >> >
>> >> > Cheers,
>> >> >
>> >> >
>> >>
>> >
>
>

Re: HA redundant virtual router

Posted by Sheng Yang <sh...@yasker.org>.
Hi Daan,

As I said, I am writing a design doc to describe the current redundant
router policy, to help in understanding the redundant router. Currently it
doesn't support VPC, so how to implement it in VPC is still open for discussion.

--Sheng


On Thu, Aug 29, 2013 at 4:26 AM, Daan Hoogland <da...@gmail.com> wrote:

> Sheng,
>
> Just to make sure: you are going to write this document? I see Roeland
> understood your mail that way.
>
> When you do, I'd like you to keep in mind that we also want redundant
> routers within a VPC, to make ACS upgrades more seamless for
> customer application groups and DTAP streets. If you need any help
> writing such a doc, let me know.
>
> kind regards,
> Daan
>

Re: HA redundant virtual router

Posted by Daan Hoogland <da...@gmail.com>.
Sheng,

Just to make sure: you are going to write this document? I see Roeland
understood your mail that way.

When you do, I'd like you to keep in mind that we also want redundant
routers within a VPC, to make ACS upgrades more seamless for
customer application groups and DTAP streets. If you need any help
writing such a doc, let me know.

kind regards,
Daan


RE: HA redundant virtual router

Posted by Roeland Kuipers <RK...@schubergphilis.com>.
Hi Sheng,

Thanks for the info. Looking forward to the design doc; I trust this will make things clearer.
In the meantime we will be doing some research and thinking too, to see how we can improve things so that we can also have HA on the RvR in a safe way.
We will share this once it is ready.

Thanks,
Roeland


From: Sheng Yang [mailto:sheng@yasker.org]
Sent: donderdag 29 augustus 2013 0:19
To: <de...@cloudstack.apache.org>
Cc: int-cloud; Daan Hoogland
Subject: Re: HA redundant virtual router

Hi Roeland,

I will write a design doc to explain how the redundant router currently works. For example, regarding point 2, we have to force BACKUP to become MASTER because:

1. CS cannot communicate with MASTER at the time.
2. CS can communicate with BACKUP.
3. The rule has to be programmed immediately.
4. In case the old MASTER comes back, it should yield to the VR with the updated rule, rather than preempt the updated VR.

In this case, CS needs to communicate with the RvR to program the new rule, so it has to intervene in the RvR to ensure that, if only one VR got the rule, that VR becomes MASTER.
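
The four conditions above can be read as a small decision table. The following is only a minimal sketch under stated assumptions, not CloudStack code: `RulePlan` and `plan_rule_update` are hypothetical names, and the behaviour when both routers are reachable (program the rule on both) is an assumption that this thread does not spell out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RulePlan:
    """Hypothetical result type: where to program a rule, and whether to fail over."""
    program_on: tuple      # roles of the routers that receive the new rule
    promote_backup: bool   # True if BACKUP has to be forced to MASTER


def plan_rule_update(master_reachable: bool, backup_reachable: bool) -> RulePlan:
    """Decide where a new rule goes, given which routers CS can reach."""
    if master_reachable and backup_reachable:
        # Assumption: with both routers reachable, both receive the rule.
        return RulePlan(("MASTER", "BACKUP"), promote_backup=False)
    if master_reachable:
        return RulePlan(("MASTER",), promote_backup=False)
    if backup_reachable:
        # Conditions 1-3: only BACKUP can take the rule right now, so it has
        # to become MASTER. Condition 4 is why its priority gets raised.
        return RulePlan(("BACKUP",), promote_backup=True)
    # Neither router is reachable: nothing can be programmed.
    return RulePlan((), promote_backup=False)
```

Only the third branch requires CS to intervene in the MASTER/BACKUP election; in every other case keepalived is left alone.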

Still, I will write a doc later to try to cover every concern about the RvR design.

--Sheng

On Tue, Aug 27, 2013 at 3:40 AM, Roeland Kuipers <RK...@schubergphilis.com>> wrote:
Hi Sheng,

Thanks for your reply. I'll see if we can replay this scenario.

With respect to point 1: a good principle IMHO.

Point 2: Why do we force a keepalived node to become master instead of waiting for keepalived to elect a master itself? That way there is less reason to intervene and less risk of multiple masters, a behavior we have seen with RvR without HA in the past. The downside is that rule updates do not take effect until the backup becomes master. But maybe this is wise anyway, since something is wrong. This conflicts a bit with point 2 as we do intervene here.

Point 3: In my opinion keepalived is solid enough to leave this responsibility with keepalived; CS should just check the state and not fiddle with priorities to force masters, because there is obviously a reason why BACKUP refuses to become master.
I think we should let keepalived prevent multiple masters, as it is designed to do. Or am I missing something here?
Actually, in the scenario you described, with a functioning guest network, keepalived should be able to handle this situation if we make sure all routers have different priorities.
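
The "different priorities" condition can be checked mechanically before a third router is ever started. The sketch below is an illustration only; `priority_conflicts`, the router names, and the priority values are all hypothetical.

```python
from collections import defaultdict


def priority_conflicts(priorities):
    """Map each priority used by more than one router to those routers.

    `priorities` maps a router name to its keepalived priority. An empty
    result means every router has a distinct priority.
    """
    by_priority = defaultdict(list)
    for router, priority in priorities.items():
        by_priority[priority].append(router)
    return {p: sorted(names) for p, names in by_priority.items() if len(names) > 1}


# A redundant pair with distinct priorities has no conflicts.
pair = {"r-1": 100, "r-2": 99}

# A third router started with a reused priority is flagged.
trio = {"r-1": 100, "r-2": 99, "r-3": 100}
```

Here `priority_conflicts(pair)` is empty, while `priority_conflicts(trio)` reports that `r-1` and `r-3` share priority 100.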

I still have the opinion HA and RvR are different mechanisms.

So what do you think is necessary to make HA in combination with RvR possible? We have a clear business requirement to have this implemented in CS, and we have developers willing to make the changes needed.
We would also like to see RvR on VPCs and are willing to contribute this functionality.

Thanks for your feedback!

Cheers,
Roeland

-----Original Message-----
From: Sheng Yang [mailto:sheng@yasker.org<ma...@yasker.org>]
Sent: vrijdag 23 augustus 2013 23:25
To: <de...@cloudstack.apache.org>>
Subject: Re: HA redundant virtual router

Hi Roeland,

Thank you for your testing!

Power off is not a concern right now, because at that time the VM would disappear anyway.

Our concern is more about the case where the VM is still alive but we cannot detect it for a while. For example, a network glitch happens and CS temporarily loses its connection to the host (control network), but the guest network is still working.
HA would start another VR, which could result in 3 routers in the guest network (at least for a moment). Much of the policy focuses on dealing with these intermediate states. Also, if you unplug the network cable of one host, many things would happen...
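
The difference between a power-off and a control-network glitch can be made concrete with a toy model. This is a sketch under loud assumptions: the "naive HA" rule (every router CS cannot reach is treated as dead and replaced) is a simplification for illustration, and `Router` and `routers_after_ha` are hypothetical names, not CloudStack classes.

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    alive: bool               # the VM is actually running on its host
    control_reachable: bool   # CS can reach it over the control network


def routers_after_ha(routers, wanted=2):
    """Return the router names on the guest network after a naive HA pass.

    Naive HA (an assumption for illustration) treats every router that CS
    cannot reach as dead and starts replacements until `wanted` routers
    look healthy from the CS side.
    """
    looks_healthy = [r for r in routers if r.control_reachable]
    missing = max(0, wanted - len(looks_healthy))
    replacements = [f"replacement-{i + 1}" for i in range(missing)]
    # Routers that are still alive keep serving the guest network,
    # whether or not CS can see them.
    still_running = [r.name for r in routers if r.alive]
    return still_running + replacements


# Power-off: the second router is really gone, so HA restores the pair.
power_off = [Router("r-1", True, True), Router("r-2", False, False)]

# Glitch: the second router is alive but CS cannot see it for a while.
glitch = [Router("r-1", True, True), Router("r-2", True, False)]
```

In this model `routers_after_ha(power_off)` yields two routers, while `routers_after_ha(glitch)` yields three, which is the intermediate state the policy has to cope with.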


In RvR we want to make sure:
1. The status is self-governed, with no need for CS to intervene.
2. MASTER always gets the latest rules. That means that if we cannot communicate with MASTER, we turn to BACKUP, program the rule on it, and make it MASTER, even though we cannot communicate with MASTER at that time.
BACKUP should be able to become MASTER if we request it. This is achieved by using a script to bump up the priority of BACKUP.
3. We try our best to prevent the dual-MASTER situation. So we program a different priority for each VR, and the MASTER/BACKUP status depends completely on priority.
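
Points 2 and 3 together mean that the election is driven purely by priority, and that bumping BACKUP above MASTER is what makes a returning old MASTER yield. A minimal sketch of that idea follows; `elect_master` and `bump_backup` are hypothetical helpers, the priority values are made up, and the real script on the router is not shown in this thread.

```python
def elect_master(priorities):
    """Return the router with the highest priority (assumes distinct values)."""
    return max(priorities, key=priorities.get)


def bump_backup(priorities, backup, step=1):
    """Return a copy of `priorities` in which `backup` outranks every other router.

    The input mapping is left untouched. Real VRRP priorities live in an
    8-bit field, a limit this sketch does not enforce.
    """
    bumped = dict(priorities)
    highest_other = max(p for name, p in priorities.items() if name != backup)
    bumped[backup] = highest_other + step
    return bumped


before = {"r-1": 100, "r-2": 99}       # r-1 is MASTER, r-2 is BACKUP
after = bump_backup(before, "r-2")     # CS programmed the rule on r-2 only
```

With these values `elect_master(before)` is `r-1` and `elect_master(after)` is `r-2`, so when `r-1` comes back with its old priority it stays BACKUP instead of preempting the router that holds the updated rule.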

And if you take RvR as an alternative to the VM HA mechanism, it's in fact not that counterintuitive.

--Sheng


On Fri, Aug 23, 2013 at 1:56 AM, Roeland Kuipers < RKuipers@schubergphilis.com<ma...@schubergphilis.com>> wrote:

> Hi Sheng,
>
> So far our testing showed no big problems. I've marked a redundant set
> of routers as HA-enabled by setting the ha_enabled bit in the
> vm_instance table. (This is our workaround ATM.) We tested HA in
> combination with RvR in two scenarios: shutdown and forced power-off of
> the VM. In these scenarios HA worked a treat and restored the redundant
> pair as it should, and keepalived nicely negotiated MASTER & BACKUP.
> These are obviously basic tests, but we are happy to do some more testing.
>
> I understand your concerns and am totally in favour of the KISS principle.
> What could be the scenario to end up with 3 routers?
> Why is the situation complex to deal with? These are separate mechanisms.
> HA just makes sure the router is up and alive, and keepalived
> negotiates the MASTER-BACKUP states according to the keepalived
> configuration, unless there are 3 routers with conflicting configs. But
> so far I do not understand the scenario where we could end up with 3
> routers, so I cannot judge and/or test this.
>
> We would like to see the hardcoded denial of HA in a redundant router setup
> go, for several reasons:
> 1. It's counterintuitive: we configured an HA service offering on
> purpose for the RvRs, and found out by accident that it was not
> enabled at all.
> 2. CS could implement a default offering without HA for this setup (to
> keep it simple by default and keep the currently forced behaviour), but
> users who, like us, deliberately want HA can create a
> custom offering with HA enabled.
>
> This way it's configurable, doesn't change the default behavior, and is
> more intuitive.
>
> Thanks & Cheers,
> Roeland
>
>
>
> -----Original Message-----
> From: Sheng Yang [mailto:sheng@yasker.org<ma...@yasker.org>]
> Sent: vrijdag 23 augustus 2013 3:03
> To: <de...@cloudstack.apache.org>>
> Subject: Re: HA redundant virtual router
>
> It's a design choice, the only reason is it would be a very complex
> situation to deal with. In fact the redundant router itself's policy
> has already been very complex...
>
> We didn't look into details at the time of implementing redundant
> router, but there are lots of concerns e.g. a network glitch may
> result in 3 routers running in the network and potentially two of them
> are in MASTER state.
>
> Of course discussion is welcome. We just want to keep it as simple as
> possible at the time.
>
> --Sheng
>
>
> On Thu, Aug 22, 2013 at 3:31 AM, Daan Hoogland <
> DHoogland@schubergphilis.com<ma...@schubergphilis.com>
> > wrote:
>
> > LS,
> >
> > Schuberg Philis guarantees 100% functional uptime for their customers.
> > Infrastructure is of course part of this promise and the easier
> > factor to provide strong levels of resiliency. For this reason we
> > want to make use of redundant virtual routers together with HA functionality.
> >
> > We see HA and redundant routers as to different methods to provide
> > higher levels of uptime.
> >
> >
> > 1.      The redundant router setup takes care of seamless failover
> without
> > lengthy hick-ups in the case of a single router failure.
> >
> > 2.      HA takes care of restarting a failed VM or router. Restoring
> > connectivity in the case of single router or restoring 2n resiliency
> > in the case of a redundant router setup.
> >
> > The combination of these two methods will help us to meet our 100%
> > promise; .We need to restore 2N redundancy ASAP in the case of
> > single component failure e.g. a router. With these two methods
> > combined the system is more autonomous and doesn't need human
> > intervention to restore redundancy.
> >
> > In the current situation we need to send a page to an on call
> > engineer to restore redundancy asap, because of the tight SLA's.
> > While if we could use HA icw redundant routers. The on-call guy can
> > enjoy his sleep and will be a more happy guy :) The present code
> > forces the HA offering to off on redundant routers which seems odd.
> >
> > So my question is: Why is it forced to off; Is there a technical
> > restraint or is this a design choice we can discuss and maybe revise?
> >
> > Cheers,
> >
> >
>


Re: HA redundant virtual router

Posted by Sheng Yang <sh...@yasker.org>.
Hi Roeland,

I will write a design doc to explain how the redundant router currently
works. As an example, for point 2, we have to force BACKUP to become MASTER
because:

1. CS cannot communicate with MASTER at the time.
2. CS can communicate with BACKUP.
3. The rule has to be programmed immediately.
4. If the old MASTER comes back, it should yield to the VR with the updated
rule rather than preempt it.

In this case CS needs to communicate with the RvR to program the new rule,
so it has to intervene in the RvR to ensure that, if only one VR has
received the rule, that VR becomes MASTER.
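
A minimal sketch of the decision in points 1 to 4, assuming a toy `Router`
record and a hypothetical `pick_rule_target` helper. Neither name comes from
the CloudStack code, and on a real VR the priority bump is done by a script:

```python
from dataclasses import dataclass


@dataclass
class Router:
    name: str
    priority: int     # keepalived priority; higher wins the election
    reachable: bool   # can CS reach this VR right now?


def pick_rule_target(master: Router, backup: Router) -> Router:
    """Return the VR that receives the new rule (hypothetical helper)."""
    if master.reachable:
        return master
    if not backup.reachable:
        raise RuntimeError("no VR reachable, rule cannot be programmed")
    # MASTER is unreachable: program BACKUP and lift its priority above the
    # old MASTER, so a returning MASTER yields instead of preempting.
    backup.priority = master.priority + 1
    return backup


old_master = Router("r-1", priority=100, reachable=False)
backup = Router("r-2", priority=99, reachable=True)
target = pick_rule_target(old_master, backup)
print(target.name, target.priority)
```

The sketch only models the choice of target and the priority bump; the
election itself is still left to keepalived.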

Still, I will write a doc later that tries to cover every concern about the
RvR design.

--Sheng



RE: HA redundant virtual router

Posted by Roeland Kuipers <RK...@schubergphilis.com>.
Hi Sheng,

Thanks for your reply. I'll see if we can replay this scenario.

With respect to point 1: a good principle IMHO.

Point 2: Why do we force a keepalived node to become master instead of waiting for keepalived to elect one? Waiting would give less reason to intervene and less risk of multiple masters, a behavior we have seen with RvR without HA in the past. The downside is that rule updates do not take effect until the backup becomes master. But maybe that is wise anyway, since something is wrong. This conflicts a bit with point 2 as we do intervene here.

Point 3: In my opinion keepalived is solid enough to carry this responsibility, and CS should just check the state and not fiddle with priorities to force masters, because there is obviously a reason why BACKUP refuses to become master.
I think we should let keepalived prevent multiple masters, as it is designed to do. Or am I missing something here?
Actually, in the scenario you described, with a functioning guest network, keepalived should be able to handle the situation if we make sure all routers have different priorities.

I am still of the opinion that HA and RvR are different mechanisms.

So what do you think is needed to make HA in combination with RvR possible? We have a clear business requirement to have this implemented in CS, and we have developers willing to make the changes.
We would also like to see RvR on VPCs and are willing to contribute that functionality as well.

Thanks for your feedback!

Cheers,
Roeland


Re: HA redundant virtual router

Posted by Sheng Yang <sh...@yasker.org>.
Hi Roeland,

Thank you for your testing!

Power-off is not a concern right now, because in that case the VM
disappears anyway.

Our concern is more about the case where the VM is still alive but we
cannot detect it for a while. For example, a network glitch happens and CS
temporarily loses its connection to the host (control network), while the
guest network is still working. HA would then start another VR, which could
result in 3 routers in the guest network (at least for a moment). Much of
the policy focuses on dealing with these intermediate states. Also, if you
unplug the network cable of one host, many things would happen...

In RvR we want to make sure that:
1. The status is self-governed, with no need for CS to intervene.
2. MASTER always gets the latest rules. That means that if we cannot
communicate with MASTER, we turn to BACKUP, program the rule on it and
make it MASTER, even though we cannot communicate with MASTER at that time.
So BACKUP must be able to become MASTER on request. This is achieved by
using a script to bump up the priority of BACKUP.
3. We try our best to prevent the dual-MASTER situation. So we program a
different priority for each VR, and the MASTER/BACKUP status depends
completely on priority.
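
Point 3 and the three-router case can be illustrated with a toy model. It
assumes distinct priorities and that every VR on the same guest-network
segment hears the others' advertisements; `elect_master` is a hypothetical
helper, not keepalived's implementation:

```python
def elect_master(segment: dict) -> str:
    """Return the VR with the highest priority among VRs that hear each other.

    `segment` maps a VR name to its priority.
    """
    return max(segment, key=segment.get)


# Guest network intact: even with a third VR started by HA, one MASTER results.
guest = {"r-1": 100, "r-2": 99, "r-3": 98}
print(elect_master(guest))

# Guest network split in two: each side elects its own MASTER (dual MASTER).
left, right = {"r-1": 100}, {"r-2": 99, "r-3": 98}
print([elect_master(s) for s in (left, right)])
```

In this model the dual-MASTER case comes from VRs that cannot hear each
other, not from the number of VRs as such.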

And if you take RvR as an alternative to the VM HA mechanism, it's not
that counterintuitive in fact.

--Sheng



RE: HA redundant virtual router

Posted by Roeland Kuipers <RK...@schubergphilis.com>.
Hi Sheng,

So far our testing has shown no big problems. I've marked a redundant set of routers as HA-enabled by setting the ha_enabled bit in the vm_instance table. (This is our workaround at the moment.)
We tested HA in combination with RvR in two scenarios: shutdown and forced power-off of the VM. In these scenarios HA worked a treat and restored the redundant pair as it should, and keepalived nicely negotiated MASTER and BACKUP.
These are obviously basic tests, but we are happy to do some more testing.

I understand your concerns and am totally in favour of the KISS principle. What scenario could end up with 3 routers?
Why is the situation complex to deal with? These are separate mechanisms: HA just makes sure the router is up and alive, and keepalived negotiates the MASTER/BACKUP states according to the keepalived configuration, unless there are 3 routers with conflicting configs. But so far I do not understand the scenario in which we could end up with 3 routers, so I cannot judge and/or test this.

We would like to see the hardcoded denial of HA in a redundant router setup go, for several reasons:
1. It's counterintuitive: we configured an HA service offering on purpose for the RvRs, and found out by accident that it was not enabled at all.
2. CS could implement a default offering without HA for this setup (to keep it simple by default and keep the currently forced behaviour), but users who, like us, deliberately want HA could create a custom offering with HA enabled.

This way it's configurable, doesn't change default behavior and is more intuitive.
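
As a sketch of the proposal (the function names are hypothetical and not taken from the CloudStack code), the forced behaviour and the offering-driven behaviour differ only in whether the offering's HA flag is honoured for redundant routers:

```python
def ha_enabled_forced(offering_ha: bool, redundant: bool) -> bool:
    """Behaviour as described in this thread: HA is forced off for RvR."""
    return offering_ha and not redundant


def ha_enabled_proposed(offering_ha: bool, redundant: bool) -> bool:
    """Proposal: the service offering alone decides, also for RvR."""
    return offering_ha


# A default RvR offering without HA behaves the same under both variants;
# only a custom offering with HA enabled behaves differently.
for offering_ha in (False, True):
    print(offering_ha,
          ha_enabled_forced(offering_ha, redundant=True),
          ha_enabled_proposed(offering_ha, redundant=True))
```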

Thanks & Cheers,
Roeland




Re: HA redundant virtual router

Posted by Sheng Yang <sh...@yasker.org>.
It's a design choice; the only reason is that it would be a very complex
situation to deal with. In fact, the redundant router's own policy is
already very complex...

We didn't look into the details at the time of implementing the redundant
router, but there are lots of concerns, e.g. a network glitch may result in
3 routers running in the network, with potentially two of them in MASTER
state.

Of course discussion is welcome. We just wanted to keep it as simple as
possible at the time.

--Sheng

