Posted to users@cloudstack.apache.org by Logan Barfield <lb...@tqhosting.com> on 2015/01/06 20:45:33 UTC

Multi-Datacenter Deployment

We are currently running a single location CloudStack deployment:
- 1 Hardware firewall
- 1 Management/Database Server
- 1 NFS staging store (for S3 secondary storage)
- Ceph RBD for primary storage
- 4 Hypervisors
- 1 Zone/Pod/Cluster

We are looking to expand our deployment to other datacenters, and I'm
trying to determine the best way to go about it.  The documentation is a
bit lacking for multi-site deployments.

Our goal for the multi-site deployment is to have a zone for each site
(E.G. US East, US West, Europe) that our customers can use to deploy
instances in their preferred geographic area.

Since we don't want to have different accounts for every datacenter, I
don't think using Regions makes sense for us (and I'm not sure what they're
actually good for without keeping accounts/users/domains in sync).

Right now I'm thinking our setup will be as follows:
- Firewall, Management Server, NFS staging server, primary storage, and
Hypervisors in each datacenter.
- All Management servers will be on the same management network.
- Management servers will be connected via site-to-site VPN links over WAN.
- MySQL replication (Percona?) will be set up on the management servers,
with an odd number of servers to protect against split brain, and with
redundant database backups.
- One region (default)
- One zone for each datacenter
- Geo-enabled DNS to direct customers to the nearest Management server
- Object storage for secondary storage across cloud.

My primary concerns with this setup are:
- I haven't really seen multi-site deployment details anywhere.
- Potential for split-brain.
- How will HA be handled (e.g., if a VPN link goes down and one of the
remote management servers can't contact a host, will it try to initiate
HA?) - This sort of goes along with the split brain problem.

Are my assumptions here sound, or is there a standard recommended way of
doing multi-site deployments?

Any suggestions are much appreciated.

RE: Multi-Datacenter Deployment

Posted by Paul Angus <pa...@shapeblue.com>.
Hi Logan,

The biggest issue to deal with in a multi-datacenter deployment is the database, specifically latency between database nodes and between the CloudStack management servers and the database.  Hosts in remote zones quite happily talk back to remote management servers over WAN links.

Currently CloudStack employs record locking, which is not friendly to multi-master writes, so you can really only write to one node (this is something that is being looked at).

As the latency between the management servers and the hosts is far less important than the latency of DB traffic, the standard configuration is to have a master zone which contains your primary management server(s) and a MySQL master and slave pair. You then have a second zone (chosen largely by the latency on the data link) which contains secondary management server(s) and a MySQL slave to the primary zone's MySQL master (and probably a slave to that slave; I'll call it a secondary slave).  You would then use global server load balancing (GSLB) to switch from the primary zone to the secondary zone in the event of a primary zone failure.  The slave in the secondary zone becomes the master and the secondary slave becomes 'the' slave.

All other zones use the primary zone mgmt. infrastructure unless GSLB directs them to the secondary zone.
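
The role changes described above can be sketched in a few lines of Python. This is only a model for reasoning about the topology, not anything CloudStack, MySQL, or a GSLB product provides: the node names, the DbNode class, and fail_over_to_secondary are all hypothetical, and a real promotion would also involve repointing replication and the management servers.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass(frozen=True)
class DbNode:
    """Hypothetical description of one MySQL node in the layout."""
    name: str
    zone: str                       # "primary" or "secondary"
    role: str                       # "master" or "slave"
    replicates_from: Optional[str] = None


def build_topology() -> Dict[str, DbNode]:
    """Layout from the text: master/slave pair in the primary zone,
    a slave plus a 'secondary slave' in the secondary zone."""
    nodes = [
        DbNode("p-master", "primary", "master"),
        DbNode("p-slave", "primary", "slave", "p-master"),
        DbNode("s-slave", "secondary", "slave", "p-master"),
        DbNode("s-slave2", "secondary", "slave", "s-slave"),
    ]
    return {n.name: n for n in nodes}


def fail_over_to_secondary(nodes: Dict[str, DbNode]) -> Dict[str, DbNode]:
    """Model a primary-zone failure: drop the primary-zone nodes and
    promote the secondary-zone slave to master. The input is not modified."""
    survivors = {k: v for k, v in nodes.items() if v.zone != "primary"}
    survivors["s-slave"] = replace(
        survivors["s-slave"], role="master", replicates_from=None
    )
    return survivors


if __name__ == "__main__":
    for node in fail_over_to_secondary(build_topology()).values():
        print(node)
```

After the modelled failover, s-slave is the master and s-slave2 keeps replicating from it, which matches the "secondary slave becomes 'the' slave" step.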


Regards,

Paul Angus
Cloud Architect
S: +44 20 3603 0540 | M: +447711418784 | T: @CloudyAngus
paul.angus@shapeblue.com


Re: Multi-Datacenter Deployment

Posted by Logan Barfield <lb...@tqhosting.com>.
Thought so, just trying to look at all the potential options right now.
We'll probably end up going the single region route, but the idea of a
whole zone rebooting because of misbehaving HA worries me.  We're going
from a traditional single node, non-HA setup.  We're used to having a
single node go down (hence our interest in HA), but having a whole
datacenter's worth of customer VMs rebooting at once would be a nightmare.


Thank You,

Logan Barfield
Tranquil Hosting

On Wed, Jan 7, 2015 at 3:57 PM, Simon Weller <sw...@ena.com> wrote:

> Regions are designed to be completely separate from one another, so no, as
> far as I'm aware there is no way to sync secondary storage data between
> them. I don't think you'd want to do that anyway, as it defeats the purpose
> of maintaining an isolated cloud region from another.
>
> - Si


Re: Multi-Datacenter Deployment

Posted by Simon Weller <sw...@ena.com>.
Regions are designed to be completely separate from one another, so no, as far as I'm aware there is no way to sync secondary storage data between them. I don't think you'd want to do that anyway, as it defeats the purpose of keeping one cloud region isolated from another.

- Si


________________________________________
From: Logan Barfield <lb...@tqhosting.com>
Sent: Wednesday, January 07, 2015 2:00 PM
To: dev@cloudstack.apache.org
Cc: users@cloudstack.apache.org
Subject: Re: Multi-Datacenter Deployment

A followup here:  You can't have secondary storage that spans regions (e.g.,
templates/snapshots in sync), even with S3/Swift, correct?  If not, that's
another downside to regions on top of the account sync.

It seems like the best solution to prevent weird split-brain/HA issues
would be to have at least 3 databases set up as master/master/master with
quorum.  That way if two sites lose contact and re-establish there's a 2/1
majority saying the hosts are all reachable.  Would hopefully prevent the
ones that lost contact from kicking off HA immediately.  I don't even know
how feasible that would be; maybe with Galera?

Even then it would have to be on a table level since there would be a
conflict, for instance:
- Given sites 1, 2, and 3, where site 1 loses contact with site 2 and comes
back up
- Site 1: Thinks site 1 is up and site 2 is down
- Site 2: Thinks site 2 is up and 1 is down.
- Site 3: Thinks all sites are up.

In the above case the least harmful thing would be to push site 3 to the
other two, but since all three sites have different data it may just hang
instead.
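
The majority idea in the three-site example can be sketched in Python. This is an illustration under stated assumptions, not how Galera or CloudStack HA actually decides anything: the function name sites_declared_down and the shape of the reports dictionary are hypothetical, and the rule assumed here is that a site only counts as down when a strict majority of the reporting sites say so.

```python
from typing import Dict, List


def sites_declared_down(reports: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return the sites that a strict majority of observers report as down.

    reports maps each observing site to its own view of every site
    (True = reachable, False = unreachable). A site that cannot report
    at all is simply absent from the outer dictionary.
    """
    targets = sorted({t for view in reports.values() for t in view})
    down = []
    for target in targets:
        votes = [view[target] for view in reports.values() if target in view]
        down_votes = sum(1 for is_up in votes if not is_up)
        # Strict majority required; a tie never declares a site down.
        if down_votes * 2 > len(votes):
            down.append(target)
    return down


if __name__ == "__main__":
    # The scenario above: sites 1 and 2 each think the other is down,
    # site 3 can see everyone.
    reports = {
        "site1": {"site1": True, "site2": False, "site3": True},
        "site2": {"site1": False, "site2": True, "site3": True},
        "site3": {"site1": True, "site2": True, "site3": True},
    }
    print(sites_declared_down(reports))
```

With the views from the example, each of sites 1 and 2 gets only one "down" vote out of three, so no site is declared down and nothing would trigger HA under this rule. It says nothing about reconciling the databases themselves, which is the harder problem raised above.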

This is going to drive me nuts. :D


Thank You,

Logan Barfield
Tranquil Hosting

On Wed, Jan 7, 2015 at 12:57 PM, Simon Weller <sw...@ena.com> wrote:

> See inline.
> ________________________________________
> From: Logan Barfield <lb...@tqhosting.com>
> Sent: Wednesday, January 07, 2015 11:43 AM
> To: dev@cloudstack.apache.org
> Cc: users@cloudstack.apache.org
> Subject: Re: Multi-Datacenter Deployment
>
> I appreciate the explanation.  That seems to confirm what I was thinking,
> that until regions are working 100% we'll just have to make sure the
> DC-to-DC links are as stable/redundant as possible to prevent HA issues.
> If we increase the HA delay it shouldn't be a major issue, and it will
> still be better than nothing.
>
> For us it probably also makes sense to not worry about having management
> servers in each DC for now.  If we have a big enough outage in our primary
> DC to affect access to the management server we probably have bigger
> problems to worry about.
>
> > Yeah, I agree. Even with Mgmt down, it's not going to stop any existing
> > services from running or functioning as long as the clusters are healthy.
>
> - Si
>
> Much appreciated!
>
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting
>
> On Wed, Jan 7, 2015 at 12:15 PM, Simon Weller <sw...@ena.com> wrote:
>
> > Logan,
> >
> > We currently run CS in multiple geographically separate DCs, and may be
> > able to give you a little insight into things.
> >
> > We run KVM in advanced networking mode, with CLVM clusters backed onto
> > Dell Compellent SANs. Each DC currently runs its own zone, all in a
> > single region. We've been running CS in production since 4.0, prior to
> > regions, so that functionality (along with its limitations) hasn't been
> > something we've adopted yet. We run our Management (with multiple
> > clustered nodes) out of 1 DC, and have a backup set of Management nodes
> > in another DC should we need to invoke BCDR in the event the primary
> > Management nodes become unavailable.
> >
> > Your concerns regarding HA problems are well founded. We run our own
> > nationwide MPLS backbone, and therefore have multiple high-capacity
> > paths between our different DCs, and even with that capacity and
> > fault-tolerant design, we've seen issues where Management has attempted
> > to invoke HA due to brief loss of connectivity (typically due to
> > maintenance or grooming activity), and this can be quite problematic.
> > VPN tunnels are going to be very challenging for you, and you really
> > need to look at VPLS or some other technology that can layer on top of a
> > resilient infrastructure with multiple paths and fast failover (e.g.
> > MPLS Fast Reroute).
> >
> > Ideally, regions should solve this with dedicated local management nodes,
> > but until the syncing is sorted out, and those newer releases are stable,
> > there isn't much option right now other than using a single region, short
> > of setting up a completely separate CS instance per DC.
> >
> > Hope this helps a little.
> >
> > - Si
>

Re: Multi-Datacenter Deployment

Posted by Simon Weller <sw...@ena.com>.
Regions are designed to be completely separate from one other, so no, as far as I'm aware there is no way to sync secondary storage data between them. I don't think you'd want to do that anyway, as it defeats the purpose of maintaining an isolated cloud region from another.

- Si


________________________________________
From: Logan Barfield <lb...@tqhosting.com>
Sent: Wednesday, January 07, 2015 2:00 PM
To: dev@cloudstack.apache.org
Cc: users@cloudstack.apache.org
Subject: Re: Multi-Datacenter Deployment

A followup here:  You can't have secondary storage that spans regions (e.g,
templates/snapshots in sync), even with S3/Swift correct?  If not that's
another downside to regions on top of the account sync.

It seems like the best solution to prevent weird split-brain/HA issues
would be to have at least 3 databases set up as master/master/master with
quorum.  That way if two sites lose contact and re-establish there's a 2/1
majority saying the hosts are all reachable.  Would hopefully prevent the
ones that lost contact from kicking off HA immediately.  I don't even know
how feasible that would be; maybe with Galera?

Even then it would have to be on a table level since there would be a
conflict, for instance:
- Given sites 1, 2, and 3, where site 1 loses contact with site 2 and comes
back up
- Site 1: Thinks site 1 is up and site 2 is down
- Site 2: Thinks site 2 is up and 1 is down.
- Site 3: Thinks all sites are up.

In the above case the least harmful thing would be to push site 3 to the
other two, but since all three sites have different data it may just hang
instead.

This is going to drive me nuts. :D


Thank You,

Logan Barfield
Tranquil Hosting

On Wed, Jan 7, 2015 at 12:57 PM, Simon Weller <sw...@ena.com> wrote:

> See inline.
> ________________________________________
> From: Logan Barfield <lb...@tqhosting.com>
> Sent: Wednesday, January 07, 2015 11:43 AM
> To: dev@cloudstack.apache.org
> Cc: users@cloudstack.apache.org
> Subject: Re: Multi-Datacenter Deployment
>
> I appreciate the explanation.  That seems to confirm what I was thinking,
> that until regions are working 100% we'll just have to make sure the
> DC-to-DC links are as stable/redundant as possible to prevent HA issues.
> If we increase the HA delay it shouldn't be a major issue, and it will
> still be better than nothing.
>
> For us it probably also makes sense to not worry about having management
> servers in each DC for now.  If we have a big enough outage in our primary
> DC to affect access to the management server we probably have bigger
> problems to worry about.
>
> > Yeah, I agree. Even with Mgmt down, it's not going to stop any existing
> services from running or functioning as long as the clusters are healthy.
>
> - Si
>
> Much appreciated!
>
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting
>
> On Wed, Jan 7, 2015 at 12:15 PM, Simon Weller <sw...@ena.com> wrote:
>
> > Logan,
> >
> > We currently run CS in multiple geographically separate DCs, and may be
> > able to give you a little insight into things.
> >
> > We run KVM in advanced networking mode, with CLVM clusters backed onto
> > Dell Compellent SANs. We currently have different DCs running different
> > zones per DC, in a single region. We've been running CS in production now
> > since 4.0 prior to regions, so that functionality (along with its
> > limitations) hasn't been something we've adopted yet. We run our
> Management
> > (With Multiple clustered nodes) out of 1 DC, and have a backup set of
> > Management Nodes in another DC should we need to invoke BCDR in the event
> > the primary Management nodes became unavailable.
> >
> > Your concerns regarding HA problems are founded. We run our own
> nationwide
> > MPLS backbone, and therefore have multiple high capacity bandwidth paths
> > between our different DCs, and even with that capacity and fault tolerant
> > design, we've seen issues where Management has attempted to invoke HA due
> > to brief loss of connectivity (typically due to maintenance or grooming
> > activity), and this can be quite problematic. VPN tunnels are going to be
> > very challenging for you, and you really need to look at VPLS or some
> other
> > technology that can layer on top of a resilient infrastructure with
> > multiple paths and fast failover (e.g. MPLS Fast Reroute).
> >
> > Ideally, regions should solve this with dedicated local management nodes,
> > but until the syncing is sorted out, and those newer releases are stable,
> > there isn't much option right now short of using a single region or
> > setting up completely separate CS instances per DC.
> >
> > Hope this helps a little.
> >
> > - Si
> >
> > ________________________________________
> > From: Logan Barfield <lb...@tqhosting.com>
> > Sent: Tuesday, January 06, 2015 1:45 PM
> > To: dev@cloudstack.apache.org; users@cloudstack.apache.org
> > Subject: Multi-Datacenter Deployment
> >
> > We are currently running a single location CloudStack deployment:
> > - 1 Hardware firewall
> > - 1 Management/Database Server
> > - 1 NFS staging store (for S3 secondary storage)
> > - Ceph RBD for primary storage
> > - 4 Hypervisors
> > - 1 Zone/Pod/Cluster
> >
> > We are looking to expand our deployment to other datacenters, and I'm
> > trying to determine the best way to go about it.  The documentation is a
> > bit lacking for multi-site deployments.
> >
> > Our goal for the multi-site deployment is to have a zone for each site
> > (E.G. US East, US West, Europe) that our customers can use to deploy
> > instances in their preferred geographic area.
> >
> > Since we don't want to have different accounts for every datacenter, I
> > don't think using Regions makes sense for us (and I'm not sure what
> they're
> > actually good for without keeping accounts/users/domains in sync).
> >
> > Right now I'm thinking our setup will be as follows:
> > - Firewall, Management Server, NFS staging server, primary storage, and
> > Hypervisors in each datacenter.
> > - All Management servers will be on the same management network.
> > - Management servers will be connected via site-to-site VPN links over
> WAN.
> > - MySQL replication (Percona?) will be set up on the management servers.
> > Having an odd number of servers to protect against split brain, and
> keeping
> > redundant database backups.
> > - One region (default)
> > - One zone for each datacenter
> > - Geo-enabled DNS to direct customers to the nearest Management server
> > - Object storage for secondary storage across cloud.
> >
> > My primary concerns with this setup are:
> > - I haven't really seen multi-site deployments details anywhere.
> > - Potential for split-brain.
> > - How will HA be handled (e.g., if a VPN link goes down and one of the
> > remote management servers can't contact a host, will it try to initiate
> > HA?) - This sort of goes along with the split brain problem.
> >
> > Are my assumptions here sound, or is there a standard recommended way of
> > doing multi-site deployments?
> >
> > Any suggestions are much appreciated.
> >
>

Re: Multi-Datacenter Deployment

Posted by Logan Barfield <lb...@tqhosting.com>.
A followup here:  You can't have secondary storage that spans regions (e.g.,
templates/snapshots in sync), even with S3/Swift, correct?  If not, that's
another downside to regions on top of the account sync.

It seems like the best solution to prevent weird split-brain/HA issues
would be to have at least 3 databases set up as master/master/master with
quorum.  That way if two sites lose contact and re-establish there's a 2/1
majority saying the hosts are all reachable.  Would hopefully prevent the
ones that lost contact from kicking off HA immediately.  I don't even know
how feasible that would be; maybe with Galera?

Even then it would have to be on a table level since there would be a
conflict, for instance:
- Given sites 1, 2, and 3, where site 1 loses contact with site 2 and comes
back up
- Site 1: Thinks site 1 is up and site 2 is down
- Site 2: Thinks site 2 is up and 1 is down.
- Site 3: Thinks all sites are up.

In the above case the least harmful thing would be to push site 3 to the
other two, but since all three sites have different data it may just hang
instead.
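
To make the three-site scenario above concrete, here is a rough sketch of the
voting idea only. It is not a description of how Galera or any other
replication product resolves conflicts; the site numbers, the `views` data,
and both function names are assumptions made up for illustration. It compares
voting on each site's whole view against voting on each site's status
individually.

```python
from collections import Counter

# Hypothetical data: each site's view of which sites are up after
# site 1 and site 2 lost contact with each other.
views = {
    1: {1, 3},      # site 1 thinks site 2 is down
    2: {2, 3},      # site 2 thinks site 1 is down
    3: {1, 2, 3},   # site 3 thinks all sites are up
}

def whole_view_majority(views):
    """Return the complete view held by a strict majority of sites, or None."""
    counts = Counter(frozenset(v) for v in views.values())
    view, votes = counts.most_common(1)[0]
    return set(view) if votes * 2 > len(views) else None

def per_site_majority(views):
    """Vote on each site separately: a site is up if most voters say so."""
    return {s for s in views
            if sum(s in v for v in views.values()) * 2 > len(views)}

print(whole_view_majority(views))  # None
print(per_site_majority(views))    # {1, 2, 3}
```

Voting on whole views finds no majority because all three views differ, which
is the "may just hang" case. Voting per site gives a 2/1 majority that every
site is up, which matches site 3's view.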

This is going to drive me nuts. :D


Thank You,

Logan Barfield
Tranquil Hosting

On Wed, Jan 7, 2015 at 12:57 PM, Simon Weller <sw...@ena.com> wrote:

> See inline.
> ________________________________________
> From: Logan Barfield <lb...@tqhosting.com>
> Sent: Wednesday, January 07, 2015 11:43 AM
> To: dev@cloudstack.apache.org
> Cc: users@cloudstack.apache.org
> Subject: Re: Multi-Datacenter Deployment
>
> I appreciate the explanation.  That seems to confirm what I was thinking,
> that until regions are working 100% we'll just have to make sure the
> DC-to-DC links are as stable/redundant as possible to prevent HA issues.
> If we increase the HA delay it shouldn't be a major issue, and it will
> still be better than nothing.
>
> For us it probably also makes sense to not worry about having management
> servers in each DC for now.  If we have a big enough outage in our primary
> DC to affect access to the management server we probably have bigger
> problems to worry about.
>
> > Yeah, I agree. Even with Mgmt down, it's not going to stop any existing
> services from running or functioning as long as the clusters are healthy.
>
> - Si
>
> Much appreciated!
>
>
> Thank You,
>
> Logan Barfield
> Tranquil Hosting
>
> On Wed, Jan 7, 2015 at 12:15 PM, Simon Weller <sw...@ena.com> wrote:
>
> > Logan,
> >
> > We currently run CS in multiple geographically separate DCs, and may be
> > able to give you a little insight into things.
> >
> > We run KVM in advanced networking mode, with CLVM clusters backed onto
> > Dell Compellent SANs. We currently have different DCs running different
> > zones per DC, in a single region. We've been running CS in production now
> > since 4.0 prior to regions, so that functionality (along with its
> > limitations) hasn't been something we've adopted yet. We run our
> Management
> > (With Multiple clustered nodes) out of 1 DC, and have a backup set of
> > Management Nodes in another DC should we need to invoke BCDR in the event
> > the primary Management nodes became unavailable.
> >
> > Your concerns regarding HA problems are founded. We run our own
> nationwide
> > MPLS backbone, and therefore have multiple high capacity bandwidth paths
> > between our different DCs, and even with that capacity and fault tolerant
> > design, we've seen issues where Management has attempted to invoke HA due
> > to brief loss of connectivity (typically due to maintenance or grooming
> > activity), and this can be quite problematic. VPN tunnels are going to be
> > very challenging for you, and you really need to look at VPLS or some
> other
> > technology that can layer on top of a resilient infrastructure with
> > multiple paths and fast failover (e.g. MPLS Fast Reroute).
> >
> > Ideally, regions should solve this with dedicated local management nodes,
> > but until the syncing is sorted out, and those newer releases are stable,
> > there isn't much option right now short of using a single region or
> > setting up completely separate CS instances per DC.
> >
> > Hope this helps a little.
> >
> > - Si
> >
> > ________________________________________
> > From: Logan Barfield <lb...@tqhosting.com>
> > Sent: Tuesday, January 06, 2015 1:45 PM
> > To: dev@cloudstack.apache.org; users@cloudstack.apache.org
> > Subject: Multi-Datacenter Deployment
> >
> > We are currently running a single location CloudStack deployment:
> > - 1 Hardware firewall
> > - 1 Management/Database Server
> > - 1 NFS staging store (for S3 secondary storage)
> > - Ceph RBD for primary storage
> > - 4 Hypervisors
> > - 1 Zone/Pod/Cluster
> >
> > We are looking to expand our deployment to other datacenters, and I'm
> > trying to determine the best way to go about it.  The documentation is a
> > bit lacking for multi-site deployments.
> >
> > Our goal for the multi-site deployment is to have a zone for each site
> > (E.G. US East, US West, Europe) that our customers can use to deploy
> > instances in their preferred geographic area.
> >
> > Since we don't want to have different accounts for every datacenter, I
> > don't think using Regions makes sense for us (and I'm not sure what
> they're
> > actually good for without keeping accounts/users/domains in sync).
> >
> > Right now I'm thinking our setup will be as follows:
> > - Firewall, Management Server, NFS staging server, primary storage, and
> > Hypervisors in each datacenter.
> > - All Management servers will be on the same management network.
> > - Management servers will be connected via site-to-site VPN links over
> WAN.
> > - MySQL replication (Percona?) will be set up on the management servers.
> > Having an odd number of servers to protect against split brain, and
> keeping
> > redundant database backups.
> > - One region (default)
> > - One zone for each datacenter
> > - Geo-enabled DNS to direct customers to the nearest Management server
> > - Object storage for secondary storage across cloud.
> >
> > My primary concerns with this setup are:
> > - I haven't really seen multi-site deployments details anywhere.
> > - Potential for split-brain.
> > - How will HA be handled (e.g., if a VPN link goes down and one of the
> > remote management servers can't contact a host, will it try to initiate
> > HA?) - This sort of goes along with the split brain problem.
> >
> > Are my assumptions here sound, or is there a standard recommended way of
> > doing multi-site deployments?
> >
> > Any suggestions are much appreciated.
> >
>

Re: Multi-Datacenter Deployment

Posted by Simon Weller <sw...@ena.com>.
See inline.
________________________________________
From: Logan Barfield <lb...@tqhosting.com>
Sent: Wednesday, January 07, 2015 11:43 AM
To: dev@cloudstack.apache.org
Cc: users@cloudstack.apache.org
Subject: Re: Multi-Datacenter Deployment

I appreciate the explanation.  That seems to confirm what I was thinking,
that until regions are working 100% we'll just have to make sure the
DC-to-DC links are as stable/redundant as possible to prevent HA issues.
If we increase the HA delay it shouldn't be a major issue, and it will
still be better than nothing.

For us it probably also makes sense to not worry about having management
servers in each DC for now.  If we have a big enough outage in our primary
DC to affect access to the management server we probably have bigger
problems to worry about.

> Yeah, I agree. Even with Mgmt down, it's not going to stop any existing services from running or functioning as long as the clusters are healthy.

- Si

Much appreciated!


Thank You,

Logan Barfield
Tranquil Hosting

On Wed, Jan 7, 2015 at 12:15 PM, Simon Weller <sw...@ena.com> wrote:

> Logan,
>
> We currently run CS in multiple geographically separate DCs, and may be
> able to give you a little insight into things.
>
> We run KVM in advanced networking mode, with CLVM clusters backed onto
> Dell Compellent SANs. We currently have different DCs running different
> zones per DC, in a single region. We've been running CS in production now
> since 4.0 prior to regions, so that functionality (along with its
> limitations) hasn't been something we've adopted yet. We run our Management
> (With Multiple clustered nodes) out of 1 DC, and have a backup set of
> Management Nodes in another DC should we need to invoke BCDR in the event
> the primary Management nodes became unavailable.
>
> Your concerns regarding HA problems are founded. We run our own nationwide
> MPLS backbone, and therefore have multiple high capacity bandwidth paths
> between our different DCs, and even with that capacity and fault tolerant
> design, we've seen issues where Management has attempted to invoke HA due
> to brief loss of connectivity (typically due to maintenance or grooming
> activity), and this can be quite problematic. VPN tunnels are going to be
> very challenging for you, and you really need to look at VPLS or some other
> technology that can layer on top of a resilient infrastructure with
> multiple paths and fast failover (e.g. MPLS Fast Reroute).
>
> Ideally, regions should solve this with dedicated local management nodes,
> but until the syncing is sorted out, and those newer releases are stable,
> there isn't much option right now short of using a single region or
> setting up completely separate CS instances per DC.
>
> Hope this helps a little.
>
> - Si
>
> ________________________________________
> From: Logan Barfield <lb...@tqhosting.com>
> Sent: Tuesday, January 06, 2015 1:45 PM
> To: dev@cloudstack.apache.org; users@cloudstack.apache.org
> Subject: Multi-Datacenter Deployment
>
> We are currently running a single location CloudStack deployment:
> - 1 Hardware firewall
> - 1 Management/Database Server
> - 1 NFS staging store (for S3 secondary storage)
> - Ceph RBD for primary storage
> - 4 Hypervisors
> - 1 Zone/Pod/Cluster
>
> We are looking to expand our deployment to other datacenters, and I'm
> trying to determine the best way to go about it.  The documentation is a
> bit lacking for multi-site deployments.
>
> Our goal for the multi-site deployment is to have a zone for each site
> (E.G. US East, US West, Europe) that our customers can use to deploy
> instances in their preferred geographic area.
>
> Since we don't want to have different accounts for every datacenter, I
> don't think using Regions makes sense for us (and I'm not sure what they're
> actually good for without keeping accounts/users/domains in sync).
>
> Right now I'm thinking our setup will be as follows:
> - Firewall, Management Server, NFS staging server, primary storage, and
> Hypervisors in each datacenter.
> - All Management servers will be on the same management network.
> - Management servers will be connected via site-to-site VPN links over WAN.
> - MySQL replication (Percona?) will be set up on the management servers.
> Having an odd number of servers to protect against split brain, and keeping
> redundant database backups.
> - One region (default)
> - One zone for each datacenter
> - Geo-enabled DNS to direct customers to the nearest Management server
> - Object storage for secondary storage across cloud.
>
> My primary concerns with this setup are:
> - I haven't really seen multi-site deployments details anywhere.
> - Potential for split-brain.
> - How will HA be handled (e.g., if a VPN link goes down and one of the
> remote management servers can't contact a host, will it try to initiate
> HA?) - This sort of goes along with the split brain problem.
>
> Are my assumptions here sound, or is there a standard recommended way of
> doing multi-site deployments?
>
> Any suggestions are much appreciated.
>

Re: Multi-Datacenter Deployment

Posted by Logan Barfield <lb...@tqhosting.com>.
I appreciate the explanation.  That seems to confirm what I was thinking,
that until regions are working 100% we'll just have to make sure the
DC-to-DC links are as stable/redundant as possible to prevent HA issues.
If we increase the HA delay it shouldn't be a major issue, and it will
still be better than nothing.
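
As a rough illustration of why a longer HA delay helps with brief link loss,
here is a minimal sketch. The function names, the outage durations, and the
delay values are all hypothetical; they are not CloudStack settings or
measured data. The only assumption is that HA starts once a host has been
unreachable for at least the configured delay.

```python
def should_start_ha(unreachable_seconds, ha_delay_seconds):
    """HA is triggered only once the host has been unreachable for the full delay."""
    return unreachable_seconds >= ha_delay_seconds

def outages_triggering_ha(outages, ha_delay_seconds):
    """Return the outage durations (in seconds) long enough to trigger HA."""
    return [o for o in outages if should_start_ha(o, ha_delay_seconds)]

# Hypothetical DC-to-DC link outages, in seconds.
outages = [20, 45, 90, 600]

print(outages_triggering_ha(outages, 60))   # [90, 600]
print(outages_triggering_ha(outages, 300))  # [600]
```

The trade-off is that a longer delay also postpones recovery when a host has
really failed, so the value needs to be balanced against how long an outage we
can tolerate.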

For us it probably also makes sense to not worry about having management
servers in each DC for now.  If we have a big enough outage in our primary
DC to affect access to the management server we probably have bigger
problems to worry about.

Much appreciated!


Thank You,

Logan Barfield
Tranquil Hosting

On Wed, Jan 7, 2015 at 12:15 PM, Simon Weller <sw...@ena.com> wrote:

> Logan,
>
> We currently run CS in multiple geographically separate DCs, and may be
> able to give you a little insight into things.
>
> We run KVM in advanced networking mode, with CLVM clusters backed onto
> Dell Compellent SANs. We currently have different DCs running different
> zones per DC, in a single region. We've been running CS in production now
> since 4.0 prior to regions, so that functionality (along with its
> limitations) hasn't been something we've adopted yet. We run our Management
> (With Multiple clustered nodes) out of 1 DC, and have a backup set of
> Management Nodes in another DC should we need to invoke BCDR in the event
> the primary Management nodes became unavailable.
>
> Your concerns regarding HA problems are founded. We run our own nationwide
> MPLS backbone, and therefore have multiple high capacity bandwidth paths
> between our different DCs, and even with that capacity and fault tolerant
> design, we've seen issues where Management has attempted to invoke HA due
> to brief loss of connectivity (typically due to maintenance or grooming
> activity), and this can be quite problematic. VPN tunnels are going to be
> very challenging for you, and you really need to look at VPLS or some other
> technology that can layer on top of a resilient infrastructure with
> multiple paths and fast failover (e.g. MPLS Fast Reroute).
>
> Ideally, regions should solve this with dedicated local management nodes,
> but until the syncing is sorted out, and those newer releases are stable,
> there isn't much option right now short of using a single region or
> setting up completely separate CS instances per DC.
>
> Hope this helps a little.
>
> - Si
>
> ________________________________________
> From: Logan Barfield <lb...@tqhosting.com>
> Sent: Tuesday, January 06, 2015 1:45 PM
> To: dev@cloudstack.apache.org; users@cloudstack.apache.org
> Subject: Multi-Datacenter Deployment
>
> We are currently running a single location CloudStack deployment:
> - 1 Hardware firewall
> - 1 Management/Database Server
> - 1 NFS staging store (for S3 secondary storage)
> - Ceph RBD for primary storage
> - 4 Hypervisors
> - 1 Zone/Pod/Cluster
>
> We are looking to expand our deployment to other datacenters, and I'm
> trying to determine the best way to go about it.  The documentation is a
> bit lacking for multi-site deployments.
>
> Our goal for the multi-site deployment is to have a zone for each site
> (E.G. US East, US West, Europe) that our customers can use to deploy
> instances in their preferred geographic area.
>
> Since we don't want to have different accounts for every datacenter, I
> don't think using Regions makes sense for us (and I'm not sure what they're
> actually good for without keeping accounts/users/domains in sync).
>
> Right now I'm thinking our setup will be as follows:
> - Firewall, Management Server, NFS staging server, primary storage, and
> Hypervisors in each datacenter.
> - All Management servers will be on the same management network.
> - Management servers will be connected via site-to-site VPN links over WAN.
> - MySQL replication (Percona?) will be set up on the management servers.
> Having an odd number of servers to protect against split brain, and keeping
> redundant database backups.
> - One region (default)
> - One zone for each datacenter
> - Geo-enabled DNS to direct customers to the nearest Management server
> - Object storage for secondary storage across cloud.
>
> My primary concerns with this setup are:
> - I haven't really seen multi-site deployments details anywhere.
> - Potential for split-brain.
> - How will HA be handled (e.g., if a VPN link goes down and one of the
> remote management servers can't contact a host, will it try to initiate
> HA?) - This sort of goes along with the split brain problem.
>
> Are my assumptions here sound, or is there a standard recommended way of
> doing multi-site deployments?
>
> Any suggestions are much appreciated.
>

Re: Multi-Datacenter Deployment

Posted by Logan Barfield <lb...@tqhosting.com>.
I appreciate the explanation.  That seems to confirm what I was thinking,
that until regions are working 100% we'll just have to make sure the
DC-to-DC links are as stable/redundant as possible to prevent HA issues.
If we increase the HA delay it shouldn't be a major issue, and it will
still be better than nothing.
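
To make the HA delay idea concrete, here is a minimal Python sketch of a
grace period: a host is only treated as down once it has been unreachable
for a sustained interval, so a brief DC-to-DC link flap does not trigger HA.
This is not CloudStack's actual HA logic; the HostMonitor class, the
grace_seconds parameter, and the timestamps are all hypothetical and only
illustrate the principle.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HostMonitor:
    """Declare a host down only after grace_seconds of continuous failures."""
    grace_seconds: float                   # hypothetical tuning knob
    first_failure: Optional[float] = None  # time of the first missed check

    def record(self, reachable: bool, now: float) -> bool:
        """Record one health check; return True if HA should be triggered."""
        if reachable:
            self.first_failure = None      # any success resets the outage timer
            return False
        if self.first_failure is None:
            self.first_failure = now
        return now - self.first_failure >= self.grace_seconds


monitor = HostMonitor(grace_seconds=300)

# A 90-second blip (e.g. link maintenance) never reaches the grace period.
blip = [monitor.record(False, t) for t in (0, 30, 60, 90)]
recovered = monitor.record(True, 120)

# A sustained outage triggers HA once 300 seconds have elapsed.
outage = [monitor.record(False, t) for t in (200, 400, 500)]
```

The trade-off is that a longer grace period also delays recovery from a real
host failure by the same amount.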

For us it probably also makes sense to not worry about having management
servers in each DC for now.  If we have a big enough outage in our primary
DC to affect access to the management server we probably have bigger
problems to worry about.

Much appreciated!


Thank You,

Logan Barfield
Tranquil Hosting


Re: Multi-Datacenter Deployment

Posted by Simon Weller <sw...@ena.com>.
Logan,

We currently run CS in multiple geographically separate DCs, and may be able to give you a little insight into things.

We run KVM in advanced networking mode, with CLVM clusters backed onto Dell Compellent SANs. Our DCs each run their own zones, all in a single region. We've been running CS in production since 4.0, prior to regions, so that functionality (along with its limitations) isn't something we've adopted yet. We run our Management servers (multiple clustered nodes) out of one DC, and have a backup set of Management nodes in another DC should we need to invoke BCDR in the event the primary Management nodes become unavailable.

Your concerns regarding HA problems are well founded. We run our own nationwide MPLS backbone, and so have multiple high-capacity paths between our DCs. Even with that capacity and fault-tolerant design, we've seen Management attempt to invoke HA after a brief loss of connectivity (typically due to maintenance or grooming activity), and this can be quite problematic. VPN tunnels are going to be very challenging for you; you really need to look at VPLS or some other technology that can layer on top of a resilient infrastructure with multiple paths and fast failover (e.g. MPLS Fast Reroute).

Ideally, regions should solve this with dedicated local management nodes, but until the syncing is sorted out and those newer releases are stable, there isn't much option other than using a single region right now, short of setting up completely separate CS instances per DC.

Hope this helps a little.

- Si

________________________________________
From: Logan Barfield <lb...@tqhosting.com>
Sent: Tuesday, January 06, 2015 1:45 PM
To: dev@cloudstack.apache.org; users@cloudstack.apache.org
Subject: Multi-Datacenter Deployment

We are currently running a single location CloudStack deployment:
- 1 Hardware firewall
- 1 Management/Database Server
- 1 NFS staging store (for S3 secondary storage)
- Ceph RBD for primary storage
- 4 Hypervisors
- 1 Zone/Pod/Cluster

We are looking to expand our deployment to other datacenters, and I'm
trying to determine the best way to go about it.  The documentation is a
bit lacking for multi-site deployments.

Our goal for the multi-site deployment is to have a zone for each site
(E.G. US East, US West, Europe) that our customers can use to deploy
instances in their preferred geographic area.

Since we don't want to have different accounts for every datacenter, I
don't think using Regions makes sense for us (and I'm not sure what they're
actually good for without keeping accounts/users/domains in sync).

Right now I'm thinking our setup will be as follows:
- Firewall, Management Server, NFS staging server, primary storage, and
Hypervisors in each datacenter.
- All Management servers will be on the same management network.
- Management servers will be connected via site-to-site VPN links over WAN.
- MySQL replication (Percona?) will be set up on the management servers.
Having an odd number of servers to protect against split brain, and keeping
redundant database backups.
- One region (default)
- One zone for each datacenter
- Geo-enabled DNS to direct customers to the nearest Management server
- Object storage for secondary storage across cloud.
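
The reasoning behind the odd number of database servers can be sketched in
a few lines of Python. This assumes a quorum-based cluster in which a
partition may only accept writes while it can see a strict majority of the
nodes; plain asynchronous MySQL replication does not enforce this by itself.
The function names are hypothetical.

```python
from typing import List


def has_quorum(reachable_nodes: int, cluster_size: int) -> bool:
    """True if a partition sees a strict majority of the cluster."""
    return reachable_nodes > cluster_size // 2


def quorum_after_split(partition_sizes: List[int]) -> List[bool]:
    """For each side of a network split, report whether it keeps quorum."""
    cluster_size = sum(partition_sizes)
    return [has_quorum(size, cluster_size) for size in partition_sizes]


# Three sites, one cut off: exactly one side keeps a majority.
three_nodes = quorum_after_split([2, 1])

# Four sites split evenly: neither side has a majority, so no side can
# safely continue. That avoids split brain, but at the cost of an outage.
four_nodes = quorum_after_split([2, 2])
```

With an odd cluster size, any two-way split leaves exactly one side with a
majority, which is why three sites are easier to reason about than two or
four.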

My primary concerns with this setup are:
- I haven't really seen multi-site deployments details anywhere.
- Potential for split-brain.
- How will HA be handled (e.g., if a VPN link goes down and one of the
remote management servers can't contact a host, will it try to initiate
HA?) - This sort of goes along with the split brain problem.

Are my assumptions here sound, or is there a standard recommended way of
doing multi-site deployments?

Any suggestions are much appreciated.
