Posted to users@cloudstack.apache.org by Laurent Steff <La...@inria.fr> on 2013/07/08 14:39:05 UTC

outage feedback and questions

Hello,

CloudStack is used in our company as a core component of a "Continuous Integration"
service.

We are mainly happy with it, for a lot of reasons too long to describe. :)

We recently encountered a major service outage on CloudStack, mainly linked
to bad practices on our side, and the aim of this post is to:

- ask questions about things we haven't understood yet
- gather some practical best practices we missed
- help make CloudStack more robust with our feedback, if the problems we
detected are still present in CloudStack 4.x

We know that the 3.x version is not supported and plan to move to 4.x ASAP.

It's quite a long mail, and it may be badly directed (dev mailing list? multiple bugs?).

Any response is appreciated ;)

Regards,


--------------------long part----------------------------------------

Architecture :
--------------

Old, non-Apache CloudStack 3.0.2 release
1 Zone, 1 physical network, 1 pod
1 Virtual Router VM, 1 SSVM
4 CentOS 6.3 KVM clusters, GFS2 primary storage on iSCSI
Management Server on a VMware virtual machine



Incidents :
-----------

Day 1: Management Server DoSed by internal synchronization scripts (LDAP to CloudStack)
Day 3: DoS corrected; Management Server RAM and CPU upgraded, and rebooted (it had never
been rebooted in more than a year). CloudStack is running normally again
(VM creation/stop/start/console/...)
Day 4: (weekend) Network outage on the core datacenter switch. Network unstable for 2 days.

Symptoms :
----------

Day 7: The network is operational, but most VMs (250 of 300) have been down since Day 4.
The libvirt configurations were erased (/etc/libvirt.d/qemu/VMuid.xml).

The VirtualRouter VM was one of them; filesystem corruption prevented it from rebooting
normally.

The surviving VMs are all on the same KVM/GFS2 cluster.
The SSVM is one of them; console messages indicate it was temporarily in read-only mode.
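For reference, the quickest way to see what libvirt still knew about on each host was
roughly the following (a minimal sketch; standard virsh commands, the instance name is
only an example):

---
# List every domain libvirt still has defined or running on this host.
virsh list --all
# Dump the XML of a surviving domain to keep a copy of its definition,
# since the on-disk configuration files had been erased.
virsh dumpxml i-2-123-VM > /root/i-2-123-VM.xml
---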

Hard way to revival (actions):
-----------------------------

1. The VirtualRouter VM was destroyed by an administrator, to let CloudStack recreate it from the template.

BUT :)

the SystemVM KVM template is not available. Its status in the GUI is "CONNECTION REFUSED".
The URL it was downloaded from during install is no longer valid (an old, unavailable
internal mirror server instead of http://download.cloud.com).

=> we are unable to restart stopped VMs or create new ones

2. Manual download of the template on the Management Server, as in a fresh install:

---
/usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt -m /mnt/secondary/  -u http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2 -h kvm -F
---

This is not sufficient: the MySQL table template_host_ref does not change, even when
changing the URL in the MySQL tables. We still have "CONNECTION REFUSED" as the template
status in MySQL and in the GUI.
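Before moving on to the database edits below, it can also help to confirm that the
template files really landed on secondary storage. A minimal sketch, assuming the default
template layout under the mount point used above, with template id 3 purely as an example
(check the vm_template table for the real id):

---
# Hypothetical sanity check, assuming the default secondary storage layout
# (template/tmpl/<account_id>/<template_id>/); template id 3 is an example.
ls -lh /mnt/secondary/template/tmpl/1/3/
# A correctly installed template directory contains the qcow2 image plus a
# template.properties file describing it.
cat /mnt/secondary/template/tmpl/1/3/template.properties
---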

3. After analysis, we needed to alter the MySQL tables manually (the template_id of the systemVM KVM template was x):

---
update template_host_ref set download_state='DOWNLOADED' where template_id=x;
update template_host_ref set job_id='NULL' where template_id=x; <= may be useless
---
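Note that job_id='NULL' stores the literal string 'NULL' rather than a SQL NULL. If the
intent is to clear the column and then check what CloudStack reads back, a hedged sketch
(assuming the default "cloud" database and x as the template id) would be:

---
# Hypothetical cleanup/check, assuming the database is named "cloud" and x is
# the systemVM template id. NULL without quotes clears the column, unlike 'NULL'.
mysql -u cloud -p cloud -e "UPDATE template_host_ref SET job_id=NULL WHERE template_id=x;"
# Inspect the row CloudStack reads for the template status shown in the GUI.
mysql -u cloud -p cloud -e "SELECT * FROM template_host_ref WHERE template_id=x\G"
---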

4. As in MySQL, the status in the GUI is now DOWNLOADED.

5. On power-on of a stopped VM, CloudStack builds a new VirtualRouter VM, and we can then
let users manually start their stopped VMs.
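A hedged sketch of doing that restart through the API, assuming the unauthenticated
integration API port is enabled on the Management Server (global setting
integration.api.port, e.g. 8096); otherwise a signed API call or the GUI is needed:

---
# Hypothetical example: list stopped VMs and start one by id through the
# integration API port, which must be enabled in global settings first.
MGMT="http://management-server:8096/client/api"
curl -s "${MGMT}?command=listVirtualMachines&listall=true&state=Stopped&response=json"
# startVirtualMachine returns an async job id (poll it with queryAsyncJobResult).
curl -s "${MGMT}?command=startVirtualMachine&id=<vm-uuid>&response=json"
---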


Questions :
-----------

1. What stopped and destroyed the libvirt domains of our VMs? There is some part
of the code that could do this, but I'm not sure.

2. Is it possible that CloudStack autonomously triggered the re-download of the
systemVM template, or does it require human interaction?

3. In 4.x, is the risk of a corrupted systemVM template, or one with a bad status,
still present? Is there any warning beyond a simple "connection refused", which is
not really visible as an alert?

4. Does CloudStack retry by default to restart VMs that should be up, or do
we need to configure this?


--------------------end of long part----------------------------------------


-- 
Laurent Steff

DSI/SESI
http://www.inria.fr/

Re: Windows Licensing

Posted by Kirk Jantzer <ki...@gmail.com>.
Windows licensing is per guest, as if they were physical instances. An
enterprise would want to do KMS activations, and just true up with Microsoft at
the end of the year.

Regards,

Kirk Jantzer
http://about.me/kirkjantzer
On Jul 19, 2013 12:57 PM, "Dean Bruhn" <de...@artisnotcrime.com>
wrote:

> How are people dealing with Windows licensing for their user base on
> non-Hyper-V systems? My research points to SPLA licensing, but I wanted to see if
> anyone else had any insight.
>
> - Dean

Windows Licensing

Posted by Dean Bruhn <de...@artisnotcrime.com>.
How are people dealing with Windows licensing for their user base on non-Hyper-V systems? My research points to SPLA licensing, but I wanted to see if anyone else had any insight.

- Dean 

RE: outage feedback and questions

Posted by David Ortiz <dp...@outlook.com>.
Dean,
     The system I've been working with is a small dev system, so we only have the one appliance, which is most definitely a single point of failure.  It looks like Nexenta does have an HA plugin available for clustering two or more of them, but I don't really know anything about it other than having just read the intro paragraph on this data sheet.
http://info.nexenta.com/rs/nexenta/images/data_sheet_ha_cluster.pdf
Thanks,     Dave



Re: outage feedback and questions

Posted by Dean Kamali <de...@gmail.com>.
For primary storage, does NexentaStor provide you with HA?



RE: outage feedback and questions

Posted by David Ortiz <dp...@outlook.com>.
Dean,
    We didn't really have a recovery plan in place at the time.  Fortunately for us, this was just before we went live for other users to hit our system, so what ended up happening was that I was able to compare the MySQL database entries for volumes with the list of files that were still present on primary storage.  From there I could figure out which VMs were missing root disks and delete/rebuild them as needed, and then for data volumes that were missing we were able to simply recreate them and go into the instances to reformat and do any other configuration.  Fortunately we had created all the VMs that went down, and I had created base templates for each basic system type we were using (e.g. hadoop node, web server, etc.), so recovery was pretty straightforward.
We have now been taking snapshots of our VMs and vendor VMs so we can restore from those if things get corrupted.  We are also using NexentaStor for our shared storage, which I believe lets you snapshot the entire shared filesystem as well.
Thanks,     Dave


RE: outage feedback and questions

Posted by Dean Kamali <de...@gmail.com>.
Just wondering if you had a recovery plan?
Would you please share your experience with us.

Thank you

RE: outage feedback and questions

Posted by David Ortiz <dp...@outlook.com>.
Laurent,
    We too had some issues where we lost VMs after a switch went down.  We are also using GFS2 over iSCSI for our primary storage.  Once I got the cluster back up, fsck found a lot of corruption on the GFS2 filesystem, which resulted in probably 6 VMs out of the 25 we had needing to have volumes rebuilt, or having to be rebuilt completely.  I would guess this is what happened in your case as well.
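A minimal sketch of that kind of repair, assuming an example device path and that the
GFS2 filesystem is unmounted on every node in the cluster first:

---
# Hypothetical GFS2 repair pass; /dev/vg_primary/lv_gfs2 is only an example
# device path, and the filesystem must be unmounted on ALL cluster nodes.
fsck.gfs2 -y /dev/vg_primary/lv_gfs2
---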
Thanks,     David Ortiz

> From: dean.kamali@gmail.com
> Date: Tue, 9 Jul 2013 19:35:52 -0400
> Subject: Re: outage feedback and questions
> To: users@cloudstack.apache.org
> 
> courtesy to geoff.higginbottom@shapeblue.com for answering this question first
> 
> 
> On Tue, Jul 9, 2013 at 7:33 PM, Dean Kamali <de...@gmail.com> wrote:
> 
> > Well, I asked on the mailing list some time ago about CloudStack
> > behaviour when I lose connectivity to primary storage and then
> > hypervisors start rebooting randomly.
> >
> > I believe this is very similar to what happened in your case.
> >
> > This is actually 'by design'.  The logic is that if the storage goes
> > offline, then all VMs must have also failed, and a 'forced' reboot of the
> > Host 'might' automatically fix things.
> >
> > This is great if you only have one Primary Storage, but typically you
> > have more than one, so whilst the reboot might fix the failed storage, it
> > will also kill off all the perfectly good VMs which were still happily
> > running.
> >
> > The answer I got was for XenServer, not KVM; it involved removing the
> > reboot -f option from a config file.
> >
> >
> >
> > The fix for XenServer Hosts is to:
> >
> > 1. Modify /opt/xensource/bin/xenheartbeat.sh on all your Hosts,
> > commenting out the two entries which have "reboot -f"
> >
> > 2. Identify the PID of the script  - pidof -x xenheartbeat.sh
> >
> > 3. Restart the Script  - kill <pid>
> >
> > 4. Force reconnect Host from the UI,  the script will then re-launch on
> > reconnect
> >
> >
> >
> > On Tue, Jul 9, 2013 at 7:08 PM, Laurent Steff <La...@inria.fr>wrote:
> >
> >> Hi Dean,
> >>
> >> And thanks for your answer.
> >>
> >> Yes, the network troubles led to issues with the main storage
> >> on the clusters (iSCSI).
> >>
> >> So is it a fact that if the main storage is lost on KVM, VMs are stopped
> >> and their domains destroyed?
> >>
> >> It was a hypothesis, as I found traces in
> >>
> >>
> >> apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
> >>
> >> which "kills -9 qemu processes" if main storage is not found, but I was
> >> not sure when the function was called.
> >>
> >> It's on the function  checkingMountPoint, which calls destroyVMs if mount
> >> point not found.
> >>
> >> Regards,
> >>
> >> ----- Mail original -----
> >> > De: "Dean Kamali" <de...@gmail.com>
> >> > À: users@cloudstack.apache.org
> >> > Envoyé: Lundi 8 Juillet 2013 16:34:04
> >> > Objet: Re: outage feedback and questions
> >> >
> >> > Survivors VMs are on the same KVM/GFS2 Cluster.
> >> > SSVM is one of them. Messages on the console indicates she was
> >> > temporarily
> >> > in read-only mode
> >> >
> >> > Do you have an issue with storage?
> >> >
> >> > I wouldn't expect a failure in switch could cause all of this, it
> >> > will
> >> > cause loss of network connectivity but it shouldn't cause your vms to
> >> > go
> >> > down.
> >> >
> >> > This behavior usually happens when you lose your primary storage.
> >> >
> >> >
> >> >
> >> >
> >> > On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff
> >> > <La...@inria.fr>wrote:
> >> >
> >> > > Hello,
> >> > >
> >> > > Cloudstack is used in our company as a core component of a
> >> > > "Continuous
> >> > > Integration"
> >> > > Service.
> >> > >
> >> > > We are mainly happy with it, for a lot of reasons too long to
> >> > > describe. :)
> >> > >
> >> > > We encountered recently a major service outage on Cloudstack mainly
> >> > > linked
> >> > > to bad practices on our side, and the aim of this post is :
> >> > >
> >> > > - ask questions about things we didn't understand yet
> >> > > - gather some practical best practices we missed
> >> > > - if problems detected are still present on Cloudstack 4.x, helping
> >> > > to robustify Cloudstack with our feedback
> >> > >
> >> > > we know that 3.x version is not supported and plan to move ASAP in
> >> > > 4.x
> >> > > version.
> >> > >
> >> > > It's quite a long mail, and it may be badly directed (dev mailing
> >> > > list ?
> >> > > multiple bugs ?)
> >> > >
> >> > > Any response is appreciated ;)
> >> > >
> >> > > Regards,
> >> > >
> >> > >
> >> > > --------------------long
> >> > > part----------------------------------------
> >> > >
> >> > > Architecture :
> >> > > --------------
> >> > >
> >> > > Old and non Apache CloudStack 3.0.2 release
> >> > > 1 Zone, 1 physical network, 1 pod
> >> > > 1 Virtual Router VM, 1 SSVM
> >> > > 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iscsi storage
> >> > > Management Server on Vmware virtual machine
> >> > >
> >> > >
> >> > >
> >> > > Incidents :
> >> > > -----------
> >> > >
> >> > > Day 1 : Management Server DoSed by internal synchronization scripts
> >> > > (ldap
> >> > > to Cloudstack)
> >> > > Day 3 : DoS corrected, Management Server RAM and CPU ugraded, and
> >> > > rebooted
> >> > > (never rebooted in more than a year). Cloudstack
> >> > > is running again normally (vm creation/stop/start/console/...)
> >> > > Day 4 : (week-end) Network outage on core datacenter switch.
> >> > > Network
> >> > > unstable 2 days.
> >> > >
> >> > > Symptoms :
> >> > > ----------
> >> > >
> >> > > Day 7 : The network is operationnal but most of VMs down (250 of
> >> > > 300)
> >> > > since Day 4.
> >> > > Libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml erased).
> >> > >
> >> > > VirtualRouter VM fileystem was on of them. Filesystem corruption
> >> > > prevented
> >> > > it to reboot normally.
> >> > >
> >> > > Survivors VMs are on the same KVM/GFS2 Cluster.
> >> > > SSVM is one of them. Messages on the console indicates she was
> >> > > temporarily
> >> > > in read-only mode
> >> > >
> >> > > Hard way to revival (actions):
> >> > > -----------------------------
> >> > >
> >> > > 1. VirtualRouter VM destructed by an administrator, to let
> >> > > CloudStack
> >> > > recreate it from template.
> >> > >
> >> > > BUT :)
> >> > >
> >> > > the SystemVM KVM Template is not available. Status in GUI is
> >> > > "CONNECTION
> >> > > REFUSED".
> >> > > The url from where it was downloaded during install is no more
> >> > > valid (old
> >> > > and unavailable
> >> > > internal mirror server  instead of http://download.cloud.com)
> >> > >
> >> > > => we are unable to start again VMs stopped and create new ones
> >> > >
> >> > > 2. Manual download on the Managment Server of the template, like in
> >> > > a
> >> > > fresh install
> >> > >
> >> > > ---
> >> > >
> >> /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt
> >> > > -m /mnt/secondary/  -u
> >> > >
> >> http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2-h
> >> > > kvm -F
> >> > > ---
> >> > >
> >> > > It's no sufficient. mysql table template_host_ref does not change.
> >> > > Even
> >> > > when changing url in mysql tables.
> >> > > We still have "CONNECTION REFUSED" on template status in mysql and
> >> > > on the
> >> > > GUI
> >> > >
> >> > > 3. after analysis, we needed to alter manualy mysql tables
> >> > > (template_id of
> >> > > systemVM KVM was x) :
> >> > >
> >> > > ---
> >> > > update template_host_ref set download_state='DOWNLOADED' where
> >> > > template_id=x;
> >> > > update template_host_ref set job_id='NULL' where template_id=x; <=
> >> > > may be
> >> > > useless
> >> > > update template_host_ref set job_id='NULL' where template_id=x; <=
> >> > > may be
> >> > > useless
> >> > > ---
> >> > >
> >> > > 4. As in MySQL, status on GUI is DOWNLOADED
> >> > >
> >> > > 5. Poweron of a stopped VM, Cloudstack builds a new VirtualRouter
> >> > > VM and
> >> > > we can let users
> >> > > start manually their stopped VM
> >> > >
> >> > >
> >> > > Questions :
> >> > > -----------
> >> > >
> >> > > 1. What did stop and destroyed the libvirt domains of our VMs ?
> >> > > There's
> >> > > some part
> >> > > of code who could do this, but I'm not sure
> >> > >
> >> > > 2. Is it possible that Cloudstack triggered autonomously the
> >> > > re-download
> >> > > of the
> >> > > systemVM template ? Or has it to be an human interaction.
> >> > >
> >> > > 3. In 4.x is the risk of a corrupted, or systemVM template with a
> >> > > bad
> >> > > status
> >> > > still present. Is there any warning more than a simple "connexion
> >> > > refused"
> >> > > not
> >> > > really visible as an alert ?
> >> > >
> >> > > 4. Is Cloudstack retrying by default to restart VMs who should be
> >> > > up, or do
> >> > > we need configuration for this ?
> >> > >
> >> > >
> >> > > --------------------end of long
> >> > > part----------------------------------------
> >> > >
> >> > >
> >> > > --
> >> > > Laurent Steff
> >> > >
> >> > > DSI/SESI
> >> > > http://www.inria.fr/
> >> > >
> >> >
> >>
> >> --
> >> Laurent Steff
> >>
> >> DSI/SESI
> >> INRIA
> >> Tél.  : +33 1 39 63 50 81
> >> Port. : +33 6 87 66 77 85
> >> http://www.inria.fr/
> >>
> >
> >
 		 	   		  

Re: outage feedback and questions

Posted by Dean Kamali <de...@gmail.com>.
Courtesy to geoff.higginbottom@shapeblue.com for answering this question first.


On Tue, Jul 9, 2013 at 7:33 PM, Dean Kamali <de...@gmail.com> wrote:

> Well, I have asked in the mailing list sometime ago, about
> cloudstack behaviour when I lose connectively to primary storage, then
> hypervisor start rebooting randomly.
>
> I believe this what is very similar to what happend in your case.
>
> This is actually 'by design'.  The logic is that if the storage goes
> offline, then all VMs must have also failed, and a 'forced' reboot of the
> Host 'might' automatically fix things.
>
> This is great if you only have one Primary Storage, but typically you
> have more than one, so whilst the reboot might fix the failed storage, it
> will also kill off all the perfectly good VMs which were still happily
> running.
>
> The answer what I got was for xenserver not KVM, it included removing the
> reboot -f option for a config file.
>
>
>
> The fix for XenServer Hosts is to:
>
> 1. Modify /opt/xensource/bin/xenheartbeat.sh on all your Hosts,
> commenting out the two entries which have "reboot -f"
>
> 2. Identify the PID of the script  - pidof -x xenheartbeat.sh
>
> 3. Restart the Script  - kill <pid>
>
> 4. Force reconnect Host from the UI,  the script will then re-launch on
> reconnect
>
>
>
> On Tue, Jul 9, 2013 at 7:08 PM, Laurent Steff <La...@inria.fr>wrote:
>
>> Hi Dean,
>>
>> And thanks for your answer.
>>
>> Yes the network troubles lead to issue with the main storage
>> on clusters (iscsi).
>>
>> So is that a fact if the main storage is lost on KVM, VMs are stopped
>> and domain destroyed ?
>>
>> It was an hypothesis as I found traces in
>>
>>
>> apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
>>
>> which "kills -9 qemu processes" if main storage is not found, but I was
>> not sure when the function was called.
>>
>> It's on the function  checkingMountPoint, which calls destroyVMs if mount
>> point not found.
>>
>> Regards,
>>
>> ----- Mail original -----
>> > De: "Dean Kamali" <de...@gmail.com>
>> > À: users@cloudstack.apache.org
>> > Envoyé: Lundi 8 Juillet 2013 16:34:04
>> > Objet: Re: outage feedback and questions
>> >
>> > Survivors VMs are on the same KVM/GFS2 Cluster.
>> > SSVM is one of them. Messages on the console indicates she was
>> > temporarily
>> > in read-only mode
>> >
>> > Do you have an issue with storage?
>> >
>> > I wouldn't expect a failure in switch could cause all of this, it
>> > will
>> > cause loss of network connectivity but it shouldn't cause your vms to
>> > go
>> > down.
>> >
>> > This behavior usually happens when you lose your primary storage.
>> >
>> >
>> >
>> >
>> > On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff
>> > <La...@inria.fr>wrote:
>> >
>> > > Hello,
>> > >
>> > > Cloudstack is used in our company as a core component of a
>> > > "Continuous
>> > > Integration"
>> > > Service.
>> > >
>> > > We are mainly happy with it, for a lot of reasons too long to
>> > > describe. :)
>> > >
>> > > We encountered recently a major service outage on Cloudstack mainly
>> > > linked
>> > > to bad practices on our side, and the aim of this post is :
>> > >
>> > > - ask questions about things we didn't understand yet
>> > > - gather some practical best practices we missed
>> > > - if problems detected are still present on Cloudstack 4.x, helping
>> > > to robustify Cloudstack with our feedback
>> > >
>> > > we know that 3.x version is not supported and plan to move ASAP in
>> > > 4.x
>> > > version.
>> > >
>> > > It's quite a long mail, and it may be badly directed (dev mailing
>> > > list ?
>> > > multiple bugs ?)
>> > >
>> > > Any response is appreciated ;)
>> > >
>> > > Regards,
>> > >
>> > >
>> > > --------------------long
>> > > part----------------------------------------
>> > >
>> > > Architecture :
>> > > --------------
>> > >
>> > > Old and non Apache CloudStack 3.0.2 release
>> > > 1 Zone, 1 physical network, 1 pod
>> > > 1 Virtual Router VM, 1 SSVM
>> > > 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iscsi storage
>> > > Management Server on Vmware virtual machine
>> > >
>> > >
>> > >
>> > > Incidents :
>> > > -----------
>> > >
>> > > Day 1 : Management Server DoSed by internal synchronization scripts
>> > > (ldap
>> > > to Cloudstack)
>> > > Day 3 : DoS corrected, Management Server RAM and CPU ugraded, and
>> > > rebooted
>> > > (never rebooted in more than a year). Cloudstack
>> > > is running again normally (vm creation/stop/start/console/...)
>> > > Day 4 : (week-end) Network outage on core datacenter switch.
>> > > Network
>> > > unstable 2 days.
>> > >
>> > > Symptoms :
>> > > ----------
>> > >
>> > > Day 7 : The network is operationnal but most of VMs down (250 of
>> > > 300)
>> > > since Day 4.
>> > > Libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml erased).
>> > >
>> > > VirtualRouter VM fileystem was on of them. Filesystem corruption
>> > > prevented
>> > > it to reboot normally.
>> > >
>> > > Survivors VMs are on the same KVM/GFS2 Cluster.
>> > > SSVM is one of them. Messages on the console indicates she was
>> > > temporarily
>> > > in read-only mode
>> > >
>> > > Hard way to revival (actions):
>> > > -----------------------------
>> > >
>> > > 1. VirtualRouter VM destructed by an administrator, to let
>> > > CloudStack
>> > > recreate it from template.
>> > >
>> > > BUT :)
>> > >
>> > > the SystemVM KVM Template is not available. Status in GUI is
>> > > "CONNECTION
>> > > REFUSED".
>> > > The url from where it was downloaded during install is no more
>> > > valid (old
>> > > and unavailable
>> > > internal mirror server  instead of http://download.cloud.com)
>> > >
>> > > => we are unable to start again VMs stopped and create new ones
>> > >
>> > > 2. Manual download on the Managment Server of the template, like in
>> > > a
>> > > fresh install
>> > >
>> > > ---
>> > >
>> /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt
>> > > -m /mnt/secondary/  -u
>> > >
>> http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2-h
>> > > kvm -F
>> > > ---
>> > >
>> > > It's no sufficient. mysql table template_host_ref does not change.
>> > > Even
>> > > when changing url in mysql tables.
>> > > We still have "CONNECTION REFUSED" on template status in mysql and
>> > > on the
>> > > GUI
>> > >
>> > > 3. after analysis, we needed to alter manualy mysql tables
>> > > (template_id of
>> > > systemVM KVM was x) :
>> > >
>> > > ---
>> > > update template_host_ref set download_state='DOWNLOADED' where
>> > > template_id=x;
>> > > update template_host_ref set job_id='NULL' where template_id=x; <=
>> > > may be
>> > > useless
>> > > update template_host_ref set job_id='NULL' where template_id=x; <=
>> > > may be
>> > > useless
>> > > ---
>> > >
>> > > 4. As in MySQL, status on GUI is DOWNLOADED
>> > >
>> > > 5. Poweron of a stopped VM, Cloudstack builds a new VirtualRouter
>> > > VM and
>> > > we can let users
>> > > start manually their stopped VM
>> > >
>> > >
>> > > Questions :
>> > > -----------
>> > >
>> > > 1. What did stop and destroyed the libvirt domains of our VMs ?
>> > > There's
>> > > some part
>> > > of code who could do this, but I'm not sure
>> > >
>> > > 2. Is it possible that Cloudstack triggered autonomously the
>> > > re-download
>> > > of the
>> > > systemVM template ? Or has it to be an human interaction.
>> > >
>> > > 3. In 4.x is the risk of a corrupted, or systemVM template with a
>> > > bad
>> > > status
>> > > still present. Is there any warning more than a simple "connexion
>> > > refused"
>> > > not
>> > > really visible as an alert ?
>> > >
>> > > 4. Is Cloudstack retrying by default to restart VMs who should be
>> > > up, or do
>> > > we need configuration for this ?
>> > >
>> > >
>> > > --------------------end of long
>> > > part----------------------------------------
>> > >
>> > >
>> > > --
>> > > Laurent Steff
>> > >
>> > > DSI/SESI
>> > > http://www.inria.fr/
>> > >
>> >
>>
>> --
>> Laurent Steff
>>
>> DSI/SESI
>> INRIA
>> Tél.  : +33 1 39 63 50 81
>> Port. : +33 6 87 66 77 85
>> http://www.inria.fr/
>>
>
>

Re: outage feedback and questions

Posted by Dean Kamali <de...@gmail.com>.
Well, I asked on the mailing list some time ago about CloudStack's
behaviour when connectivity to primary storage is lost and the
hypervisors start rebooting randomly.

I believe this is very similar to what happened in your case.

This is actually 'by design'.  The logic is that if the storage goes
offline, then all VMs must have also failed, and a 'forced' reboot of the
Host 'might' automatically fix things.

This is great if you only have one Primary Storage, but typically you have
more than one, so whilst the reboot might fix the failed storage, it will
also kill off all the perfectly good VMs which were still happily running.

The answer I got was for XenServer, not KVM; it involved removing the
"reboot -f" commands from a script.



The fix for XenServer Hosts is to:

1. Modify /opt/xensource/bin/xenheartbeat.sh on all your Hosts, commenting
out the two entries which have "reboot -f"

2. Identify the PID of the script  - pidof -x xenheartbeat.sh

3. Restart the Script  - kill <pid>

4. Force reconnect Host from the UI,  the script will then re-launch on
reconnect
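
A sketch of how steps 1-3 could be scripted on each XenServer host is below;
it only comments out the "reboot -f" lines and leaves the rest of the script
alone (the sed approach is an assumption, so check the script manually first;
step 4, the force reconnect, is still done from the UI):

---
# comment out every line of the heartbeat script containing "reboot -f",
# keeping a backup copy of the original
sed -i.bak '/reboot -f/ s/^/#/' /opt/xensource/bin/xenheartbeat.sh

# find the running copy of the script and stop it; it is re-launched
# automatically when the host is force-reconnected from the UI
PID=$(pidof -x xenheartbeat.sh)
[ -n "$PID" ] && kill "$PID"
---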



On Tue, Jul 9, 2013 at 7:08 PM, Laurent Steff <La...@inria.fr>wrote:

> Hi Dean,
>
> And thanks for your answer.
>
> Yes the network troubles lead to issue with the main storage
> on clusters (iscsi).
>
> So is that a fact if the main storage is lost on KVM, VMs are stopped
> and domain destroyed ?
>
> It was an hypothesis as I found traces in
>
>
> apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java
>
> which "kills -9 qemu processes" if main storage is not found, but I was
> not sure when the function was called.
>
> It's on the function  checkingMountPoint, which calls destroyVMs if mount
> point not found.
>
> Regards,
>
> ----- Mail original -----
> > De: "Dean Kamali" <de...@gmail.com>
> > À: users@cloudstack.apache.org
> > Envoyé: Lundi 8 Juillet 2013 16:34:04
> > Objet: Re: outage feedback and questions
> >
> > Survivors VMs are on the same KVM/GFS2 Cluster.
> > SSVM is one of them. Messages on the console indicates she was
> > temporarily
> > in read-only mode
> >
> > Do you have an issue with storage?
> >
> > I wouldn't expect a failure in switch could cause all of this, it
> > will
> > cause loss of network connectivity but it shouldn't cause your vms to
> > go
> > down.
> >
> > This behavior usually happens when you lose your primary storage.
> >
> >
> >
> >
> > On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff
> > <La...@inria.fr>wrote:
> >
> > > Hello,
> > >
> > > Cloudstack is used in our company as a core component of a
> > > "Continuous
> > > Integration"
> > > Service.
> > >
> > > We are mainly happy with it, for a lot of reasons too long to
> > > describe. :)
> > >
> > > We encountered recently a major service outage on Cloudstack mainly
> > > linked
> > > to bad practices on our side, and the aim of this post is :
> > >
> > > - ask questions about things we didn't understand yet
> > > - gather some practical best practices we missed
> > > - if problems detected are still present on Cloudstack 4.x, helping
> > > to robustify Cloudstack with our feedback
> > >
> > > we know that 3.x version is not supported and plan to move ASAP in
> > > 4.x
> > > version.
> > >
> > > It's quite a long mail, and it may be badly directed (dev mailing
> > > list ?
> > > multiple bugs ?)
> > >
> > > Any response is appreciated ;)
> > >
> > > Regards,
> > >
> > >
> > > --------------------long
> > > part----------------------------------------
> > >
> > > Architecture :
> > > --------------
> > >
> > > Old and non Apache CloudStack 3.0.2 release
> > > 1 Zone, 1 physical network, 1 pod
> > > 1 Virtual Router VM, 1 SSVM
> > > 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iscsi storage
> > > Management Server on Vmware virtual machine
> > >
> > >
> > >
> > > Incidents :
> > > -----------
> > >
> > > Day 1 : Management Server DoSed by internal synchronization scripts
> > > (ldap
> > > to Cloudstack)
> > > Day 3 : DoS corrected, Management Server RAM and CPU ugraded, and
> > > rebooted
> > > (never rebooted in more than a year). Cloudstack
> > > is running again normally (vm creation/stop/start/console/...)
> > > Day 4 : (week-end) Network outage on core datacenter switch.
> > > Network
> > > unstable 2 days.
> > >
> > > Symptoms :
> > > ----------
> > >
> > > Day 7 : The network is operationnal but most of VMs down (250 of
> > > 300)
> > > since Day 4.
> > > Libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml erased).
> > >
> > > VirtualRouter VM fileystem was on of them. Filesystem corruption
> > > prevented
> > > it to reboot normally.
> > >
> > > Survivors VMs are on the same KVM/GFS2 Cluster.
> > > SSVM is one of them. Messages on the console indicates she was
> > > temporarily
> > > in read-only mode
> > >
> > > Hard way to revival (actions):
> > > -----------------------------
> > >
> > > 1. VirtualRouter VM destructed by an administrator, to let
> > > CloudStack
> > > recreate it from template.
> > >
> > > BUT :)
> > >
> > > the SystemVM KVM Template is not available. Status in GUI is
> > > "CONNECTION
> > > REFUSED".
> > > The url from where it was downloaded during install is no more
> > > valid (old
> > > and unavailable
> > > internal mirror server  instead of http://download.cloud.com)
> > >
> > > => we are unable to start again VMs stopped and create new ones
> > >
> > > 2. Manual download on the Managment Server of the template, like in
> > > a
> > > fresh install
> > >
> > > ---
> > >
> /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt
> > > -m /mnt/secondary/  -u
> > >
> http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2-h
> > > kvm -F
> > > ---
> > >
> > > It's no sufficient. mysql table template_host_ref does not change.
> > > Even
> > > when changing url in mysql tables.
> > > We still have "CONNECTION REFUSED" on template status in mysql and
> > > on the
> > > GUI
> > >
> > > 3. after analysis, we needed to alter manualy mysql tables
> > > (template_id of
> > > systemVM KVM was x) :
> > >
> > > ---
> > > update template_host_ref set download_state='DOWNLOADED' where
> > > template_id=x;
> > > update template_host_ref set job_id='NULL' where template_id=x; <=
> > > may be
> > > useless
> > > update template_host_ref set job_id='NULL' where template_id=x; <=
> > > may be
> > > useless
> > > ---
> > >
> > > 4. As in MySQL, status on GUI is DOWNLOADED
> > >
> > > 5. Poweron of a stopped VM, Cloudstack builds a new VirtualRouter
> > > VM and
> > > we can let users
> > > start manually their stopped VM
> > >
> > >
> > > Questions :
> > > -----------
> > >
> > > 1. What did stop and destroyed the libvirt domains of our VMs ?
> > > There's
> > > some part
> > > of code who could do this, but I'm not sure
> > >
> > > 2. Is it possible that Cloudstack triggered autonomously the
> > > re-download
> > > of the
> > > systemVM template ? Or has it to be an human interaction.
> > >
> > > 3. In 4.x is the risk of a corrupted, or systemVM template with a
> > > bad
> > > status
> > > still present. Is there any warning more than a simple "connexion
> > > refused"
> > > not
> > > really visible as an alert ?
> > >
> > > 4. Is Cloudstack retrying by default to restart VMs who should be
> > > up, or do
> > > we need configuration for this ?
> > >
> > >
> > > --------------------end of long
> > > part----------------------------------------
> > >
> > >
> > > --
> > > Laurent Steff
> > >
> > > DSI/SESI
> > > http://www.inria.fr/
> > >
> >
>
> --
> Laurent Steff
>
> DSI/SESI
> INRIA
> Tél.  : +33 1 39 63 50 81
> Port. : +33 6 87 66 77 85
> http://www.inria.fr/
>

Re: outage feedback and questions

Posted by Laurent Steff <La...@inria.fr>.
Hi Dean,

And thanks for your answer.

Yes, the network troubles led to issues with the primary storage
on the clusters (iSCSI).

So is it a fact that, if the primary storage is lost on KVM, the VMs are
stopped and their libvirt domains destroyed?

It was a hypothesis, as I found code in

apache-cloudstack-4.0.2-src/plugins/hypervisors/kvm/src/com/cloud/hypervisor/kvm/resource/KVMHABase.java

which "kill -9"s the qemu processes if the primary storage is not found, but
I was not sure when that function was called.

It is in the function checkingMountPoint, which calls destroyVMs if the
mount point is not found.
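
For reference, a simplified shell illustration of that check (this is not the
actual Java code, and the mount path is only an example; the agent derives
the real path from the storage pool configuration):

---
# hypothetical primary storage mount point as seen by the KVM agent
POOL_MNT=/mnt/primary-pool

# roughly what the HA heartbeat does: if the pool's mount point is gone,
# the qemu processes of the VMs on that pool are killed (kill -9)
if ! mountpoint -q "$POOL_MNT"; then
    echo "primary storage mount missing: HA code would destroy the VMs here"
fi
---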

Regards,

----- Mail original -----
> De: "Dean Kamali" <de...@gmail.com>
> À: users@cloudstack.apache.org
> Envoyé: Lundi 8 Juillet 2013 16:34:04
> Objet: Re: outage feedback and questions
> 
> Survivors VMs are on the same KVM/GFS2 Cluster.
> SSVM is one of them. Messages on the console indicates she was
> temporarily
> in read-only mode
> 
> Do you have an issue with storage?
> 
> I wouldn't expect a failure in switch could cause all of this, it
> will
> cause loss of network connectivity but it shouldn't cause your vms to
> go
> down.
> 
> This behavior usually happens when you lose your primary storage.
> 
> 
> 
> 
> On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff
> <La...@inria.fr>wrote:
> 
> > Hello,
> >
> > Cloudstack is used in our company as a core component of a
> > "Continuous
> > Integration"
> > Service.
> >
> > We are mainly happy with it, for a lot of reasons too long to
> > describe. :)
> >
> > We encountered recently a major service outage on Cloudstack mainly
> > linked
> > to bad practices on our side, and the aim of this post is :
> >
> > - ask questions about things we didn't understand yet
> > - gather some practical best practices we missed
> > - if problems detected are still present on Cloudstack 4.x, helping
> > to robustify Cloudstack with our feedback
> >
> > we know that 3.x version is not supported and plan to move ASAP in
> > 4.x
> > version.
> >
> > It's quite a long mail, and it may be badly directed (dev mailing
> > list ?
> > multiple bugs ?)
> >
> > Any response is appreciated ;)
> >
> > Regards,
> >
> >
> > --------------------long
> > part----------------------------------------
> >
> > Architecture :
> > --------------
> >
> > Old and non Apache CloudStack 3.0.2 release
> > 1 Zone, 1 physical network, 1 pod
> > 1 Virtual Router VM, 1 SSVM
> > 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iscsi storage
> > Management Server on Vmware virtual machine
> >
> >
> >
> > Incidents :
> > -----------
> >
> > Day 1 : Management Server DoSed by internal synchronization scripts
> > (ldap
> > to Cloudstack)
> > Day 3 : DoS corrected, Management Server RAM and CPU ugraded, and
> > rebooted
> > (never rebooted in more than a year). Cloudstack
> > is running again normally (vm creation/stop/start/console/...)
> > Day 4 : (week-end) Network outage on core datacenter switch.
> > Network
> > unstable 2 days.
> >
> > Symptoms :
> > ----------
> >
> > Day 7 : The network is operationnal but most of VMs down (250 of
> > 300)
> > since Day 4.
> > Libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml erased).
> >
> > VirtualRouter VM fileystem was on of them. Filesystem corruption
> > prevented
> > it to reboot normally.
> >
> > Survivors VMs are on the same KVM/GFS2 Cluster.
> > SSVM is one of them. Messages on the console indicates she was
> > temporarily
> > in read-only mode
> >
> > Hard way to revival (actions):
> > -----------------------------
> >
> > 1. VirtualRouter VM destructed by an administrator, to let
> > CloudStack
> > recreate it from template.
> >
> > BUT :)
> >
> > the SystemVM KVM Template is not available. Status in GUI is
> > "CONNECTION
> > REFUSED".
> > The url from where it was downloaded during install is no more
> > valid (old
> > and unavailable
> > internal mirror server  instead of http://download.cloud.com)
> >
> > => we are unable to start again VMs stopped and create new ones
> >
> > 2. Manual download on the Managment Server of the template, like in
> > a
> > fresh install
> >
> > ---
> > /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt
> > -m /mnt/secondary/  -u
> > http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2-h
> > kvm -F
> > ---
> >
> > It's no sufficient. mysql table template_host_ref does not change.
> > Even
> > when changing url in mysql tables.
> > We still have "CONNECTION REFUSED" on template status in mysql and
> > on the
> > GUI
> >
> > 3. after analysis, we needed to alter manualy mysql tables
> > (template_id of
> > systemVM KVM was x) :
> >
> > ---
> > update template_host_ref set download_state='DOWNLOADED' where
> > template_id=x;
> > update template_host_ref set job_id='NULL' where template_id=x; <=
> > may be
> > useless
> > update template_host_ref set job_id='NULL' where template_id=x; <=
> > may be
> > useless
> > ---
> >
> > 4. As in MySQL, status on GUI is DOWNLOADED
> >
> > 5. Poweron of a stopped VM, Cloudstack builds a new VirtualRouter
> > VM and
> > we can let users
> > start manually their stopped VM
> >
> >
> > Questions :
> > -----------
> >
> > 1. What did stop and destroyed the libvirt domains of our VMs ?
> > There's
> > some part
> > of code who could do this, but I'm not sure
> >
> > 2. Is it possible that Cloudstack triggered autonomously the
> > re-download
> > of the
> > systemVM template ? Or has it to be an human interaction.
> >
> > 3. In 4.x is the risk of a corrupted, or systemVM template with a
> > bad
> > status
> > still present. Is there any warning more than a simple "connexion
> > refused"
> > not
> > really visible as an alert ?
> >
> > 4. Is Cloudstack retrying by default to restart VMs who should be
> > up, or do
> > we need configuration for this ?
> >
> >
> > --------------------end of long
> > part----------------------------------------
> >
> >
> > --
> > Laurent Steff
> >
> > DSI/SESI
> > http://www.inria.fr/
> >
> 

-- 
Laurent Steff

DSI/SESI
INRIA
Tél.  : +33 1 39 63 50 81
Port. : +33 6 87 66 77 85
http://www.inria.fr/

Re: outage feedback and questions

Posted by Dean Kamali <de...@gmail.com>.
Survivors VMs are on the same KVM/GFS2 Cluster.
SSVM is one of them. Messages on the console indicates she was temporarily
in read-only mode

Do you have an issue with storage?

I wouldn't expect a failure in a switch to cause all of this; it would
cause a loss of network connectivity, but it shouldn't cause your VMs to
go down.

This behavior usually happens when you lose your primary storage.
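
A couple of quick checks that can confirm this on a KVM host (the target
names and mount paths below are only examples):

---
# is the iSCSI session to the primary storage still logged in?
iscsiadm -m session

# is the primary storage mount point still mounted and reachable?
mountpoint /mnt/primary && df -h /mnt/primary
---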




On Mon, Jul 8, 2013 at 8:39 AM, Laurent Steff <La...@inria.fr>wrote:

> Hello,
>
> Cloudstack is used in our company as a core component of a "Continuous
> Integration"
> Service.
>
> We are mainly happy with it, for a lot of reasons too long to describe. :)
>
> We encountered recently a major service outage on Cloudstack mainly linked
> to bad practices on our side, and the aim of this post is :
>
> - ask questions about things we didn't understand yet
> - gather some practical best practices we missed
> - if problems detected are still present on Cloudstack 4.x, helping
> to robustify Cloudstack with our feedback
>
> we know that 3.x version is not supported and plan to move ASAP in 4.x
> version.
>
> It's quite a long mail, and it may be badly directed (dev mailing list ?
> multiple bugs ?)
>
> Any response is appreciated ;)
>
> Regards,
>
>
> --------------------long part----------------------------------------
>
> Architecture :
> --------------
>
> Old and non Apache CloudStack 3.0.2 release
> 1 Zone, 1 physical network, 1 pod
> 1 Virtual Router VM, 1 SSVM
> 4 CentOS 6.3 KVM clusters, primary storage GFS2 on iscsi storage
> Management Server on Vmware virtual machine
>
>
>
> Incidents :
> -----------
>
> Day 1 : Management Server DoSed by internal synchronization scripts (ldap
> to Cloudstack)
> Day 3 : DoS corrected, Management Server RAM and CPU ugraded, and rebooted
> (never rebooted in more than a year). Cloudstack
> is running again normally (vm creation/stop/start/console/...)
> Day 4 : (week-end) Network outage on core datacenter switch. Network
> unstable 2 days.
>
> Symptoms :
> ----------
>
> Day 7 : The network is operationnal but most of VMs down (250 of 300)
> since Day 4.
> Libvirt configuration (/etc/libvirt.d/qemu/VMuid.xml erased).
>
> VirtualRouter VM fileystem was on of them. Filesystem corruption prevented
> it to reboot normally.
>
> Survivors VMs are on the same KVM/GFS2 Cluster.
> SSVM is one of them. Messages on the console indicates she was temporarily
> in read-only mode
>
> Hard way to revival (actions):
> -----------------------------
>
> 1. VirtualRouter VM destructed by an administrator, to let CloudStack
> recreate it from template.
>
> BUT :)
>
> the SystemVM KVM Template is not available. Status in GUI is "CONNECTION
> REFUSED".
> The url from where it was downloaded during install is no more valid (old
> and unavailable
> internal mirror server  instead of http://download.cloud.com)
>
> => we are unable to start again VMs stopped and create new ones
>
> 2. Manual download on the Managment Server of the template, like in a
> fresh install
>
> ---
> /usr/lib64/cloud/agent/scripts/storage/secondary/cloud-install-sys-tmplt
> -m /mnt/secondary/  -u
> http://ourworkingmirror/repository/cloudstack-downloads/acton-systemvm-02062012.qcow2.bz2-h kvm -F
> ---
>
> It's no sufficient. mysql table template_host_ref does not change. Even
> when changing url in mysql tables.
> We still have "CONNECTION REFUSED" on template status in mysql and on the
> GUI
>
> 3. after analysis, we needed to alter manualy mysql tables (template_id of
> systemVM KVM was x) :
>
> ---
> update template_host_ref set download_state='DOWNLOADED' where
> template_id=x;
> update template_host_ref set job_id='NULL' where template_id=x; <= may be
> useless
> update template_host_ref set job_id='NULL' where template_id=x; <= may be
> useless
> ---
>
> 4. As in MySQL, status on GUI is DOWNLOADED
>
> 5. Poweron of a stopped VM, Cloudstack builds a new VirtualRouter VM and
> we can let users
> start manually their stopped VM
>
>
> Questions :
> -----------
>
> 1. What did stop and destroyed the libvirt domains of our VMs ? There's
> some part
> of code who could do this, but I'm not sure
>
> 2. Is it possible that Cloudstack triggered autonomously the re-download
> of the
> systemVM template ? Or has it to be an human interaction.
>
> 3. In 4.x is the risk of a corrupted, or systemVM template with a bad
> status
> still present. Is there any warning more than a simple "connexion refused"
> not
> really visible as an alert ?
>
> 4. Is Cloudstack retrying by default to restart VMs who should be up, or do
> we need configuration for this ?
>
>
> --------------------end of long
> part----------------------------------------
>
>
> --
> Laurent Steff
>
> DSI/SESI
> http://www.inria.fr/
>