Posted to issues@cloudstack.apache.org by "Dave Garbus (JIRA)" <ji...@apache.org> on 2014/01/12 21:53:51 UTC

[jira] [Created] (CLOUDSTACK-5859) [HA] Shared storage failure reboot loop; VMs with Local storage brought offline

Dave Garbus created CLOUDSTACK-5859:
---------------------------------------

             Summary: [HA] Shared storage failure reboot loop; VMs with Local storage brought offline
                 Key: CLOUDSTACK-5859
                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-5859
             Project: CloudStack
          Issue Type: Bug
      Security Level: Public (Anyone can view this level - this is the default.)
          Components: KVM
    Affects Versions: 4.2.0
         Environment: RHEL/CentOS 6.4 with KVM
            Reporter: Dave Garbus
            Priority: Critical


We have a group of 13 KVM servers added to a single cluster within CloudStack. All VMs use local hypervisor storage, with the exception of one that was configured to use NFS-based primary storage with an HA service offering.

An issue occurred with the disk responsible for serving the NFS mount, and the mount was put into a read-only state. Shortly after, every host in the cluster rebooted and remained in a reboot loop until I put the primary storage into maintenance mode. These messages appeared in agent.log on each of the KVM hosts:

2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout, retry: 4
2014-01-12 02:40:20,953 WARN  [kvm.resource.KVMHAMonitor] (Thread-137180:null) write heartbeat failed: timeout; reboot the host
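
For context, here is a minimal sketch (in Java; not the actual CloudStack source, and the heartbeat path and retry budget are assumptions) of the behavior those log lines suggest: the agent's HA monitor writes a heartbeat file to each NFS primary storage pool, retries on failure, and self-fences by rebooting the host once the retries run out:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.time.Instant;

    public class HeartbeatSketch {
        private static final int MAX_RETRIES = 5;            // assumed retry budget
        private static final Path HEARTBEAT_FILE =
                Paths.get("/mnt/<pool>/KVMHA/hb-host1");      // hypothetical path

        public static void main(String[] args) throws IOException {
            for (int attempt = 0; attempt < MAX_RETRIES; attempt++) {
                try {
                    // A successful write proves this host can still reach the pool.
                    Files.write(HEARTBEAT_FILE, Instant.now().toString().getBytes());
                    return;                                   // heartbeat OK, done
                } catch (IOException e) {
                    System.err.printf("write heartbeat failed: %s, retry: %d%n",
                            e.getMessage(), attempt);
                }
            }
            // Retries exhausted: the monitor assumes the host itself is bad and
            // reboots it, even though only the shared pool may be broken and
            // every local-storage VM on the host is still healthy.
            System.err.println("write heartbeat failed; reboot the host");
            Runtime.getRuntime().exec(new String[] {"reboot"});
        }
    }

Note that a read-only NFS mount makes the heartbeat write fail deterministically, so every host mounting that pool hits the reboot path at roughly the same time, which matches the cluster-wide loop we observed.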

In essence, a single HA-enabled VM was able to bring down an entire KVM cluster that was hosting a number of VMs with local storage. The fencing logic needs to be improved to account for cases where both local and shared storage are used.
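
As a rough illustration of the fix being suggested (names such as haVmsUsingFailedPool are hypothetical, not existing CloudStack APIs), the reboot decision could be gated on whether the host actually runs HA VMs backed by the failed pool:

    import java.util.List;

    public class FencingGuardSketch {
        // Decide whether a heartbeat failure on one storage pool justifies
        // rebooting the whole host. Only self-fence when doing so protects
        // an HA VM that actually depends on the failed pool.
        static boolean shouldRebootHost(List<String> haVmsUsingFailedPool) {
            // No dependent HA VMs: rebooting would only take down healthy
            // local-storage VMs, so degrade/alert on the pool instead.
            return !haVmsUsingFailedPool.isEmpty();
        }

        public static void main(String[] args) {
            // Our case: this host ran only local-storage VMs, none on the pool.
            System.out.println(shouldRebootHost(List.of()));  // prints: false
        }
    }

With a guard like this, the one HA VM on the NFS pool could still be fenced and restarted elsewhere, without power-cycling hosts whose running VMs never touch that pool.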


