Posted to users@cloudstack.apache.org by Matt Foley <mf...@hortonworks.com> on 2013/09/17 01:58:32 UTC

Help! After network outage, can't start System VMs; focused debug info attached

We had a planned network outage this weekend, which inadvertently resulted
in making the NFS Shared Primary Storage (used by System VMs) unavailable
for a day and a half.  (Guest VMs use local storage only, but System VMs
use shared storage only.)  Cloudstack was not brought down prior to the
outage.

After the network came back, we gracefully brought down all services
including cloudstack-management, mysql, and NFS, then rebooted all
servers in the cluster and the NFS server (to make sure there were no
stale file handles), then brought up services in the appropriate order.
We also checked mysql for table corruption and found none, and
confirmed that the NFS volumes are mountable from all hosts; in fact,
Shared Primary Storage is being mounted by cloudstack on hosts as
usual, under /mnt/<uuid>.
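
A minimal spot-check of this from a host might look roughly like the
following (the export IP and pool uuid are the ones quoted later in
this thread; adjust them for your environment):

showmount -e 10.42.1.101                            # is the export visible from the host?
mount | grep 9c6fd9a3-43e5-389a-9594-faecf178b4b9   # is cloudstack's mount present?
touch /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9/rw-test \
  && rm /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9/rw-test   # is the mount writable?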

Nevertheless, when we try to bring up the cluster, we fail to start the
system VMs with errors "InsufficientServerCapacityException: Unable to
create a deployment for VM".  The cause is not really insufficient
capacity, as actual usage of resources is tiny; these error messages
are false explanations of the failure to create a primary storage
volume for the System VMs.

Digging into management-server.log, the core issue seems to be captured
in the ~160-line snippet from the log attached to this message as
cloudstack_debug_2013.09.16.log.  The only Shared Primary Storage pool
is pool 201, named "cs-primary".  It is mounted on all hosts as
/mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.  The log
shows the management server correctly identifying a particular host as
being able to access pool 201, then trying to allocate a primary storage
volume using the template with uuid f23a16e7-b628-429e-83e1-698935588465.
 It fails, but I cannot tell why.  I suspect its claim that "Template 3 has
already been downloaded to pool 201" is false, but I don't know how to
check this (or fix if wrong).

Any guidance for further debugging or fixing this would be GREATLY
appreciated.
Thanks,
--Matt

-- 
CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to 
which it is addressed and may contain information that is confidential, 
privileged and exempt from disclosure under applicable law. If the reader 
of this message is not the intended recipient, you are hereby notified that 
any printing, copying, dissemination, distribution, disclosure or 
forwarding of this communication is strictly prohibited. If you have 
received this communication in error, please contact the sender immediately 
and delete it from your system. Thank You.

Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by sriharsha work <sr...@gmail.com>.
Does reconfiguring CloudStack to use local storage for system VMs
retain all the VMs that are already in CloudStack? There are about 200
VMs running in our CloudStack. What about the VM templates, snapshots,
and everything else that was already in CloudStack?

I mean, would it still restore CloudStack's original behavior as we had
it before our maintenance? Also, what are the disadvantages of using
local storage for System VMs?

Thanks
Sriharsha.

On Tue, Sep 17, 2013 at 2:52 AM, Kirk Kosinski <ki...@gmail.com>wrote:

> Okay, so system VMs are using NFS primary storage (I mis-read the OP,
> sorry).  Make sure the KVM hosts can mount and write to:
>
> 10.42.1.101:/srv/nfs/eng/cs-primary
>
> Also check libvirtd.log for errors.
>
> If you're not making progress and want to get up and running ASAP, try
> reconfiguring CloudStack to use local storage for system VMs and
> (assuming this works) sort out the NFS primary storage problem later.
>
> Best regards,
> Kirk
>
> On 09/17/2013 02:22 AM, sriharsha work wrote:
> > Hi Kirk,
> >
> > Thanks for your reply. This is a blocker for us and currently affected
> > all of our work. It is very helpful to debug more into the issue. I have
> > a question.
> >
> > 1. What should the directory be when mounting [2] systemVM template
> > location on the nfs drive.
> >
> >
> > Error from agent.log on the host. Clearly it says some issue with the
> > libvirt pools. Can you please help me understand if anything else needs
> > to be addressed to get the issue resolved.
> >
> >
> > 2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-3:null) Request:Seq 14-1592393816:  { Cmd ,
> > MgmtId: 161340856362, via: 14, Ver: v1, Flags: 100111,
> > [{"storage.CreateCommand":{"vo
> >
> lId":9817,"pool":{"id":201,"uuid":"9c6fd9a3-43e5-389a-9594-faecf178b4b9","host":"10.42.1.101","path":"/srv/nfs/eng/cs-primary","port":2049,"type":"NetworkFilesystem"},"diskCharacteristics":{"size":725811200,"tags":[],"type":"ROOT","name":"ROOT-9736","useLocalStorage":false,"recreatable":true,"diskOfferingId":7,"volumeId":9817,"hyperType":"KVM"},"templateUrl":"f23a16e7-b628-429e-83e1-698935588465","wait":0}}]
> > }
> > 2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-3:null) Processing command:
> > com.cloud.agent.api.storage.CreateCommand
> > 2013-09-17 02:17:36,779 DEBUG [kvm.resource.LibvirtComputingResource]
> > (agentRequest-Handler-3:null) Failed to create volume:
> > com.cloud.utils.exception.CloudRuntimeException:
> > org.libvirt.LibvirtException: Storage volume not found: no storage vol
> > with matching name 'f23a16e7-b628-429e-83e1-698935588465'
> > 2013-09-17 02:17:36,781 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-3:null) Seq 14-1592393816:  { Ans: , MgmtId:
> > 161340856362, via: 14, Ver: v1, Flags: 110,
> >
> [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> > com.cloud.utils.exception.CloudRuntimeException\nMessage:
> > org.libvirt.LibvirtException: Storage volume not found: no storage vol
> > with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> > com.cloud.utils.exception.CloudRuntimeException:
> > org.libvirt.LibvirtException: Storage volume not found: no storage vol
> > with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
> >
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
> >
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
> >
> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
> >
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
> >
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> > com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> > com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> > com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> > java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> > 2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-4:null) Request:Seq 14-1592393817:  { Cmd ,
> > MgmtId: 161340856362, via: 14, Ver: v1, Flags: 100111,
> > [{"StopCommand":{"isProxy":false,"vmName":"s-9736-VM","wait":0}}] }
> > 2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
> > (agentRequest-Handler-4:null) Processing command:
> > com.cloud.agent.api.StopCommand
> > 2013-09-17 02:17:36,891 DEBUG [kvm.resource.LibvirtComputingResource]
> > (agentRequest-Handler-4:null) Failed to get dom xml:
> > org.libvirt.LibvirtException: Domain not found: no domain with matching
> > uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
> > 2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
> > (agentRequest-Handler-4:null) Failed to get dom xml:
> > org.libvirt.LibvirtException: Domain not found: no domain with matching
> > uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
> > 2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
> > (agentRequest-Handler-4:null) Try to stop the vm at first
> > 2013-09-17 02:17:36,895 DEBUG [kvm.resource.LibvirtComputingResource]
> > (agentRequest-Handler-4:null) Failed to stop VM :s-9736-VM :
> > org.libvirt.LibvirtException: Domain not found: no domain with matching
> > uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
> >         at org.libvirt.ErrorHandler.processError(Unknown Source)
> >         at org.libvirt.Connect.processError(Unknown Source)
> >         at org.libvirt.Connect.domainLookupByUUIDString(Unknown Source)
> >         at org.libvirt.Connect.domainLookupByUUID(Unknown Source)
> >         at
> >
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(LibvirtComputingResource.java:4023)
> >         at
> > com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(Libvi
> >
> >
> > Thanks
> > Sriharsha.
> >
> >
> > On Tue, Sep 17, 2013 at 1:41 AM, Kirk Kosinski <kirkkosinski@gmail.com
> > <ma...@gmail.com>> wrote:
> >
> >     Hi, here is the error:
> >
> >     2013-09-16 15:08:17,168 DEBUG [agent.transport.Request]
> >     (AgentManager-Handler-5:null) Seq 13-931004532: Processing:  { Ans: ,
> >     MgmtId: 161340856362, via: 13, Ver: v1, Flags: 110,
> >
> [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> >     com.cloud.utils.exception.CloudRuntimeException\nMessage:
> >     org.libvirt.LibvirtException: Storage volume not found: no storage
> vol
> >     with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> >     com.cloud.utils.exception.CloudRuntimeException:
> >     org.libvirt.LibvirtException: Storage volume not found: no storage
> vol
> >     with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
> >
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
> >
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
> >
> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
> >
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
> >
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> >     com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> >
> com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> >     com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
> >
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
> >
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> >     java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> >
> >     I'm not certain what volume it is complaining about, but I suspect
> >     secondary storage.  Log on to a host (in particular host 13 [1]
> since it
> >     is confirmed to suffer from the issue) and try to manually mount the
> >     full path of the directory with the system VM template of the
> secondary
> >     storage NFS share [2].  The idea is to confirm the share and
> >     subdirectories of the share are mountable.  Maybe during the
> maintenance
> >     some hosts changed IPs and/or the secondary storage NFS share
> >     permissions (or other settings) were messed up.
> >
> >     If the mount doesn't work, fix whatever is causing it.  If it does
> work,
> >     please collect additional info.  Enable DEBUG logging on the hosts
> [3]
> >     (if necessary), wait for the error to occur, and upload the agent.log
> >     from the host with the error.  It should have more details besides
> the
> >     exception shown in the management-server.log.  If you have a lot of
> >     hosts and don't want to enable DEBUG logging on every one,
> temporarily
> >     disable most of them and do it on the remaining few.
> >
> >     Best regards,
> >     Kirk
> >
> >     [1] "13" is the id of the host in the CloudStack database, so find
> out
> >     which host it is with:
> >     select * from `cloud`.`host` where id = 13 \G
> >
> >     [2] Something like:
> >     nfshost:/share/template/tmpl/2/123
> >
> >     [3] In /etc/cloudstack/agent/log4j-cloud.xml, set the Threshold for
> FILE
> >     and com.cloud to DEBUG.  Depending on the CloudStack version, it may
> or
> >     may not be enabled by default, and the path may be /etc/cloud/agent/.
> >
> >
> >     On 09/16/2013 07:36 PM, sriharsha work wrote:
> >     > Replying on behalf of Matt. We are able to write data to the Nfs
> >     drives.
> >     > That's not an issue.
> >     >
> >     > Thanks
> >     > Sriharsha
> >     >
> >     > Sent from my iPhone
> >     >
> >     >> On Sep 16, 2013, at 19:30, Ahmad Emneina <aemneina@gmail.com
> >     <ma...@gmail.com>> wrote:
> >     >>
> >     >> Try to mount your primary storage to a compute host and try to
> >     write to it.
> >     >> Your NFS server might not have come back up properly
> >     (settings-wise or all
> >     >> the relevant services).
> >     >>> On Sep 16, 2013 6:08 PM, "Matt Foley" <mfoley@hortonworks.com
> >     <ma...@hortonworks.com>> wrote:
> >     >>>
> >     >>> Thank you Chiradeep.  Log snippet now available as
> >     http://apaste.info/qBIB
> >     >>> --Matt
> >     >>>
> >     >>> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal <
> >     >>> Chiradeep.Vittal@citrix.com
> >     <ma...@citrix.com>> wrote:
> >     >>>
> >     >>>> Attachments are stripped. Can you paste (say at
> >     http://apaste.info/)
> >     >>>>
> >     >>>> From: Matt Foley <mfoley@hortonworks.com
> >     <ma...@hortonworks.com>>
> >     >>>> Date: Monday, September 16, 2013 4:58 PM
> >     >>>>
> >     >>>> We had a planned network outage this weekend, which
> inadvertently
> >     >>> resulted
> >     >>>> in making the NFS Shared Primary Storage (used by System VMs)
> >     unavailable
> >     >>>> for a day and a half.  (Guest VMs use local storage only, but
> >     System VMs
> >     >>>> use shared storage only.)  Cloudstack was not brought down
> >     prior to the
> >     >>>> outage.
> >     >>>>
> >     >>>> After network came back, we gracefully brought down all services
> >     >>> including
> >     >>>> cloudstack-management, mysql, and NFS, then actually rebooted
> >     all servers
> >     >>>> in the cluster and the NFS server (to make sure no stale file
> >     handles),
> >     >>>> then brought up services in the appropriate order.  Also
> >     checked mysql
> >     >>> for
> >     >>>> table corruption, and found none.  Confirmed that the NFS
> >     volumes are
> >     >>>> mountable from all hosts, and in fact Shared Primary Storage is
> >     being
> >     >>>> mounted by cloudstack on hosts as usual, under /mnt/<uuid>.
> >     >>>>
> >     >>>> Nevertheless, when try to bring up the cluster, we fail to
> >     start the
> >     >>>> system VMs, with errors "InsufficientServerCapacityException:
> >     Unable to
> >     >>>> create a deployment for VM".  The cause is not really
> insufficient
> >     >>>> capacity, as actual usage of resources is tiny; these error
> >     messages are
> >     >>>> false explanations of the failure to create primary storage
> >     volume for
> >     >>> the
> >     >>>> System VMs.
> >     >>>>
> >     >>>> Digging into management-server.log, the core issue seems to be
> >     the ~160
> >     >>>> line snippet from the log attached to this message as
> >     >>>> cloudstack_debug_2013.09.16.log. The only Shared Primary
> >     Storage pool is
> >     >>>> pool 201, named "cs-primary".  It is mounted on all hosts as
> >     >>>> /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.
> >      The log
> >     >>>> shows the management server correctly identifying a particular
> >     host as
> >     >>>> being able to access pool 201, then trying to allocate a
> >     primary storage
> >     >>>> volume using the template with uuid
> >     f23a16e7-b628-429e-83e1-698935588465.
> >     >>>> It fails, but I cannot tell why.  I suspect its claim that
> >     "Template 3
> >     >>> has
> >     >>>> already been downloaded to pool 201" is false, but I don't know
> >     how to
> >     >>>> check this (or fix if wrong).
> >     >>>>
> >     >>>> Any guidance for further debugging or fixing this would be
> GREATLY
> >     >>>> appreciated.
> >     >>>> Thanks,
> >     >>>> --Matt
> >     >>>
> >     >>>
> >
> >
> >
> >
> > --
> > Thanks & Regards
> > Sriharsha Devineni
>



-- 
Thanks & Regards
Sriharsha Devineni

Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by Kirk Kosinski <ki...@gmail.com>.
Okay, so system VMs are using NFS primary storage (I mis-read the OP,
sorry).  Make sure the KVM hosts can mount and write to:

10.42.1.101:/srv/nfs/eng/cs-primary

Also check libvirtd.log for errors.
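
A sketch of both checks on a KVM host, using the paths from this thread
(adjust to your environment; on a stock install libvirtd.log usually
lives under /var/log/libvirt/):

mkdir -p /mnt/cs-primary-test
mount -t nfs 10.42.1.101:/srv/nfs/eng/cs-primary /mnt/cs-primary-test
touch /mnt/cs-primary-test/rw-test && rm /mnt/cs-primary-test/rw-test   # write test
umount /mnt/cs-primary-test
virsh pool-list --all                                     # does libvirt still see the primary pool?
virsh pool-refresh 9c6fd9a3-43e5-389a-9594-faecf178b4b9   # on KVM the pool is typically named by its CloudStack uuid
virsh vol-list 9c6fd9a3-43e5-389a-9594-faecf178b4b9       # is the template volume listed here?
tail -n 100 /var/log/libvirt/libvirtd.log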

If you're not making progress and want to get up and running ASAP, try
reconfiguring CloudStack to use local storage for system VMs and
(assuming this works) sort out the NFS primary storage problem later.
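
One way to do that reconfiguration, as a rough sketch (the global
setting is system.vm.use.local.storage, changeable in the UI under
Global Settings or directly in the database; local storage may also
need to be enabled for the zone, and a management-server restart is
normally needed for it to take effect. The database name and
credentials below are assumptions):

mysql -u cloud -p cloud -e \
  "UPDATE configuration SET value = 'true' WHERE name = 'system.vm.use.local.storage';"
service cloudstack-management restart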

Best regards,
Kirk

On 09/17/2013 02:22 AM, sriharsha work wrote:
> Hi Kirk,
> 
> Thanks for your reply. This is a blocker for us and currently affected
> all of our work. It is very helpful to debug more into the issue. I have
> a question.
> 
> 1. What should the directory be when mounting [2] systemVM template
> location on the nfs drive.
> 
> 
> Error from agent.log on the host. Clearly it says some issue with the
> libvirt pools. Can you please help me understand if anything else needs
> to be addressed to get the issue resolved. 
> 
> 
> 2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-3:null) Request:Seq 14-1592393816:  { Cmd ,
> MgmtId: 161340856362, via: 14, Ver: v1, Flags: 100111,
> [{"storage.CreateCommand":{"vo
> lId":9817,"pool":{"id":201,"uuid":"9c6fd9a3-43e5-389a-9594-faecf178b4b9","host":"10.42.1.101","path":"/srv/nfs/eng/cs-primary","port":2049,"type":"NetworkFilesystem"},"diskCharacteristics":{"size":725811200,"tags":[],"type":"ROOT","name":"ROOT-9736","useLocalStorage":false,"recreatable":true,"diskOfferingId":7,"volumeId":9817,"hyperType":"KVM"},"templateUrl":"f23a16e7-b628-429e-83e1-698935588465","wait":0}}]
> }
> 2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-3:null) Processing command:
> com.cloud.agent.api.storage.CreateCommand
> 2013-09-17 02:17:36,779 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-3:null) Failed to create volume:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'
> 2013-09-17 02:17:36,781 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-3:null) Seq 14-1592393816:  { Ans: , MgmtId:
> 161340856362, via: 14, Ver: v1, Flags: 110,
> [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> com.cloud.utils.exception.CloudRuntimeException\nMessage:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> 2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-4:null) Request:Seq 14-1592393817:  { Cmd ,
> MgmtId: 161340856362, via: 14, Ver: v1, Flags: 100111,
> [{"StopCommand":{"isProxy":false,"vmName":"s-9736-VM","wait":0}}] }
> 2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-4:null) Processing command:
> com.cloud.agent.api.StopCommand
> 2013-09-17 02:17:36,891 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-4:null) Failed to get dom xml:
> org.libvirt.LibvirtException: Domain not found: no domain with matching
> uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
> 2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-4:null) Failed to get dom xml:
> org.libvirt.LibvirtException: Domain not found: no domain with matching
> uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
> 2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-4:null) Try to stop the vm at first
> 2013-09-17 02:17:36,895 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-4:null) Failed to stop VM :s-9736-VM :
> org.libvirt.LibvirtException: Domain not found: no domain with matching
> uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
>         at org.libvirt.ErrorHandler.processError(Unknown Source)
>         at org.libvirt.Connect.processError(Unknown Source)
>         at org.libvirt.Connect.domainLookupByUUIDString(Unknown Source)
>         at org.libvirt.Connect.domainLookupByUUID(Unknown Source)
>         at
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(LibvirtComputingResource.java:4023)
>         at
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(Libvi
> 
> 
> Thanks 
> Sriharsha.
> 
> 
> On Tue, Sep 17, 2013 at 1:41 AM, Kirk Kosinski <kirkkosinski@gmail.com
> <ma...@gmail.com>> wrote:
> 
>     Hi, here is the error:
> 
>     2013-09-16 15:08:17,168 DEBUG [agent.transport.Request]
>     (AgentManager-Handler-5:null) Seq 13-931004532: Processing:  { Ans: ,
>     MgmtId: 161340856362, via: 13, Ver: v1, Flags: 110,
>     [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
>     com.cloud.utils.exception.CloudRuntimeException\nMessage:
>     org.libvirt.LibvirtException: Storage volume not found: no storage vol
>     with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
>     com.cloud.utils.exception.CloudRuntimeException:
>     org.libvirt.LibvirtException: Storage volume not found: no storage vol
>     with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
>     com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
>     com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
>     com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
>     com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
>     com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
>     com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
>     com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
>     com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
>     java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> 
>     I'm not certain what volume it is complaining about, but I suspect
>     secondary storage.  Log on to a host (in particular host 13 [1] since it
>     is confirmed to suffer from the issue) and try to manually mount the
>     full path of the directory with the system VM template of the secondary
>     storage NFS share [2].  The idea is to confirm the share and
>     subdirectories of the share are mountable.  Maybe during the maintenance
>     some hosts changed IPs and/or the secondary storage NFS share
>     permissions (or other settings) were messed up.
> 
>     If the mount doesn't work, fix whatever is causing it.  If it does work,
>     please collect additional info.  Enable DEBUG logging on the hosts [3]
>     (if necessary), wait for the error to occur, and upload the agent.log
>     from the host with the error.  It should have more details besides the
>     exception shown in the management-server.log.  If you have a lot of
>     hosts and don't want to enable DEBUG logging on every one, temporarily
>     disable most of them and do it on the remaining few.
> 
>     Best regards,
>     Kirk
> 
>     [1] "13" is the id of the host in the CloudStack database, so find out
>     which host it is with:
>     select * from `cloud`.`host` where id = 13 \G
> 
>     [2] Something like:
>     nfshost:/share/template/tmpl/2/123
> 
>     [3] In /etc/cloudstack/agent/log4j-cloud.xml, set the Threshold for FILE
>     and com.cloud to DEBUG.  Depending on the CloudStack version, it may or
>     may not be enabled by default, and the path may be /etc/cloud/agent/.
> 
> 
>     On 09/16/2013 07:36 PM, sriharsha work wrote:
>     > Replying on behalf of Matt. We are able to write data to the Nfs
>     drives.
>     > That's not an issue.
>     >
>     > Thanks
>     > Sriharsha
>     >
>     > Sent from my iPhone
>     >
>     >> On Sep 16, 2013, at 19:30, Ahmad Emneina <aemneina@gmail.com
>     <ma...@gmail.com>> wrote:
>     >>
>     >> Try to mount your primary storage to a compute host and try to
>     write to it.
>     >> Your NFS server might not have come back up properly
>     (settings-wise or all
>     >> the relevant services).
>     >>> On Sep 16, 2013 6:08 PM, "Matt Foley" <mfoley@hortonworks.com
>     <ma...@hortonworks.com>> wrote:
>     >>>
>     >>> Thank you Chiradeep.  Log snippet now available as
>     http://apaste.info/qBIB
>     >>> --Matt
>     >>>
>     >>> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal <
>     >>> Chiradeep.Vittal@citrix.com
>     <ma...@citrix.com>> wrote:
>     >>>
>     >>>> Attachments are stripped. Can you paste (say at
>     http://apaste.info/)
>     >>>>
>     >>>> From: Matt Foley <mfoley@hortonworks.com
>     <ma...@hortonworks.com>>
>     >>>> Date: Monday, September 16, 2013 4:58 PM
>     >>>>
>     >>>> We had a planned network outage this weekend, which inadvertently
>     >>> resulted
>     >>>> in making the NFS Shared Primary Storage (used by System VMs)
>     unavailable
>     >>>> for a day and a half.  (Guest VMs use local storage only, but
>     System VMs
>     >>>> use shared storage only.)  Cloudstack was not brought down
>     prior to the
>     >>>> outage.
>     >>>>
>     >>>> After network came back, we gracefully brought down all services
>     >>> including
>     >>>> cloudstack-management, mysql, and NFS, then actually rebooted
>     all servers
>     >>>> in the cluster and the NFS server (to make sure no stale file
>     handles),
>     >>>> then brought up services in the appropriate order.  Also
>     checked mysql
>     >>> for
>     >>>> table corruption, and found none.  Confirmed that the NFS
>     volumes are
>     >>>> mountable from all hosts, and in fact Shared Primary Storage is
>     being
>     >>>> mounted by cloudstack on hosts as usual, under /mnt/<uuid>.
>     >>>>
>     >>>> Nevertheless, when try to bring up the cluster, we fail to
>     start the
>     >>>> system VMs, with errors "InsufficientServerCapacityException:
>     Unable to
>     >>>> create a deployment for VM".  The cause is not really insufficient
>     >>>> capacity, as actual usage of resources is tiny; these error
>     messages are
>     >>>> false explanations of the failure to create primary storage
>     volume for
>     >>> the
>     >>>> System VMs.
>     >>>>
>     >>>> Digging into management-server.log, the core issue seems to be
>     the ~160
>     >>>> line snippet from the log attached to this message as
>     >>>> cloudstack_debug_2013.09.16.log. The only Shared Primary
>     Storage pool is
>     >>>> pool 201, named "cs-primary".  It is mounted on all hosts as
>     >>>> /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.
>      The log
>     >>>> shows the management server correctly identifying a particular
>     host as
>     >>>> being able to access pool 201, then trying to allocate a
>     primary storage
>     >>>> volume using the template with uuid
>     f23a16e7-b628-429e-83e1-698935588465.
>     >>>> It fails, but I cannot tell why.  I suspect its claim that
>     "Template 3
>     >>> has
>     >>>> already been downloaded to pool 201" is false, but I don't know
>     how to
>     >>>> check this (or fix if wrong).
>     >>>>
>     >>>> Any guidance for further debugging or fixing this would be GREATLY
>     >>>> appreciated.
>     >>>> Thanks,
>     >>>> --Matt
>     >>>
>     >>>
> 
> 
> 
> 
> -- 
> Thanks & Regards
> Sriharsha Devineni

Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by sriharsha work <sr...@gmail.com>.
Hi Kirk,

Thanks for your reply. This is a blocker for us and has currently
affected all of our work. Your reply is very helpful for digging
further into the issue. I have a question.

1. What should the directory be when mounting the systemVM template
location [2] on the NFS drive?


Here is the error from agent.log on the host. Clearly it points to
some issue with the libvirt pools. Can you please help me understand
whether anything else needs to be addressed to get the issue resolved?


2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-3:null) Request:Seq 14-1592393816:  { Cmd , MgmtId:
161340856362, via: 14, Ver: v1, Flags: 100111,
[{"storage.CreateCommand":{"vo
lId":9817,"pool":{"id":201,"uuid":"9c6fd9a3-43e5-389a-9594-faecf178b4b9","host":"10.42.1.101","path":"/srv/nfs/eng/cs-primary","port":2049,"type":"NetworkFilesystem"},"diskCharacteristics":{"size":725811200,"tags":[],"type":"ROOT","name":"ROOT-9736","useLocalStorage":false,"recreatable":true,"diskOfferingId":7,"volumeId":9817,"hyperType":"KVM"},"templateUrl":"f23a16e7-b628-429e-83e1-698935588465","wait":0}}]
}
2013-09-17 02:17:36,736 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-3:null) Processing command:
com.cloud.agent.api.storage.CreateCommand
2013-09-17 02:17:36,779 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-3:null) Failed to create volume:
com.cloud.utils.exception.CloudRuntimeException:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'
2013-09-17 02:17:36,781 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-3:null) Seq 14-1592393816:  { Ans: , MgmtId:
161340856362, via: 14, Ver: v1, Flags: 110,
[{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
com.cloud.utils.exception.CloudRuntimeException\nMessage:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
com.cloud.utils.exception.CloudRuntimeException:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-4:null) Request:Seq 14-1592393817:  { Cmd , MgmtId:
161340856362, via: 14, Ver: v1, Flags: 100111,
[{"StopCommand":{"isProxy":false,"vmName":"s-9736-VM","wait":0}}] }
2013-09-17 02:17:36,888 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-4:null) Processing command:
com.cloud.agent.api.StopCommand
2013-09-17 02:17:36,891 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-4:null) Failed to get dom xml:
org.libvirt.LibvirtException: Domain not found: no domain with matching
uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-4:null) Failed to get dom xml:
org.libvirt.LibvirtException: Domain not found: no domain with matching
uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
2013-09-17 02:17:36,893 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-4:null) Try to stop the vm at first
2013-09-17 02:17:36,895 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-4:null) Failed to stop VM :s-9736-VM :
org.libvirt.LibvirtException: Domain not found: no domain with matching
uuid 'fba58267-2f0b-3249-8cca-d99c4f843b5a'
        at org.libvirt.ErrorHandler.processError(Unknown Source)
        at org.libvirt.Connect.processError(Unknown Source)
        at org.libvirt.Connect.domainLookupByUUIDString(Unknown Source)
        at org.libvirt.Connect.domainLookupByUUID(Unknown Source)
        at
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(LibvirtComputingResource.java:4023)
        at
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.stopVM(Libvi


Thanks
Sriharsha.


On Tue, Sep 17, 2013 at 1:41 AM, Kirk Kosinski <ki...@gmail.com>wrote:

> Hi, here is the error:
>
> 2013-09-16 15:08:17,168 DEBUG [agent.transport.Request]
> (AgentManager-Handler-5:null) Seq 13-931004532: Processing:  { Ans: ,
> MgmtId: 161340856362, via: 13, Ver: v1, Flags: 110,
>
> [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> com.cloud.utils.exception.CloudRuntimeException\nMessage:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
>
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
>
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
>
> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
>
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
>
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
>
> I'm not certain what volume it is complaining about, but I suspect
> secondary storage.  Log on to a host (in particular host 13 [1] since it
> is confirmed to suffer from the issue) and try to manually mount the
> full path of the directory with the system VM template of the secondary
> storage NFS share [2].  The idea is to confirm the share and
> subdirectories of the share are mountable.  Maybe during the maintenance
> some hosts changed IPs and/or the secondary storage NFS share
> permissions (or other settings) were messed up.
>
> If the mount doesn't work, fix whatever is causing it.  If it does work,
> please collect additional info.  Enable DEBUG logging on the hosts [3]
> (if necessary), wait for the error to occur, and upload the agent.log
> from the host with the error.  It should have more details besides the
> exception shown in the management-server.log.  If you have a lot of
> hosts and don't want to enable DEBUG logging on every one, temporarily
> disable most of them and do it on the remaining few.
>
> Best regards,
> Kirk
>
> [1] "13" is the id of the host in the CloudStack database, so find out
> which host it is with:
> select * from `cloud`.`host` where id = 13 \G
>
> [2] Something like:
> nfshost:/share/template/tmpl/2/123
>
> [3] In /etc/cloudstack/agent/log4j-cloud.xml, set the Threshold for FILE
> and com.cloud to DEBUG.  Depending on the CloudStack version, it may or
> may not be enabled by default, and the path may be /etc/cloud/agent/.
>
>
> On 09/16/2013 07:36 PM, sriharsha work wrote:
> > Replying on behalf of Matt. We are able to write data to the Nfs drives.
> > That's not an issue.
> >
> > Thanks
> > Sriharsha
> >
> > Sent from my iPhone
> >
> >> On Sep 16, 2013, at 19:30, Ahmad Emneina <ae...@gmail.com> wrote:
> >>
> >> Try to mount your primary storage to a compute host and try to write to
> it.
> >> Your NFS server might not have come back up properly (settings-wise or
> all
> >> the relevant services).
> >>> On Sep 16, 2013 6:08 PM, "Matt Foley" <mf...@hortonworks.com> wrote:
> >>>
> >>> Thank you Chiradeep.  Log snippet now available as
> http://apaste.info/qBIB
> >>> --Matt
> >>>
> >>> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal <
> >>> Chiradeep.Vittal@citrix.com> wrote:
> >>>
> >>>> Attachments are stripped. Can you paste (say at http://apaste.info/)
> >>>>
> >>>> From: Matt Foley <mf...@hortonworks.com>
> >>>> Date: Monday, September 16, 2013 4:58 PM
> >>>>
> >>>> We had a planned network outage this weekend, which inadvertently
> >>> resulted
> >>>> in making the NFS Shared Primary Storage (used by System VMs)
> unavailable
> >>>> for a day and a half.  (Guest VMs use local storage only, but System
> VMs
> >>>> use shared storage only.)  Cloudstack was not brought down prior to
> the
> >>>> outage.
> >>>>
> >>>> After network came back, we gracefully brought down all services
> >>> including
> >>>> cloudstack-management, mysql, and NFS, then actually rebooted all
> servers
> >>>> in the cluster and the NFS server (to make sure no stale file
> handles),
> >>>> then brought up services in the appropriate order.  Also checked mysql
> >>> for
> >>>> table corruption, and found none.  Confirmed that the NFS volumes are
> >>>> mountable from all hosts, and in fact Shared Primary Storage is being
> >>>> mounted by cloudstack on hosts as usual, under /mnt/<uuid>.
> >>>>
> >>>> Nevertheless, when try to bring up the cluster, we fail to start the
> >>>> system VMs, with errors "InsufficientServerCapacityException: Unable
> to
> >>>> create a deployment for VM".  The cause is not really insufficient
> >>>> capacity, as actual usage of resources is tiny; these error messages
> are
> >>>> false explanations of the failure to create primary storage volume for
> >>> the
> >>>> System VMs.
> >>>>
> >>>> Digging into management-server.log, the core issue seems to be the
> ~160
> >>>> line snippet from the log attached to this message as
> >>>> cloudstack_debug_2013.09.16.log. The only Shared Primary Storage pool
> is
> >>>> pool 201, named "cs-primary".  It is mounted on all hosts as
> >>>> /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.  The log
> >>>> shows the management server correctly identifying a particular host as
> >>>> being able to access pool 201, then trying to allocate a primary
> storage
> >>>> volume using the template with uuid
> f23a16e7-b628-429e-83e1-698935588465.
> >>>> It fails, but I cannot tell why.  I suspect its claim that "Template 3
> >>> has
> >>>> already been downloaded to pool 201" is false, but I don't know how to
> >>>> check this (or fix if wrong).
> >>>>
> >>>> Any guidance for further debugging or fixing this would be GREATLY
> >>>> appreciated.
> >>>> Thanks,
> >>>> --Matt
> >>>
> >>>
>



-- 
Thanks & Regards
Sriharsha Devineni

Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by Dean Kamali <de...@gmail.com>.
I went through your logs; there seems to be an issue with secondary
storage. Have you tried creating another secondary storage, and maybe
preparing new system templates? See the
link<http://cloudstack.apache.org/docs/en-US/Apache_CloudStack/4.1.1/html/Installation_Guide/management-server-install-flow.html#prepare-system-vm-template>
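
As a rough sketch of what preparing the system VM template looks like
for KVM, per the 4.1-era installation guide linked above (the script
path, download URL, and flags are the ones from that guide; substitute
your own secondary storage host and share, and verify the details
against the guide for your CloudStack version):

mkdir -p /mnt/secondary
mount -t nfs <secondary-storage-host>:/<secondary-share> /mnt/secondary
/usr/share/cloudstack-common/scripts/storage/secondary/cloud-install-sys-tmplt \
  -m /mnt/secondary \
  -u http://download.cloud.com/templates/acton/acton-systemvm-02062012.qcow2.bz2 \
  -h kvm -F
umount /mnt/secondary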

What are you using for your backend storage? Are you running NFSv4?

Please attach a full copy of your log :)

Best,

Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by Matt Foley <mf...@hortonworks.com>.
Thank you, Kirk, that was most helpful.

I currently have the system running with SystemVMs using local storage,
with the loss of all our (supposedly) persistent volumes in shared storage.
 After we get past the emergency I will try Kirk's suggestion of deleting
the stale line from the template_spool_ref table.  In the meantime, here's
what I learned:

The volume f23a16e7-b628-429e-83e1-698935588465 is present in the
"template_spool_ref" table with download_state = DOWNLOADED, but it is
not in the cs-primary pool location on NFS.
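
For anyone reproducing this check, the two sides of the comparison look
roughly like this (the template uuid and pool mount point are the ones
from this thread; the database name matches the queries used elsewhere
in the thread, and the credentials are whatever your management server
is configured with):

mysql -u cloud -p cloud -e \
  "SELECT * FROM template_spool_ref WHERE local_path = 'f23a16e7-b628-429e-83e1-698935588465'\G"
ls -l /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9/f23a16e7-b628-429e-83e1-698935588465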

Indeed, there are no volume or template objects in the cs-primary pool
(shared Primary) location, even though there should be several.  To see
what they should look like, I created new DATA and ROOT disks in shared
storage, and both worked fine: DATA created a volume object in the NFS
directory, and ROOT created both a template object and a volume object,
with a path name bound to the volume id via the "volumes" table in the
db, and in the case of the ROOT volume a new DOWNLOADED entry in the
"template_spool_ref" table.

The shared Primary storage should have contained several high-value DATA
volumes, as well as the SystemVM template the system obviously thought was
previously downloaded.  I infer the Primary storage was deleted and
recreated by Cloudstack when the NFS storage became available after an
outage of more than 24 hours.  This is disappointing, and rather ironic,
since shared storage was chosen so that it would be MORE persistent, not
more vulnerable.

I suspect implicit re-registration of Primary storage, after a lengthy NFS
outage, activated the logic of erasing the Primary storage upon
registration.  That would be a major bug, if so.

Thank you for your kind help.
--Matt


On Tue, Sep 17, 2013 at 7:01 PM, Kirk Kosinski <ki...@gmail.com>wrote:

> Hi, secondary storage is only mounted on an as-needed basis.  When a KVM
> or XenServer host needs to do something on secondary storage, it will
> mount the full path it needs (e.g. nfshost:/share/template/tmpl/2/123),
> do what it needs to do, and unmount it.
>
> The error seems to be that CloudStack is looking for and not finding a
> volume (qcow2 disk) named "f23a16e7-b628-429e-83e1-698935588465" on the
> NFS primary storage.  This file seems to be the system VM template.
> Does this file exist or not?  I'd guess not, since CS says it can't find
> it.
>
> Check the status of this volume in the template_spool_ref table:
> SELECT * FROM template_spool_ref where local_path =
> 'f23a16e7-b628-429e-83e1-698935588465'\G
>
> If it shows up in the database as download_state = DOWNLOADED but it
> does not exist on primary storage, back up the cloud database, then
> delete the row in template_spool_ref.  This should force CS to
> re-download it (i.e. copy it from secondary storage to primary again and
> use it to deploy system VMs... and create a new entry for it in
> template_spool_ref).
>
> If it does exist on primary storage, maybe the file is corrupt.  Compare
> the size and md5sum to the original on secondary storage.  Let us know
> how it goes.
>
> Best regards,
> Kirk
>
> On 09/17/2013 04:47 PM, Matt Foley wrote:
> > Hi,
> > I've now heard that this problem, of Cloudstack being messed up after
> > interruption of the NFS shared storage access, is well known.  Does
> > anyone have a fix or work-around?
> >
> > Kirk, thanks for your help so far.
> > Both the master and the host servers can mount both primary and
> > secondary stores, and read and write them.  No permissions nor IP access
> > seem broken.
> >
> > I also checked the log levels on the hosts, and both FILE and com.cloud
> > were already set to DEBUG.  I tried setting them to TRACE, but got no
> > additional useful info.
> >
> > On the host, I tried just restarting the cloudstack-agent service.  In
> > the resulting logs, the following snippet occurs.  The best
> > interpretation I can make of it is that "no storage vol with matching
> > name 'f23a16e7-b628-429e-83e1-698935588465'' is the key issue, and that
> > should relate to secondary storage, where the templates are stored.  But
> > this uuid doesn't seem to be related to the actual secondary storage
> > pool, whose uuid is b7fd7b11-c0f7-4717-8343-ff6fb9bff860.  The primary
> > storage pool is uuid 9c6fd9a3-43e5-389a-9594-faecf178b4b9, and it seems
> > to be properly automatically mounted on all hosts and the master.
> >
> > ** It concerns me that the secondary storage pool does NOT seem to be
> > automatically mounted.  Is it supposed to be?  If not, how are the hosts
> > supposed to find the templates, before a System Router VM can even be
> > set up?
> >
> > Below is the relevant host agent.log snippet, and also a dump of the
> > storage_pool table from mysql.
> >
> > Thanks in advance for any suggestions.
> > --Matt
> >
> > ======================
>
...truncated...


Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by Kirk Kosinski <ki...@gmail.com>.
Hi, secondary storage is only mounted on an as-needed basis.  When a KVM
or XenServer host needs to do something on secondary storage, it will
mount the full path it needs (e.g. nfshost:/share/template/tmpl/2/123),
do what it needs to do, and unmount it.
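
A sketch of that manual test-mount, using Kirk's placeholder path
(substitute your secondary storage host, share, and the directory that
actually holds the system VM template):

mkdir -p /mnt/tmpl-test
mount -t nfs nfshost:/share/template/tmpl/2/123 /mnt/tmpl-test
ls -l /mnt/tmpl-test     # the template file and its template.properties should be visible
umount /mnt/tmpl-test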

The error seems to be that CloudStack is looking for and not finding a
volume (qcow2 disk) named "f23a16e7-b628-429e-83e1-698935588465" on the
NFS primary storage.  This file seems to be the system VM template.
Does this file exist or not?  I'd guess not, since CS says it can't find it.

Check the status of this volume in the template_spool_ref table:
SELECT * FROM template_spool_ref where local_path =
'f23a16e7-b628-429e-83e1-698935588465'\G

If it shows up in the database as download_state = DOWNLOADED but it
does not exist on primary storage, back up the cloud database, then
delete the row in template_spool_ref.  This should force CS to
re-download it (i.e. copy it from secondary storage to primary again and
use it to deploy system VMs... and create a new entry for it in
template_spool_ref).
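
A minimal sketch of that backup-then-delete step, assuming the usual
'cloud' database (run the SELECT above first and make sure it returns
only the stale row before deleting):

mysqldump -u cloud -p cloud > cloud-db-backup-$(date +%F).sql
mysql -u cloud -p cloud -e \
  "DELETE FROM template_spool_ref WHERE local_path = 'f23a16e7-b628-429e-83e1-698935588465';"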

If it does exist on primary storage, maybe the file is corrupt.  Compare
the size and md5sum to the original on secondary storage.  Let us know
how it goes.

Best regards,
Kirk

On 09/17/2013 04:47 PM, Matt Foley wrote:
> Hi,
> I've now heard that this problem, of Cloudstack being messed up after
> interruption of the NFS shared storage access, is well known.  Does
> anyone have a fix or work-around?
> 
> Kirk, thanks for your help so far.
> Both the master and the host servers can mount both primary and
> secondary stores, and read and write them.  No permissions nor IP access
> seem broken.
> 
> I also checked the log levels on the hosts, and both FILE and com.cloud
> were already set to DEBUG.  I tried setting them to TRACE, but got no
> additional useful info.
> 
> On the host, I tried just restarting the cloudstack-agent service.  In
> the resulting logs, the following snippet occurs.  The best
> interpretation I can make of it is that "no storage vol with matching
> name 'f23a16e7-b628-429e-83e1-698935588465'' is the key issue, and that
> should relate to secondary storage, where the templates are stored.  But
> this uuid doesn't seem to be related to the actual secondary storage
> pool, whose uuid is b7fd7b11-c0f7-4717-8343-ff6fb9bff860.  The primary
> storage pool is uuid 9c6fd9a3-43e5-389a-9594-faecf178b4b9, and it seems
> to be properly automatically mounted on all hosts and the master.  
> 
> ** It concerns me that the secondary storage pool does NOT seem to be
> automatically mounted.  Is it supposed to be?  If not, how are the hosts
> supposed to find the templates, before a System Router VM can even be
> set up?
> 
> Below is the relevant host agent.log snippet, and also a dump of the
> storage_pool table from mysql.
> 
> Thanks in advance for any suggestions.
> --Matt
> 
> ======================
> 2013-09-17 15:26:46,012 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-4:null) Processing command:
> com.cloud.agent.api.storage.CreateCommand
> 2013-09-17 15:26:46,050 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-4:null) Failed to create volume:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'
> 2013-09-17 15:26:46,051 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-4:null) Seq 14-606340093:  { Ans: , MgmtId:
> 161340856362, via: 14, Ver: v1, Flags: 110,
> [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> com.cloud.utils.exception.CloudRuntimeException\nMessage:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b62\
> 8-429e-83e1-698935588465'\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> 2013-09-17 15:26:46,192 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-1:null) Request:Seq 14-606340094:  { Cmd , MgmtId:
> 161340856362, via: 14, Ver: v1, Flags: 100111,
> [{"storage.CreateCommand":{"volId":10510,"pool":{"id":201,"uuid":"9c6fd9a3-43e5-389a-9594-faecf178b4b9","host":"10.42.1.101","path":"/srv/nfs/eng/cs-primary","port":2049,"type":"NetworkFilesystem"},"diskCharacteristics":{"size":725811200,"tags":[],"type":"ROOT","name":"ROOT-10429","useLocalStorage":false,"recreatable":true,"diskOfferingId":7,"volumeId":10510,"hyperType":"KVM"},"templateUrl":"f23a16e7-b628-429e-83e1-698935588465","wait":0}}]
> }
> 2013-09-17 15:26:46,192 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-1:null) Processing command:
> com.cloud.agent.api.storage.CreateCommand
> 2013-09-17 15:26:46,228 DEBUG [kvm.resource.LibvirtComputingResource]
> (agentRequest-Handler-1:null) Failed to create volume:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'
> 2013-09-17 15:26:46,229 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-1:null) Seq 14-606340094:  { Ans: , MgmtId:
> 161340856362, via: 14, Ver: v1, Flags: 110,
> [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> com.cloud.utils.exception.CloudRuntimeException\nMessage:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> 2013-09-17 15:26:46,271 DEBUG [cloud.agent.Agent]
> (agentRequest-Handler-2:null) Request:Seq 14-606340095:  { Cmd , MgmtId:
> 161340856362, via: 14, Ver: v1, Flags: 100111,
> [{"StopCommand":{"isProxy":false,"vmName":"v-10415-VM","wait":0}}] }
> 
> ======================
> 
> dump from mysql of the "storage_pool" table:
> 
> ======================
> --
> -- Table structure for table `storage_pool`
> --
> 
> DROP TABLE IF EXISTS `storage_pool`;
> /*!40101 SET @saved_cs_client     = @@character_set_client */;
> /*!40101 SET character_set_client = utf8 */;
> CREATE TABLE `storage_pool` (
>   `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
>   `name` varchar(255) DEFAULT NULL COMMENT 'should be NOT NULL',
>   `uuid` varchar(255) DEFAULT NULL,
>   `pool_type` varchar(32) NOT NULL,
>   `port` int(10) unsigned NOT NULL,
>   `data_center_id` bigint(20) unsigned NOT NULL,
>   `pod_id` bigint(20) unsigned DEFAULT NULL,
>   `cluster_id` bigint(20) unsigned DEFAULT NULL COMMENT 'foreign key to
> cluster',
>   `available_bytes` bigint(20) unsigned DEFAULT NULL,
>   `capacity_bytes` bigint(20) unsigned DEFAULT NULL,
>   `host_address` varchar(255) NOT NULL COMMENT 'FQDN or IP of storage
> server',
>   `user_info` varchar(255) DEFAULT NULL COMMENT 'Authorization
> information for the storage pool. Used by network filesystems',
>   `path` varchar(255) NOT NULL COMMENT 'Filesystem path that is shared',
>   `created` datetime DEFAULT NULL COMMENT 'date the pool created',
>   `removed` datetime DEFAULT NULL COMMENT 'date removed if not null',
>   `update_time` datetime DEFAULT NULL,
>   `status` varchar(32) DEFAULT NULL,
>   `storage_provider_id` bigint(20) unsigned DEFAULT NULL,
>   `scope` varchar(255) DEFAULT NULL,
>   PRIMARY KEY (`id`),
>   UNIQUE KEY `id` (`id`),
>   UNIQUE KEY `id_2` (`id`),
>   UNIQUE KEY `uuid` (`uuid`),
>   KEY `i_storage_pool__pod_id` (`pod_id`),
>   KEY `fk_storage_pool__cluster_id` (`cluster_id`),
>   KEY `i_storage_pool__removed` (`removed`),
>   CONSTRAINT `fk_storage_pool__cluster_id` FOREIGN KEY (`cluster_id`)
> REFERENCES `cluster` (`id`),
>   CONSTRAINT `fk_storage_pool__pod_id` FOREIGN KEY (`pod_id`) REFERENCES
> `host_pod_ref` (`id`) ON DELETE CASCADE
> ) ENGINE=InnoDB AUTO_INCREMENT=247 DEFAULT CHARSET=utf8;
> /*!40101 SET character_set_client = @saved_cs_client */;
> 
> --
> -- Dumping data for table `storage_pool`
> --
> 
> LOCK TABLES `storage_pool` WRITE;
> /*!40000 ALTER TABLE `storage_pool` DISABLE KEYS */;
> INSERT INTO `storage_pool` VALUES
> (201,'cs-primary','9c6fd9a3-43e5-389a-9594-faecf178b4b9','NetworkFilesystem',2049,1,1,1,1552364339200,20916432011264,'10.42.1.101',NULL,'/srv\
> /nfs/eng/cs-primary','2013-06-07
> 08:40:58',NULL,NULL,'Up',NULL,NULL),(205,'cn005','48ef7eec-1e42-4ffa-9182-303c8c8883b4','Filesystem',0,1,1,1,4964460785664,5270660358144,'172.\
> 18.128.5',NULL,'/var/lib/libvirt/images/','2013-06-09
> 20:44:10',NULL,NULL,'Up',NULL,NULL),(207,'cn004-10',NULL,'Filesystem',0,1,1,NULL,8117739520,8487899136,'172.18.128.4',NUL\
> L,'/var/lib/libvirt/images/','2013-06-10 06:17:53','2013-06-11
> 21:52:54',NULL,'Maintenance',NULL,NULL),(210,'cn004_grid',NULL,'NetworkFilesystem',2049,1,1,1,1645268992,4868214\
> 7840,'172.18.128.4',NULL,'/grid/1/cloudstack_store','2013-06-10
> 21:48:42','2013-06-20
> 08:53:15',NULL,'Maintenance',NULL,NULL),(215,'cn007','65aab404-6915-44fc-9a5e-c156b663ea67','Filesystem',0,1,1,1,4984320176128,5247872114688,'172.18.128.7',NULL,'/var/lib/libvirt/images/','2013-06-11
> 15:36:11',NULL,NULL,'Up',NULL,NULL),(216,'cn004-10','dfe2fa90-70fc-4d87-a314-0c7eab429d08','Filesystem',0,1,1,1,5270461812736,5270660358144,'172.18.128.4',NULL,'/var/lib/libvirt/images/','2013-06-11
> 21:54:44',NULL,NULL,'Up',NULL,NULL),(217,'cn003-10','3ea2c222-98fe-4ba9-a83c-c6d12eed1186','Filesystem',0,1,1,1,5232745308160,5270660358144,'172.18.128.3',NULL,'/var/lib/libvirt/images/','2013-06-11
> 22:03:17',NULL,NULL,'Up',NULL,NULL),(218,'cn008','52fd1e05-5153-4e16-94e9-7c851855a3fb','Filesystem',0,1,1,1,5073231945728,5270660358144,'172.18.128.8',NULL,'/var/lib/libvirt/images/','2013-06-11
> 22:09:38',NULL,NULL,'Up',NULL,NULL),(219,'cn009','e6c4ed93-d0ee-429a-a44f-e39f7ece4356','Filesystem',0,1,1,1,5183913791488,5270660358144,'172.18.128.9',NULL,'/var/lib/libvirt/images/','2013-06-11
> 22:14:52',NULL,NULL,'Up',NULL,NULL),(220,'cn010','b8398363-b0d0-4768-870f-b50033baa5dc','Filesystem',0,1,1,1,5242997583872,5270660358144,'172.18.128.10',NULL,'/var/lib/libvirt/images/','2013-06-11
> 22:25:25',NULL,NULL,'Up',NULL,NULL),(221,'cn006','59340ae4-22be-46a6-94d0-f4e44ac74885','Filesystem',0,1,1,1,5251206721536,5270660358144,'172.18.128.6',NULL,'/var/lib/libvirt/images/','2013-06-11
> 22:45:09',NULL,NULL,'Up',NULL,NULL),(222,'cn011',NULL,'Filesystem',0,1,1,NULL,8122257408,8487899136,'172.18.128.11',NULL,'/var/lib/libvirt/images/','2013-06-19
> 03:09:37','2013-06-19
> 03:15:36',NULL,'Maintenance',NULL,NULL),(223,'cn011','ca666329-0081-48c1-837f-4181fdf60cfd','Filesystem',0,1,1,2,5229988343808,5270660358144,'172.18.128.11',NULL,'/var/lib/libvirt/images/','2013-06-20
> 07:25:39',NULL,NULL,'Up',NULL,NULL),(224,'cn012','60be4d38-8b57-491b-8d4c-cd2eb54fb815','Filesystem',0,1,1,2,5142698045440,5270660358144,'172.18.128.12',NULL,'/var/lib/libvirt/images/','2013-06-20
> 08:10:19',NULL,NULL,'Up',NULL,NULL),(225,'cn014','2e19dae5-79e2-4ec1-b280-5396fd695c22','Filesystem',0,1,1,2,5140740456448,5270660358144,'172.18.128.14',NULL,'/var/lib/libvirt/images/','2013-06-20
> 08:11:07',NULL,NULL,'Up',NULL,NULL),(226,'cn013','09528b9b-c5a9-4bd3-b9fe-fc31ff46afb2','Filesystem',0,1,1,2,5055306797056,5270660358144,'172.18.128.13',NULL,'/var/lib/libvirt/images/','2013-06-20
> 08:11:14',NULL,NULL,'Up',NULL,NULL),(227,'cn015','420c3008-8de7-4106-807a-eb2c86b4c261','Filesystem',0,1,1,2,5187185598464,5270660358144,'172.18.128.15',NULL,'/var/lib/libvirt/images/','2013-06-20
> 08:11:19',NULL,NULL,'Up',NULL,NULL),(228,'cn016','2cafc2d9-91da-405e-92c6-90b13cd8b068','Filesystem',0,1,1,2,5270461952000,5270660358144,'172.18.128.16',NULL,'/var/lib/libvirt/images/','2013-06-20
> 08:11:45',NULL,NULL,'Up',NULL,NULL),(229,'cn017','22dff242-f780-4522-95f5-c01ac62c197c','Filesystem',0,1,1,2,5039361929216,5270660358144,'172.18.128.17',NULL,'/var/lib/libvirt/images/','2013-06-20
> 08:12:00',NULL,NULL,'Up',NULL,NULL),(230,'cn018','31b5a0f2-0ea9-47a1-971c-4330539489c7','Filesystem',0,1,1,2,5014768701440,5270660358144,'172.18.128.18',NULL,'/var/lib/libvirt/images/','2013-06-20
> 08:12:22',NULL,NULL,'Up',NULL,NULL),(231,'cn019','a28eca04-09c0-4a42-b3a0-aa075fccb154','Filesystem',0,1,1,2,5270461812736,5270660358144,'172.18.128.19',NULL,'/var/lib/libvirt/images/','2013-06-20
> 08:17:30',NULL,NULL,'Up',NULL,NULL),(232,'cn020','dfc5d6e4-0f27-4692-8e94-1c89a9410e82','Filesystem',0,1,1,2,4790488539136,5270660358144,'172.18.128.20',NULL,'/var/lib/libvirt/images/','2013-06-20
> 08:17:51',NULL,NULL,'Up',NULL,NULL),(233,'cn061-10',NULL,'Filesystem',0,1,1,NULL,47272779776,48682147840,'172.18.128.61',NULL,'/var/lib/libvirt/images/','2013-07-01
> 15:11:22','2013-07-24
> 01:12:16',NULL,'Maintenance',NULL,NULL),(234,'cn061-10','c01a2cb9-239b-4d0b-b484-886065d888c2','Filesystem',0,1,1,3,5181433708544,5270660358144,'172.18.128.61',NULL,'/var/lib/libvirt/images/','2013-07-24
> 01:16:13',NULL,NULL,'Up',NULL,NULL),(235,'cn062-10','4c5f9b7f-968f-48be-a9c2-2ae2f11d8967','Filesystem',0,1,1,3,5030513332224,5270660358144,'172.18.128.62',NULL,'/var/lib/libvirt/images/','2013-07-24
> 01:46:40',NULL,NULL,'Up',NULL,NULL),(236,'cn063-10','c9a579ec-ed2f-41b4-b89e-c47bc346c4c3','Filesystem',0,1,1,3,4963781529600,5270660358144,'172.18.128.63',NULL,'/var/lib/libvirt/images/','2013-07-24
> 05:16:29',NULL,NULL,'Up',NULL,NULL),(237,'cn065-10','be4c89a9-8b9c-4161-8955-5db998c58e34','Filesystem',0,1,1,3,5029360099328,5270660358144,'172.18.128.65',NULL,'/var/lib/libvirt/images/','2013-07-24
> 05:35:43',NULL,NULL,'Up',NULL,NULL),(238,'cn064-10','180150d3-cf66-4156-acb9-9338e5294fbc','Filesystem',0,1,1,3,4882664796160,5270660358144,'172.18.128.64',NULL,'/var/lib/libvirt/images/','2013-07-24
> 05:37:31',NULL,NULL,'Up',NULL,NULL),(239,'cn067-10','63aa8d84-34c0-4f1e-a66a-247dba851da2','Filesystem',0,1,1,3,5182267789312,5270660358144,'172.18.128.67',NULL,'/var/lib/libvirt/images/','2013-07-24
> 05:46:12',NULL,NULL,'Up',NULL,NULL),(240,'cn066-10','b24c1265-3b3c-4aac-bebe-d689961af4bf','Filesystem',0,1,1,3,5207416717312,5270660358144,'172.18.128.66',NULL,'/var/lib/libvirt/images/','2013-07-24
> 05:48:58',NULL,NULL,'Up',NULL,NULL),(241,'cn068-10','d34ca2fe-2323-4f1d-bf49-282705e188ef','Filesystem',0,1,1,3,5159436877824,5270660358144,'172.18.128.68',NULL,'/var/lib/libvirt/images/','2013-07-24
> 05:59:22',NULL,NULL,'Up',NULL,NULL),(242,'cn069-10','5227b052-ec01-4fa8-afa1-27877f79818a','Filesystem',0,1,1,3,5111256465408,5270660358144,'172.18.128.69',NULL,'/var/lib/libvirt/images/','2013-07-24
> 06:01:52',NULL,NULL,'Up',NULL,NULL),(243,'cn070-10','c28fcfc0-c443-452d-959c-9fa5d01b57e4','Filesystem',0,1,1,3,4914289025024,5270660358144,'172.18.128.70',NULL,'/var/lib/libvirt/images/','2013-07-24
> 06:05:23',NULL,NULL,'Up',NULL,NULL),(244,'cn071-10','fe972842-d227-4eff-9730-3c4043842efb','Filesystem',0,1,1,4,5054019776512,5270660358144,'172.18.128.71',NULL,'/var/lib/libvirt/images/','2013-07-24
> 06:14:36',NULL,NULL,'Up',NULL,NULL),(245,'cn072-10','9dae6eff-6c2d-4091-88f1-682e23bc4424','Filesystem',0,1,1,4,5228991623168,5270660358144,'172.18.128.72',NULL,'/var/lib/libvirt/images/','2013-07-24
> 06:16:55',NULL,NULL,'Up',NULL,NULL),(246,'cn073-10','937f263b-1a14-488c-be5c-ba19e9a598aa','Filesystem',0,1,1,4,8107274240,8487899136,'172.18.128.73',NULL,'/var/lib/libvirt/images/','2013-09-17
> 06:55:01',NULL,NULL,'Up',NULL,NULL);
> /*!40000 ALTER TABLE `storage_pool` ENABLE KEYS */;
> 
> ======================
> 
> 
> On Tue, Sep 17, 2013 at 1:41 AM, Kirk Kosinski <kirkkosinski@gmail.com> wrote:
> 
>     Hi, here is the error:
> 
>     2013-09-16 15:08:17,168 DEBUG [agent.transport.Request]
>     (AgentManager-Handler-5:null) Seq 13-931004532: Processing:  { Ans: ,
>     MgmtId: 161340856362, via: 13, Ver: v1, Flags: 110,
>     [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
>     com.cloud.utils.exception.CloudRuntimeException\nMessage:
>     org.libvirt.LibvirtException: Storage volume not found: no storage vol
>     with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
>     com.cloud.utils.exception.CloudRuntimeException:
>     org.libvirt.LibvirtException: Storage volume not found: no storage vol
>     with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
>     com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
>     com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
>     com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
>     com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
>     com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
>     com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
>     com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
>     com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
>     java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
> 
>     I'm not certain what volume it is complaining about, but I suspect
>     secondary storage.  Log on to a host (in particular host 13 [1] since it
>     is confirmed to suffer from the issue) and try to manually mount the
>     full path of the directory with the system VM template of the secondary
>     storage NFS share [2].  The idea is to confirm the share and
>     subdirectories of the share are mountable.  Maybe during the maintenance
>     some hosts changed IPs and/or the secondary storage NFS share
>     permissions (or other settings) were messed up.
> 
>     If the mount doesn't work, fix whatever is causing it.  If it does work,
>     please collect additional info.  Enable DEBUG logging on the hosts [3]
>     (if necessary), wait for the error to occur, and upload the agent.log
>     from the host with the error.  It should have more details besides the
>     exception shown in the management-server.log.  If you have a lot of
>     hosts and don't want to enable DEBUG logging on every one, temporarily
>     disable most of them and do it on the remaining few.
> 
>     Best regards,
>     Kirk
> 
>     [1] "13" is the id of the host in the CloudStack database, so find out
>     which host it is with:
>     select * from `cloud`.`host` where id = 13 \G
> 
>     [2] Something like:
>     nfshost:/share/template/tmpl/2/123
> 
>     [3] In /etc/cloudstack/agent/log4j-cloud.xml, set the Threshold for FILE
>     and com.cloud to DEBUG.  Depending on the CloudStack version, it may or
>     may not be enabled by default, and the path may be /etc/cloud/agent/.
> 
> 
>     On 09/16/2013 07:36 PM, sriharsha work wrote:
>     > Replying on behalf of Matt. We are able to write data to the Nfs
>     drives.
>     > That's not an issue.
>     >
>     > Thanks
>     > Sriharsha
>     >
>     > Sent from my iPhone
>     >
>     >> On Sep 16, 2013, at 19:30, Ahmad Emneina <aemneina@gmail.com> wrote:
>     >>
>     >> Try to mount your primary storage to a compute host and try to
>     write to it.
>     >> Your NFS server might not have come back up properly
>     (settings-wise or all
>     >> the relevant services).
>     >>> On Sep 16, 2013 6:08 PM, "Matt Foley" <mfoley@hortonworks.com> wrote:
>     >>>
>     >>> Thank you Chiradeep.  Log snippet now available as
>     http://apaste.info/qBIB
>     >>> --Matt
>     >>>
>     >>> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal <
>     >>> Chiradeep.Vittal@citrix.com> wrote:
>     >>>
>     >>>> Attachments are stripped. Can you paste (say at
>     http://apaste.info/)
>     >>>>
>     >>>> From: Matt Foley <mfoley@hortonworks.com>
>     >>>> Date: Monday, September 16, 2013 4:58 PM
>     >>>>
>     >>>> We had a planned network outage this weekend, which inadvertently
>     >>> resulted
>     >>>> in making the NFS Shared Primary Storage (used by System VMs)
>     unavailable
>     >>>> for a day and a half.  (Guest VMs use local storage only, but
>     System VMs
>     >>>> use shared storage only.)  Cloudstack was not brought down
>     prior to the
>     >>>> outage.
>     >>>>
>     >>>> After network came back, we gracefully brought down all services
>     >>> including
>     >>>> cloudstack-management, mysql, and NFS, then actually rebooted
>     all servers
>     >>>> in the cluster and the NFS server (to make sure no stale file
>     handles),
>     >>>> then brought up services in the appropriate order.  Also
>     checked mysql
>     >>> for
>     >>>> table corruption, and found none.  Confirmed that the NFS
>     volumes are
>     >>>> mountable from all hosts, and in fact Shared Primary Storage is
>     being
>     >>>> mounted by cloudstack on hosts as usual, under /mnt/<uuid>.
>     >>>>
>     >>>> Nevertheless, when try to bring up the cluster, we fail to
>     start the
>     >>>> system VMs, with errors "InsufficientServerCapacityException:
>     Unable to
>     >>>> create a deployment for VM".  The cause is not really insufficient
>     >>>> capacity, as actual usage of resources is tiny; these error
>     messages are
>     >>>> false explanations of the failure to create primary storage
>     volume for
>     >>> the
>     >>>> System VMs.
>     >>>>
>     >>>> Digging into management-server.log, the core issue seems to be
>     the ~160
>     >>>> line snippet from the log attached to this message as
>     >>>> cloudstack_debug_2013.09.16.log. The only Shared Primary
>     Storage pool is
>     >>>> pool 201, named "cs-primary".  It is mounted on all hosts as
>     >>>> /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.
>      The log
>     >>>> shows the management server correctly identifying a particular
>     host as
>     >>>> being able to access pool 201, then trying to allocate a
>     primary storage
>     >>>> volume using the template with uuid
>     f23a16e7-b628-429e-83e1-698935588465.
>     >>>> It fails, but I cannot tell why.  I suspect its claim that
>     "Template 3
>     >>> has
>     >>>> already been downloaded to pool 201" is false, but I don't know
>     how to
>     >>>> check this (or fix if wrong).
>     >>>>
>     >>>> Any guidance for further debugging or fixing this would be GREATLY
>     >>>> appreciated.
>     >>>> Thanks,
>     >>>> --Matt
>     >>>
>     >>>
> 
> 
> 

Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by Matt Foley <mf...@hortonworks.com>.
Hi,
I've now heard that this problem, of Cloudstack being messed up after
interruption of the NFS shared storage access, is well known.  Does anyone
have a fix or work-around?

Kirk, thanks for your help so far.
Both the master and the host servers can mount both the primary and
secondary stores, and can read and write them.  Neither permissions nor IP
access appear to be broken.
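
For the record, the read/write check was roughly the following, run from
both a host and the master (the primary export is the real one from the
storage_pool table further down; the secondary server and export are
placeholders, since I'm omitting our real values here):

# primary storage (pool 201, "cs-primary")
mkdir -p /tmp/pri-test
mount -t nfs 10.42.1.101:/srv/nfs/eng/cs-primary /tmp/pri-test
touch /tmp/pri-test/write-test && rm /tmp/pri-test/write-test
umount /tmp/pri-test

# secondary storage (substitute the real server and export)
mkdir -p /tmp/sec-test
mount -t nfs <nfs-server>:/<secondary-export> /tmp/sec-test
touch /tmp/sec-test/write-test && rm /tmp/sec-test/write-test
umount /tmp/sec-test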

I also checked the log levels on the hosts, and both FILE and com.cloud
were already set to DEBUG.  I tried setting them to TRACE, but got no
additional useful info.

On the host, I tried just restarting the cloudstack-agent service.  In the
resulting logs, the following snippet occurs.  The best interpretation I
can make of it is that "no storage vol with matching
name 'f23a16e7-b628-429e-83e1-698935588465'" is the key issue, and that
should relate to secondary storage, where the templates are stored.  But
this uuid doesn't seem to be related to the actual secondary storage pool,
whose uuid is b7fd7b11-c0f7-4717-8343-ff6fb9bff860.  The primary storage
pool is uuid 9c6fd9a3-43e5-389a-9594-faecf178b4b9, and it seems to be
properly automatically mounted on all hosts and the master.
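
To dig further on the host itself, this is the check I plan to run next.
I'm assuming the libvirt storage pool on the host is registered under the
primary pool's uuid (which I believe is how CloudStack names it on KVM),
and I'm also assuming a pool-refresh is harmless here, so treat this as a
sketch rather than a confirmed procedure:

# is the template file physically present on the primary storage mount?
ls -l /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9/ | grep f23a16e7

# does libvirt's view of the pool include a volume by that name?
virsh pool-list --all
virsh vol-list 9c6fd9a3-43e5-389a-9594-faecf178b4b9

# ask libvirt to re-scan the directory, in case its volume list is stale
virsh pool-refresh 9c6fd9a3-43e5-389a-9594-faecf178b4b9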

** It concerns me that the secondary storage pool does NOT seem to be
automatically mounted.  Is it supposed to be?  If not, how are the hosts
supposed to find the templates before a System Router VM can even be set
up?

Below is the relevant host agent.log snippet, and also a dump of the
storage_pool table from mysql.

Thanks in advance for any suggestions.
--Matt

======================
2013-09-17 15:26:46,012 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-4:null) Processing command:
com.cloud.agent.api.storage.CreateCommand
2013-09-17 15:26:46,050 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-4:null) Failed to create volume:
com.cloud.utils.exception.CloudRuntimeException:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'
2013-09-17 15:26:46,051 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-4:null) Seq 14-606340093:  { Ans: , MgmtId:
161340856362, via: 14, Ver: v1, Flags: 110,
[{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
com.cloud.utils.exception.CloudRuntimeException\nMessage:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
com.cloud.utils.exception.CloudRuntimeException:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b62\
8-429e-83e1-698935588465'\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
2013-09-17 15:26:46,192 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-1:null) Request:Seq 14-606340094:  { Cmd , MgmtId:
161340856362, via: 14, Ver: v1, Flags: 100111,
[{"storage.CreateCommand":{"volId":10510,"pool":{"id":201,"uuid":"9c6fd9a3-43e5-389a-9594-faecf178b4b9","host":"10.42.1.101","path":"/srv/nfs/eng/cs-primary","port":2049,"type":"NetworkFilesystem"},"diskCharacteristics":{"size":725811200,"tags":[],"type":"ROOT","name":"ROOT-10429","useLocalStorage":false,"recreatable":true,"diskOfferingId":7,"volumeId":10510,"hyperType":"KVM"},"templateUrl":"f23a16e7-b628-429e-83e1-698935588465","wait":0}}]
}
2013-09-17 15:26:46,192 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-1:null) Processing command:
com.cloud.agent.api.storage.CreateCommand
2013-09-17 15:26:46,228 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-1:null) Failed to create volume:
com.cloud.utils.exception.CloudRuntimeException:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'
2013-09-17 15:26:46,229 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-1:null) Seq 14-606340094:  { Ans: , MgmtId:
161340856362, via: 14, Ver: v1, Flags: 110,
[{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
com.cloud.utils.exception.CloudRuntimeException\nMessage:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
com.cloud.utils.exception.CloudRuntimeException:
org.libvirt.LibvirtException: Storage volume not found: no storage vol with
matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
2013-09-17 15:26:46,271 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-2:null) Request:Seq 14-606340095:  { Cmd , MgmtId:
161340856362, via: 14, Ver: v1, Flags: 100111,
[{"StopCommand":{"isProxy":false,"vmName":"v-10415-VM","wait":0}}] }

======================

dump from mysql of the "storage_pool" table:

======================
--
-- Table structure for table `storage_pool`
--

DROP TABLE IF EXISTS `storage_pool`;
/*!40101 SET @saved_cs_client     = @@character_set_client */;
/*!40101 SET character_set_client = utf8 */;
CREATE TABLE `storage_pool` (
  `id` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `name` varchar(255) DEFAULT NULL COMMENT 'should be NOT NULL',
  `uuid` varchar(255) DEFAULT NULL,
  `pool_type` varchar(32) NOT NULL,
  `port` int(10) unsigned NOT NULL,
  `data_center_id` bigint(20) unsigned NOT NULL,
  `pod_id` bigint(20) unsigned DEFAULT NULL,
  `cluster_id` bigint(20) unsigned DEFAULT NULL COMMENT 'foreign key to
cluster',
  `available_bytes` bigint(20) unsigned DEFAULT NULL,
  `capacity_bytes` bigint(20) unsigned DEFAULT NULL,
  `host_address` varchar(255) NOT NULL COMMENT 'FQDN or IP of storage
server',
  `user_info` varchar(255) DEFAULT NULL COMMENT 'Authorization information
for the storage pool. Used by network filesystems',
  `path` varchar(255) NOT NULL COMMENT 'Filesystem path that is shared',
  `created` datetime DEFAULT NULL COMMENT 'date the pool created',
  `removed` datetime DEFAULT NULL COMMENT 'date removed if not null',
  `update_time` datetime DEFAULT NULL,
  `status` varchar(32) DEFAULT NULL,
  `storage_provider_id` bigint(20) unsigned DEFAULT NULL,
  `scope` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `id` (`id`),
  UNIQUE KEY `id_2` (`id`),
  UNIQUE KEY `uuid` (`uuid`),
  KEY `i_storage_pool__pod_id` (`pod_id`),
  KEY `fk_storage_pool__cluster_id` (`cluster_id`),
  KEY `i_storage_pool__removed` (`removed`),
  CONSTRAINT `fk_storage_pool__cluster_id` FOREIGN KEY (`cluster_id`)
REFERENCES `cluster` (`id`),
  CONSTRAINT `fk_storage_pool__pod_id` FOREIGN KEY (`pod_id`) REFERENCES
`host_pod_ref` (`id`) ON DELETE CASCADE
) ENGINE=InnoDB AUTO_INCREMENT=247 DEFAULT CHARSET=utf8;
/*!40101 SET character_set_client = @saved_cs_client */;

--
-- Dumping data for table `storage_pool`
--

LOCK TABLES `storage_pool` WRITE;
/*!40000 ALTER TABLE `storage_pool` DISABLE KEYS */;
INSERT INTO `storage_pool` VALUES
(201,'cs-primary','9c6fd9a3-43e5-389a-9594-faecf178b4b9','NetworkFilesystem',2049,1,1,1,1552364339200,20916432011264,'10.42.1.101',NULL,'/srv\
/nfs/eng/cs-primary','2013-06-07
08:40:58',NULL,NULL,'Up',NULL,NULL),(205,'cn005','48ef7eec-1e42-4ffa-9182-303c8c8883b4','Filesystem',0,1,1,1,4964460785664,5270660358144,'172.\
18.128.5',NULL,'/var/lib/libvirt/images/','2013-06-09
20:44:10',NULL,NULL,'Up',NULL,NULL),(207,'cn004-10',NULL,'Filesystem',0,1,1,NULL,8117739520,8487899136,'172.18.128.4',NUL\
L,'/var/lib/libvirt/images/','2013-06-10 06:17:53','2013-06-11
21:52:54',NULL,'Maintenance',NULL,NULL),(210,'cn004_grid',NULL,'NetworkFilesystem',2049,1,1,1,1645268992,4868214\
7840,'172.18.128.4',NULL,'/grid/1/cloudstack_store','2013-06-10
21:48:42','2013-06-20
08:53:15',NULL,'Maintenance',NULL,NULL),(215,'cn007','65aab404-6915-44fc-9a5e-c156b663ea67','Filesystem',0,1,1,1,4984320176128,5247872114688,'172.18.128.7',NULL,'/var/lib/libvirt/images/','2013-06-11
15:36:11',NULL,NULL,'Up',NULL,NULL),(216,'cn004-10','dfe2fa90-70fc-4d87-a314-0c7eab429d08','Filesystem',0,1,1,1,5270461812736,5270660358144,'172.18.128.4',NULL,'/var/lib/libvirt/images/','2013-06-11
21:54:44',NULL,NULL,'Up',NULL,NULL),(217,'cn003-10','3ea2c222-98fe-4ba9-a83c-c6d12eed1186','Filesystem',0,1,1,1,5232745308160,5270660358144,'172.18.128.3',NULL,'/var/lib/libvirt/images/','2013-06-11
22:03:17',NULL,NULL,'Up',NULL,NULL),(218,'cn008','52fd1e05-5153-4e16-94e9-7c851855a3fb','Filesystem',0,1,1,1,5073231945728,5270660358144,'172.18.128.8',NULL,'/var/lib/libvirt/images/','2013-06-11
22:09:38',NULL,NULL,'Up',NULL,NULL),(219,'cn009','e6c4ed93-d0ee-429a-a44f-e39f7ece4356','Filesystem',0,1,1,1,5183913791488,5270660358144,'172.18.128.9',NULL,'/var/lib/libvirt/images/','2013-06-11
22:14:52',NULL,NULL,'Up',NULL,NULL),(220,'cn010','b8398363-b0d0-4768-870f-b50033baa5dc','Filesystem',0,1,1,1,5242997583872,5270660358144,'172.18.128.10',NULL,'/var/lib/libvirt/images/','2013-06-11
22:25:25',NULL,NULL,'Up',NULL,NULL),(221,'cn006','59340ae4-22be-46a6-94d0-f4e44ac74885','Filesystem',0,1,1,1,5251206721536,5270660358144,'172.18.128.6',NULL,'/var/lib/libvirt/images/','2013-06-11
22:45:09',NULL,NULL,'Up',NULL,NULL),(222,'cn011',NULL,'Filesystem',0,1,1,NULL,8122257408,8487899136,'172.18.128.11',NULL,'/var/lib/libvirt/images/','2013-06-19
03:09:37','2013-06-19
03:15:36',NULL,'Maintenance',NULL,NULL),(223,'cn011','ca666329-0081-48c1-837f-4181fdf60cfd','Filesystem',0,1,1,2,5229988343808,5270660358144,'172.18.128.11',NULL,'/var/lib/libvirt/images/','2013-06-20
07:25:39',NULL,NULL,'Up',NULL,NULL),(224,'cn012','60be4d38-8b57-491b-8d4c-cd2eb54fb815','Filesystem',0,1,1,2,5142698045440,5270660358144,'172.18.128.12',NULL,'/var/lib/libvirt/images/','2013-06-20
08:10:19',NULL,NULL,'Up',NULL,NULL),(225,'cn014','2e19dae5-79e2-4ec1-b280-5396fd695c22','Filesystem',0,1,1,2,5140740456448,5270660358144,'172.18.128.14',NULL,'/var/lib/libvirt/images/','2013-06-20
08:11:07',NULL,NULL,'Up',NULL,NULL),(226,'cn013','09528b9b-c5a9-4bd3-b9fe-fc31ff46afb2','Filesystem',0,1,1,2,5055306797056,5270660358144,'172.18.128.13',NULL,'/var/lib/libvirt/images/','2013-06-20
08:11:14',NULL,NULL,'Up',NULL,NULL),(227,'cn015','420c3008-8de7-4106-807a-eb2c86b4c261','Filesystem',0,1,1,2,5187185598464,5270660358144,'172.18.128.15',NULL,'/var/lib/libvirt/images/','2013-06-20
08:11:19',NULL,NULL,'Up',NULL,NULL),(228,'cn016','2cafc2d9-91da-405e-92c6-90b13cd8b068','Filesystem',0,1,1,2,5270461952000,5270660358144,'172.18.128.16',NULL,'/var/lib/libvirt/images/','2013-06-20
08:11:45',NULL,NULL,'Up',NULL,NULL),(229,'cn017','22dff242-f780-4522-95f5-c01ac62c197c','Filesystem',0,1,1,2,5039361929216,5270660358144,'172.18.128.17',NULL,'/var/lib/libvirt/images/','2013-06-20
08:12:00',NULL,NULL,'Up',NULL,NULL),(230,'cn018','31b5a0f2-0ea9-47a1-971c-4330539489c7','Filesystem',0,1,1,2,5014768701440,5270660358144,'172.18.128.18',NULL,'/var/lib/libvirt/images/','2013-06-20
08:12:22',NULL,NULL,'Up',NULL,NULL),(231,'cn019','a28eca04-09c0-4a42-b3a0-aa075fccb154','Filesystem',0,1,1,2,5270461812736,5270660358144,'172.18.128.19',NULL,'/var/lib/libvirt/images/','2013-06-20
08:17:30',NULL,NULL,'Up',NULL,NULL),(232,'cn020','dfc5d6e4-0f27-4692-8e94-1c89a9410e82','Filesystem',0,1,1,2,4790488539136,5270660358144,'172.18.128.20',NULL,'/var/lib/libvirt/images/','2013-06-20
08:17:51',NULL,NULL,'Up',NULL,NULL),(233,'cn061-10',NULL,'Filesystem',0,1,1,NULL,47272779776,48682147840,'172.18.128.61',NULL,'/var/lib/libvirt/images/','2013-07-01
15:11:22','2013-07-24
01:12:16',NULL,'Maintenance',NULL,NULL),(234,'cn061-10','c01a2cb9-239b-4d0b-b484-886065d888c2','Filesystem',0,1,1,3,5181433708544,5270660358144,'172.18.128.61',NULL,'/var/lib/libvirt/images/','2013-07-24
01:16:13',NULL,NULL,'Up',NULL,NULL),(235,'cn062-10','4c5f9b7f-968f-48be-a9c2-2ae2f11d8967','Filesystem',0,1,1,3,5030513332224,5270660358144,'172.18.128.62',NULL,'/var/lib/libvirt/images/','2013-07-24
01:46:40',NULL,NULL,'Up',NULL,NULL),(236,'cn063-10','c9a579ec-ed2f-41b4-b89e-c47bc346c4c3','Filesystem',0,1,1,3,4963781529600,5270660358144,'172.18.128.63',NULL,'/var/lib/libvirt/images/','2013-07-24
05:16:29',NULL,NULL,'Up',NULL,NULL),(237,'cn065-10','be4c89a9-8b9c-4161-8955-5db998c58e34','Filesystem',0,1,1,3,5029360099328,5270660358144,'172.18.128.65',NULL,'/var/lib/libvirt/images/','2013-07-24
05:35:43',NULL,NULL,'Up',NULL,NULL),(238,'cn064-10','180150d3-cf66-4156-acb9-9338e5294fbc','Filesystem',0,1,1,3,4882664796160,5270660358144,'172.18.128.64',NULL,'/var/lib/libvirt/images/','2013-07-24
05:37:31',NULL,NULL,'Up',NULL,NULL),(239,'cn067-10','63aa8d84-34c0-4f1e-a66a-247dba851da2','Filesystem',0,1,1,3,5182267789312,5270660358144,'172.18.128.67',NULL,'/var/lib/libvirt/images/','2013-07-24
05:46:12',NULL,NULL,'Up',NULL,NULL),(240,'cn066-10','b24c1265-3b3c-4aac-bebe-d689961af4bf','Filesystem',0,1,1,3,5207416717312,5270660358144,'172.18.128.66',NULL,'/var/lib/libvirt/images/','2013-07-24
05:48:58',NULL,NULL,'Up',NULL,NULL),(241,'cn068-10','d34ca2fe-2323-4f1d-bf49-282705e188ef','Filesystem',0,1,1,3,5159436877824,5270660358144,'172.18.128.68',NULL,'/var/lib/libvirt/images/','2013-07-24
05:59:22',NULL,NULL,'Up',NULL,NULL),(242,'cn069-10','5227b052-ec01-4fa8-afa1-27877f79818a','Filesystem',0,1,1,3,5111256465408,5270660358144,'172.18.128.69',NULL,'/var/lib/libvirt/images/','2013-07-24
06:01:52',NULL,NULL,'Up',NULL,NULL),(243,'cn070-10','c28fcfc0-c443-452d-959c-9fa5d01b57e4','Filesystem',0,1,1,3,4914289025024,5270660358144,'172.18.128.70',NULL,'/var/lib/libvirt/images/','2013-07-24
06:05:23',NULL,NULL,'Up',NULL,NULL),(244,'cn071-10','fe972842-d227-4eff-9730-3c4043842efb','Filesystem',0,1,1,4,5054019776512,5270660358144,'172.18.128.71',NULL,'/var/lib/libvirt/images/','2013-07-24
06:14:36',NULL,NULL,'Up',NULL,NULL),(245,'cn072-10','9dae6eff-6c2d-4091-88f1-682e23bc4424','Filesystem',0,1,1,4,5228991623168,5270660358144,'172.18.128.72',NULL,'/var/lib/libvirt/images/','2013-07-24
06:16:55',NULL,NULL,'Up',NULL,NULL),(246,'cn073-10','937f263b-1a14-488c-be5c-ba19e9a598aa','Filesystem',0,1,1,4,8107274240,8487899136,'172.18.128.73',NULL,'/var/lib/libvirt/images/','2013-09-17
06:55:01',NULL,NULL,'Up',NULL,NULL);
/*!40000 ALTER TABLE `storage_pool` ENABLE KEYS */;

======================
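
Also, to test the earlier suspicion that the "Template 3 has already been
downloaded to pool 201" claim may be stale, I intend to look at the
template-to-pool mapping directly in the database.  Something like the
following, assuming the table in this version is named template_spool_ref
('cloud' here is the database name and the default DB user; -p will prompt
for the password; adjust to your setup):

mysql -u cloud -p cloud -e "select * from template_spool_ref where pool_id = 201 and template_id = 3 \G"
mysql -u cloud -p cloud -e "select id, name, removed from vm_template where id = 3 \G"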


On Tue, Sep 17, 2013 at 1:41 AM, Kirk Kosinski <ki...@gmail.com> wrote:

> Hi, here is the error:
>
> 2013-09-16 15:08:17,168 DEBUG [agent.transport.Request]
> (AgentManager-Handler-5:null) Seq 13-931004532: Processing:  { Ans: ,
> MgmtId: 161340856362, via: 13, Ver: v1, Flags: 110,
>
> [{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
> com.cloud.utils.exception.CloudRuntimeException\nMessage:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
> com.cloud.utils.exception.CloudRuntimeException:
> org.libvirt.LibvirtException: Storage volume not found: no storage vol
> with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
>
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
>
> com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
>
> com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
>
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
>
> com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
> com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
> com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
> com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
> java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }
>
> I'm not certain what volume it is complaining about, but I suspect
> secondary storage.  Log on to a host (in particular host 13 [1] since it
> is confirmed to suffer from the issue) and try to manually mount the
> full path of the directory with the system VM template of the secondary
> storage NFS share [2].  The idea is to confirm the share and
> subdirectories of the share are mountable.  Maybe during the maintenance
> some hosts changed IPs and/or the secondary storage NFS share
> permissions (or other settings) were messed up.
>
> If the mount doesn't work, fix whatever is causing it.  If it does work,
> please collect additional info.  Enable DEBUG logging on the hosts [3]
> (if necessary), wait for the error to occur, and upload the agent.log
> from the host with the error.  It should have more details besides the
> exception shown in the management-server.log.  If you have a lot of
> hosts and don't want to enable DEBUG logging on every one, temporarily
> disable most of them and do it on the remaining few.
>
> Best regards,
> Kirk
>
> [1] "13" is the id of the host in the CloudStack database, so find out
> which host it is with:
> select * from `cloud`.`host` where id = 13 \G
>
> [2] Something like:
> nfshost:/share/template/tmpl/2/123
>
> [3] In /etc/cloudstack/agent/log4j-cloud.xml, set the Threshold for FILE
> and com.cloud to DEBUG.  Depending on the CloudStack version, it may or
> may not be enabled by default, and the path may be /etc/cloud/agent/.
>
>
> On 09/16/2013 07:36 PM, sriharsha work wrote:
> > Replying on behalf of Matt. We are able to write data to the Nfs drives.
> > That's not an issue.
> >
> > Thanks
> > Sriharsha
> >
> > Sent from my iPhone
> >
> >> On Sep 16, 2013, at 19:30, Ahmad Emneina <ae...@gmail.com> wrote:
> >>
> >> Try to mount your primary storage to a compute host and try to write to
> it.
> >> Your NFS server might not have come back up properly (settings-wise or
> all
> >> the relevant services).
> >>> On Sep 16, 2013 6:08 PM, "Matt Foley" <mf...@hortonworks.com> wrote:
> >>>
> >>> Thank you Chiradeep.  Log snippet now available as
> http://apaste.info/qBIB
> >>> --Matt
> >>>
> >>> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal <
> >>> Chiradeep.Vittal@citrix.com> wrote:
> >>>
> >>>> Attachments are stripped. Can you paste (say at http://apaste.info/)
> >>>>
> >>>> From: Matt Foley <mf...@hortonworks.com>
> >>>> Date: Monday, September 16, 2013 4:58 PM
> >>>>
> >>>> We had a planned network outage this weekend, which inadvertently
> >>> resulted
> >>>> in making the NFS Shared Primary Storage (used by System VMs)
> unavailable
> >>>> for a day and a half.  (Guest VMs use local storage only, but System
> VMs
> >>>> use shared storage only.)  Cloudstack was not brought down prior to
> the
> >>>> outage.
> >>>>
> >>>> After network came back, we gracefully brought down all services
> >>> including
> >>>> cloudstack-management, mysql, and NFS, then actually rebooted all
> servers
> >>>> in the cluster and the NFS server (to make sure no stale file
> handles),
> >>>> then brought up services in the appropriate order.  Also checked mysql
> >>> for
> >>>> table corruption, and found none.  Confirmed that the NFS volumes are
> >>>> mountable from all hosts, and in fact Shared Primary Storage is being
> >>>> mounted by cloudstack on hosts as usual, under /mnt/<uuid>.
> >>>>
> >>>> Nevertheless, when try to bring up the cluster, we fail to start the
> >>>> system VMs, with errors "InsufficientServerCapacityException: Unable
> to
> >>>> create a deployment for VM".  The cause is not really insufficient
> >>>> capacity, as actual usage of resources is tiny; these error messages
> are
> >>>> false explanations of the failure to create primary storage volume for
> >>> the
> >>>> System VMs.
> >>>>
> >>>> Digging into management-server.log, the core issue seems to be the
> ~160
> >>>> line snippet from the log attached to this message as
> >>>> cloudstack_debug_2013.09.16.log. The only Shared Primary Storage pool
> is
> >>>> pool 201, named "cs-primary".  It is mounted on all hosts as
> >>>> /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.  The log
> >>>> shows the management server correctly identifying a particular host as
> >>>> being able to access pool 201, then trying to allocate a primary
> storage
> >>>> volume using the template with uuid
> f23a16e7-b628-429e-83e1-698935588465.
> >>>> It fails, but I cannot tell why.  I suspect its claim that "Template 3
> >>> has
> >>>> already been downloaded to pool 201" is false, but I don't know how to
> >>>> check this (or fix if wrong).
> >>>>
> >>>> Any guidance for further debugging or fixing this would be GREATLY
> >>>> appreciated.
> >>>> Thanks,
> >>>> --Matt
> >>>
> >>>
>


Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by Kirk Kosinski <ki...@gmail.com>.
Hi, here is the error:

2013-09-16 15:08:17,168 DEBUG [agent.transport.Request]
(AgentManager-Handler-5:null) Seq 13-931004532: Processing:  { Ans: ,
MgmtId: 161340856362, via: 13, Ver: v1, Flags: 110,
[{"storage.CreateAnswer":{"requestTemplateReload":false,"result":false,"details":"Exception:
com.cloud.utils.exception.CloudRuntimeException\nMessage:
org.libvirt.LibvirtException: Storage volume not found: no storage vol
with matching name 'f23a16e7-b628-429e-83e1-698935588465'\nStack:
com.cloud.utils.exception.CloudRuntimeException:
org.libvirt.LibvirtException: Storage volume not found: no storage vol
with matching name 'f23a16e7-b628-429e-83e1-698935588465'\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getVolume(LibvirtStorageAdaptor.java:90)\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStorageAdaptor.getPhysicalDisk(LibvirtStorageAdaptor.java:437)\n\tat
com.cloud.hypervisor.kvm.storage.LibvirtStoragePool.getPhysicalDisk(LibvirtStoragePool.java:123)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.execute(LibvirtComputingResource.java:1279)\n\tat
com.cloud.hypervisor.kvm.resource.LibvirtComputingResource.executeRequest(LibvirtComputingResource.java:1072)\n\tat
com.cloud.agent.Agent.processRequest(Agent.java:525)\n\tat
com.cloud.agent.Agent$AgentRequestHandler.doTask(Agent.java:852)\n\tat
com.cloud.utils.nio.Task.run(Task.java:83)\n\tat
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)\n\tat
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)\n\tat
java.lang.Thread.run(Thread.java:679)\n","wait":0}}] }

I'm not certain what volume it is complaining about, but I suspect
secondary storage.  Log on to a host (in particular host 13 [1], since it
is confirmed to suffer from the issue) and try to manually mount the full
path of the secondary storage NFS directory that contains the system VM
template [2].  The idea is to confirm the share and
subdirectories of the share are mountable.  Maybe during the maintenance
some hosts changed IPs and/or the secondary storage NFS share
permissions (or other settings) were messed up.

If the mount doesn't work, fix whatever is causing it.  If it does work,
please collect additional info.  Enable DEBUG logging on the hosts [3]
(if necessary), wait for the error to occur, and upload the agent.log
from the host with the error.  It should have more details besides the
exception shown in the management-server.log.  If you have a lot of
hosts and don't want to enable DEBUG logging on every one, temporarily
disable most of them and do it on the remaining few.

Best regards,
Kirk

[1] "13" is the id of the host in the CloudStack database, so find out
which host it is with:
select * from `cloud`.`host` where id = 13 \G
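
For example, from the management server's shell ('cloud' is both the
default DB user and the database name; -p will prompt for the password;
check db.properties if yours differ):
mysql -u cloud -p cloud -e "select * from host where id = 13 \G"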

[2] Something like:
nfshost:/share/template/tmpl/2/123
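
For example, keeping the placeholder host and path above and a throwaway
mount point:
mkdir -p /mnt/sysvm-tmpl-test
mount -t nfs nfshost:/share/template/tmpl/2/123 /mnt/sysvm-tmpl-test
ls -l /mnt/sysvm-tmpl-test
umount /mnt/sysvm-tmpl-test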

[3] In /etc/cloudstack/agent/log4j-cloud.xml, set the Threshold for FILE
and com.cloud to DEBUG.  Depending on the CloudStack version, it may or
may not be enabled by default, and the path may be /etc/cloud/agent/.
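
After changing it, restart the agent so the new level takes effect, for
example:
service cloudstack-agent restart
(the service may be called cloud-agent on older releases)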


On 09/16/2013 07:36 PM, sriharsha work wrote:
> Replying on behalf of Matt. We are able to write data to the Nfs drives.
> That's not an issue.
> 
> Thanks
> Sriharsha
> 
> Sent from my iPhone
> 
>> On Sep 16, 2013, at 19:30, Ahmad Emneina <ae...@gmail.com> wrote:
>>
>> Try to mount your primary storage to a compute host and try to write to it.
>> Your NFS server might not have come back up properly (settings-wise or all
>> the relevant services).
>>> On Sep 16, 2013 6:08 PM, "Matt Foley" <mf...@hortonworks.com> wrote:
>>>
>>> Thank you Chiradeep.  Log snippet now available as http://apaste.info/qBIB
>>> --Matt
>>>
>>> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal <
>>> Chiradeep.Vittal@citrix.com> wrote:
>>>
>>>> Attachments are stripped. Can you paste (say at http://apaste.info/)
>>>>
>>>> From: Matt Foley <mf...@hortonworks.com>
>>>> Date: Monday, September 16, 2013 4:58 PM
>>>>
>>>> We had a planned network outage this weekend, which inadvertently
>>> resulted
>>>> in making the NFS Shared Primary Storage (used by System VMs) unavailable
>>>> for a day and a half.  (Guest VMs use local storage only, but System VMs
>>>> use shared storage only.)  Cloudstack was not brought down prior to the
>>>> outage.
>>>>
>>>> After network came back, we gracefully brought down all services
>>> including
>>>> cloudstack-management, mysql, and NFS, then actually rebooted all servers
>>>> in the cluster and the NFS server (to make sure no stale file handles),
>>>> then brought up services in the appropriate order.  Also checked mysql
>>> for
>>>> table corruption, and found none.  Confirmed that the NFS volumes are
>>>> mountable from all hosts, and in fact Shared Primary Storage is being
>>>> mounted by cloudstack on hosts as usual, under /mnt/<uuid>.
>>>>
>>>> Nevertheless, when try to bring up the cluster, we fail to start the
>>>> system VMs, with errors "InsufficientServerCapacityException: Unable to
>>>> create a deployment for VM".  The cause is not really insufficient
>>>> capacity, as actual usage of resources is tiny; these error messages are
>>>> false explanations of the failure to create primary storage volume for
>>> the
>>>> System VMs.
>>>>
>>>> Digging into management-server.log, the core issue seems to be the ~160
>>>> line snippet from the log attached to this message as
>>>> cloudstack_debug_2013.09.16.log. The only Shared Primary Storage pool is
>>>> pool 201, named "cs-primary".  It is mounted on all hosts as
>>>> /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.  The log
>>>> shows the management server correctly identifying a particular host as
>>>> being able to access pool 201, then trying to allocate a primary storage
>>>> volume using the template with uuid f23a16e7-b628-429e-83e1-698935588465.
>>>> It fails, but I cannot tell why.  I suspect its claim that "Template 3
>>> has
>>>> already been downloaded to pool 201" is false, but I don't know how to
>>>> check this (or fix if wrong).
>>>>
>>>> Any guidance for further debugging or fixing this would be GREATLY
>>>> appreciated.
>>>> Thanks,
>>>> --Matt
>>>
>>>

Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by sriharsha work <sr...@gmail.com>.
Replying on behalf of Matt. We are able to write data to the NFS drives.
That's not an issue.

Thanks
Sriharsha

Sent from my iPhone

> On Sep 16, 2013, at 19:30, Ahmad Emneina <ae...@gmail.com> wrote:
>
> Try to mount your primary storage to a compute host and try to write to it.
> Your NFS server might not have come back up properly (settings-wise or all
> the relevant services).
>> On Sep 16, 2013 6:08 PM, "Matt Foley" <mf...@hortonworks.com> wrote:
>>
>> Thank you Chiradeep.  Log snippet now available as http://apaste.info/qBIB
>> --Matt
>>
>> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal <
>> Chiradeep.Vittal@citrix.com> wrote:
>>
>>> Attachments are stripped. Can you paste (say at http://apaste.info/)
>>>
>>> From: Matt Foley <mf...@hortonworks.com>
>>> Date: Monday, September 16, 2013 4:58 PM
>>>
>>> We had a planned network outage this weekend, which inadvertently
>> resulted
>>> in making the NFS Shared Primary Storage (used by System VMs) unavailable
>>> for a day and a half.  (Guest VMs use local storage only, but System VMs
>>> use shared storage only.)  Cloudstack was not brought down prior to the
>>> outage.
>>>
>>> After network came back, we gracefully brought down all services
>> including
>>> cloudstack-management, mysql, and NFS, then actually rebooted all servers
>>> in the cluster and the NFS server (to make sure no stale file handles),
>>> then brought up services in the appropriate order.  Also checked mysql
>> for
>>> table corruption, and found none.  Confirmed that the NFS volumes are
>>> mountable from all hosts, and in fact Shared Primary Storage is being
>>> mounted by cloudstack on hosts as usual, under /mnt/<uuid>.
>>>
>>> Nevertheless, when try to bring up the cluster, we fail to start the
>>> system VMs, with errors "InsufficientServerCapacityException: Unable to
>>> create a deployment for VM".  The cause is not really insufficient
>>> capacity, as actual usage of resources is tiny; these error messages are
>>> false explanations of the failure to create primary storage volume for
>> the
>>> System VMs.
>>>
>>> Digging into management-server.log, the core issue seems to be the ~160
>>> line snippet from the log attached to this message as
>>> cloudstack_debug_2013.09.16.log. The only Shared Primary Storage pool is
>>> pool 201, named "cs-primary".  It is mounted on all hosts as
>>> /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.  The log
>>> shows the management server correctly identifying a particular host as
>>> being able to access pool 201, then trying to allocate a primary storage
>>> volume using the template with uuid f23a16e7-b628-429e-83e1-698935588465.
>>> It fails, but I cannot tell why.  I suspect its claim that "Template 3
>> has
>>> already been downloaded to pool 201" is false, but I don't know how to
>>> check this (or fix if wrong).
>>>
>>> Any guidance for further debugging or fixing this would be GREATLY
>>> appreciated.
>>> Thanks,
>>> --Matt
>>
>>

Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by Ahmad Emneina <ae...@gmail.com>.
Try to mount your primary storage to a compute host and try to write to it.
Your NFS server might not have come back up properly (settings-wise or all
the relevant services).
On Sep 16, 2013 6:08 PM, "Matt Foley" <mf...@hortonworks.com> wrote:

> Thank you Chiradeep.  Log snippet now available as http://apaste.info/qBIB
> --Matt
>
> On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal <
> Chiradeep.Vittal@citrix.com> wrote:
>
> > Attachments are stripped. Can you paste (say at http://apaste.info/)
> >
> > From: Matt Foley <mf...@hortonworks.com>
> > Date: Monday, September 16, 2013 4:58 PM
> >
> > We had a planned network outage this weekend, which inadvertently
> resulted
> > in making the NFS Shared Primary Storage (used by System VMs) unavailable
> > for a day and a half.  (Guest VMs use local storage only, but System VMs
> > use shared storage only.)  Cloudstack was not brought down prior to the
> > outage.
> >
> > After network came back, we gracefully brought down all services
> including
> > cloudstack-management, mysql, and NFS, then actually rebooted all servers
> > in the cluster and the NFS server (to make sure no stale file handles),
> > then brought up services in the appropriate order.  Also checked mysql
> for
> > table corruption, and found none.  Confirmed that the NFS volumes are
> > mountable from all hosts, and in fact Shared Primary Storage is being
> > mounted by cloudstack on hosts as usual, under /mnt/<uuid>.
> >
> > Nevertheless, when try to bring up the cluster, we fail to start the
> > system VMs, with errors "InsufficientServerCapacityException: Unable to
> > create a deployment for VM".  The cause is not really insufficient
> > capacity, as actual usage of resources is tiny; these error messages are
> > false explanations of the failure to create primary storage volume for
> the
> > System VMs.
> >
> > Digging into management-server.log, the core issue seems to be the ~160
> > line snippet from the log attached to this message as
> > cloudstack_debug_2013.09.16.log.  The only Shared Primary Storage pool is
> > pool 201, named "cs-primary".  It is mounted on all hosts as
> > /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.  The log
> > shows the management server correctly identifying a particular host as
> > being able to access pool 201, then trying to allocate a primary storage
> > volume using the template with uuid f23a16e7-b628-429e-83e1-698935588465.
> >  It fails, but I cannot tell why.  I suspect its claim that "Template 3
> has
> > already been downloaded to pool 201" is false, but I don't know how to
> > check this (or fix if wrong).
> >
> > Any guidance for further debugging or fixing this would be GREATLY
> > appreciated.
> > Thanks,
> > --Matt
> >
>
>

Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by Matt Foley <mf...@hortonworks.com>.
Thank you Chiradeep.  Log snippet now available as http://apaste.info/qBIB
--Matt

On Mon, Sep 16, 2013 at 5:19 PM, Chiradeep Vittal <
Chiradeep.Vittal@citrix.com> wrote:

> Attachments are stripped. Can you paste (say at http://apaste.info/)


Re: Help! After network outage, can't start System VMs; focused debug info attached

Posted by Chiradeep Vittal <Ch...@citrix.com>.
Attachments are stripped. Can you paste (say at http://apaste.info/)

From: Matt Foley <mf...@hortonworks.com>
Reply-To: "users@cloudstack.apache.org" <us...@cloudstack.apache.org>
Date: Monday, September 16, 2013 4:58 PM
To: "users@cloudstack.apache.org" <us...@cloudstack.apache.org>
Subject: Help! After network outage, can't start System VMs; focused debug info attached

We had a planned network outage this weekend, which inadvertently resulted in making the NFS Shared Primary Storage (used by System VMs) unavailable for a day and a half.  (Guest VMs use local storage only, but System VMs use shared storage only.)  Cloudstack was not brought down prior to the outage.

After network came back, we gracefully brought down all services including cloudstack-management, mysql, and NFS, then actually rebooted all servers in the cluster and the NFS server (to make sure no stale file handles), then brought up services in the appropriate order.  Also checked mysql for table corruption, and found none.  Confirmed that the NFS volumes are mountable from all hosts, and in fact Shared Primary Storage is being mounted by cloudstack on hosts as usual, under /mnt/<uuid>.

Nevertheless, when we try to bring up the cluster, we fail to start the system VMs, with errors "InsufficientServerCapacityException: Unable to create a deployment for VM".  The cause is not really insufficient capacity, as actual usage of resources is tiny; these error messages are misleading explanations of the real failure, which is creating the primary storage volume for the System VMs.
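
If it helps, one way to get past the generic exception and see the deployment planner's own reasoning is to pull the surrounding context out of the log. A rough sketch (the path assumes a stock package install, so adjust it to wherever your management-server.log actually lives):

    grep -n -B 5 -A 30 "InsufficientServerCapacityException" \
        /var/log/cloudstack/management/management-server.log | less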

Digging into management-server.log, the core issue seems to be the ~160 line snippet from the log attached to this message as cloudstack_debug_2013.09.16.log.  The only Shared Primary Storage pool is pool 201, named "cs-primary".  It is mounted on all hosts as /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9, which is its uuid.  The log shows the management server correctly identifying a particular host as being able to access pool 201, then trying to allocate a primary storage volume using the template with uuid f23a16e7-b628-429e-83e1-698935588465.  It fails, but I cannot tell why.  I suspect its claim that "Template 3 has already been downloaded to pool 201" is false, but I don't know how to check this (or fix if wrong).
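
One way to test that suspicion is to compare the database's record against what is actually on the NFS export. The management server tracks which templates it believes are already seeded onto each primary storage pool in the template_spool_ref table, so a rough sketch (assuming the default "cloud" database and "cloud" user, and the stock column names, which can vary a little between versions) would be:

    # On the management server: does the DB think template 3 is on pool 201?
    mysql -u cloud -p cloud -e "SELECT id, pool_id, template_id, download_state, install_path, state FROM template_spool_ref WHERE pool_id = 201 AND template_id = 3;"

    # On a host that mounts the pool: does the file that row points to really exist?
    ls -l /mnt/9c6fd9a3-43e5-389a-9594-faecf178b4b9/<install_path from the query above>

If the row claims DOWNLOADED but the file is missing or truncated, that mismatch would explain the failure to create the volume; people have recovered from this by clearing or correcting the stale row (with the database backed up first) so the template gets re-copied from secondary storage, but I would verify that approach on a test system before trying it in production.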

Any guidance for further debugging or fixing this would be GREATLY appreciated.
Thanks,
--Matt
NOTICE: This message is intended for the use of the individual or entity to which it is addressed and may contain information that is confidential, privileged and exempt from disclosure under applicable law. If the reader of this message is not the intended recipient, you are hereby notified that any printing, copying, dissemination, distribution, disclosure or forwarding of this communication is strictly prohibited. If you have received this communication in error, please contact the sender immediately and delete it from your system. Thank You.