Posted to users@cloudstack.apache.org by "Richard Klein (RSI)" <rk...@rsitex.com> on 2016/04/16 01:54:14 UTC

Primary storage not mounted on hosts?

I am not sure what happened, but our primary storage, which is Gluster, is no longer mounted on any of our hosts.  When I do "virsh pool-list" on any host I only see the local pool.  Gluster itself is working fine and there are no problems with it, because I can mount the Gluster volume manually on any of the hosts and see the primary storage.  Instances that are running can write data to the local volume and pull data from it, but if a VM is stopped it can't start again.  I get the "Unable to create a New VM - Error message: Unable to start instance due to Unable to get answer that is of class com.cloud.agent.api.StartAnswer" error that I have seen discussed in a thread on this mailing list, and I am sure it's primary-storage related.
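For reference, this is roughly how I am checking each host (the /mnt/gv0-test mount point is just a scratch directory I made up):

----
# List all libvirt pools, including inactive ones; only 'local' shows up:
virsh pool-list --all

# Manually mounting the Gluster volume works, so Gluster itself is healthy:
mkdir -p /mnt/gv0-test
mount -t glusterfs gv0cl1.pod1.aus1.centex.rsitex.com:/gv0cl1 /mnt/gv0-test
ls /mnt/gv0-test
umount /mnt/gv0-test
----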

The agent logs on the hosts contain the following snippets, which confirm it's looking for the primary storage:

2016-04-15 18:42:34,838 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-3:null) (logid:ad8ec05a) Trying to fetch storage pool c3991ea2-b702-3b1b-bfc5-69cb7d928554 from libvirt
2016-04-15 18:45:19,006 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:4c396753) Trying to fetch storage pool c3991ea2-b702-3b1b-bfc5-69cb7d928554 from libvirt
2016-04-15 18:45:49,010 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:4c396753) Trying to fetch storage pool c3991ea2-b702-3b1b-bfc5-69cb7d928554 from libvirt

c3991ea2-b702-3b1b-bfc5-69cb7d928554 is the UUID of our primary storage.
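(In case anyone wants to double-check a UUID like this, I believe it can be matched against the storage_pool table in the management server's "cloud" database; a sketch, assuming the usual table and column names:)

----
mysql -u root -p cloud -e "SELECT id, name, uuid, pool_type, host_address, path
  FROM storage_pool WHERE uuid='c3991ea2-b702-3b1b-bfc5-69cb7d928554';"
----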

We did have some secondary storage issues (NFS) that caused some NFS mounts to secondary storage to hang.  The only way to recover was to reboot the hosts.  There were two hosts affected, so I put each host in maintenance mode, rebooted it, and then canceled maintenance mode, one host at a time.  It seems like I have had issues ever since this happened.

Is there a way to get the primary storage remounted and added to the libvirt pool list while keeping the VMs up and running?  At this point the only recovery idea I have is to power off all VMs, disable the primary storage, then enable it again.  That is a little extreme and a last resort, but I don't know what other options I have.
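The closest thing I can think of is defining the pool by hand in libvirt.  My understanding is that CloudStack creates a netfs-style pool for Gluster, so a sketch of what I might try looks like the following (the XML is my guess; it would be safest to compare against "virsh pool-dumpxml" on a host where the pool is still defined):

----
cat > /tmp/gv0-pool.xml <<'EOF'
<pool type='netfs'>
  <name>c3991ea2-b702-3b1b-bfc5-69cb7d928554</name>
  <uuid>c3991ea2-b702-3b1b-bfc5-69cb7d928554</uuid>
  <source>
    <host name='gv0cl1.pod1.aus1.centex.rsitex.com'/>
    <dir path='/gv0cl1'/>
    <format type='glusterfs'/>
  </source>
  <target>
    <path>/mnt/c3991ea2-b702-3b1b-bfc5-69cb7d928554</path>
  </target>
</pool>
EOF
# pool-create starts the pool without persisting a definition:
virsh pool-create /tmp/gv0-pool.xml
----

But I don't know whether the agent would be happy with a hand-created pool, so I would rather hear if there is a supported way.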

Any suggestions?


Richard Klein  <rk...@rsitex.com> 
RSI 
5426 Guadalupe, Suite 100 
Austin TX 78751 



RE: Primary storage not mounted on hosts?

Posted by "Richard Klein (RSI)" <rk...@rsitex.com>.
Thanks for the advice.  I found the problem and got it resolved.  While restarting the agent (with debug enabled per your suggestion) I did a tail/grep using the UUID of the primary storage and discovered that during the mount/add-to-libvirt process it was getting an I/O error on the UUID of a QCOW2 volume.  Below is a snippet from the tail/grep.  So I stopped the agent, mounted the primary storage manually and tried to copy the file named in the log.  Sure enough, I got an I/O error.  I then copied some other random small files and they were OK, so it appeared that this one volume was corrupt.
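In case it helps anyone else, this is roughly how I confirmed the volume was unreadable (the filename is the one from the log below, and the path assumes the storage is mounted at the usual /mnt/<pool-uuid> location):

----
cd /mnt/c3991ea2-b702-3b1b-bfc5-69cb7d928554
# Any attempt to read the file reproduces the I/O error:
dd if=52a6130f-266e-4667-8ca2-89f932a0b254 of=/dev/null bs=1M
# qemu-img fails on the header the same way libvirt does:
qemu-img info 52a6130f-266e-4667-8ca2-89f932a0b254
----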

I looked up the volume UUID in the volumes table and found the instance it belonged to, which was a stopped VR.  I destroyed the VR and started the agent.  I still got the I/O error because the volume file was still there (it probably hadn't gone through the expunge process yet).  So I stopped the agent, manually moved the file to a temp directory and then started the agent.  Everything worked normally after that: the agent added the primary storage and started to bring up VRs.  I then restarted the agents on all the hosts and everything started working again.
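In summary, the recovery sequence was roughly this (the temp mount point and quarantine directory names are just what I picked; double-check the volume UUID against the volumes table before moving anything):

----
service cloudstack-agent stop
mkdir -p /mnt/gv0-tmp /root/quarantine
mount -t glusterfs gv0cl1.pod1.aus1.centex.rsitex.com:/gv0cl1 /mnt/gv0-tmp
mv /mnt/gv0-tmp/52a6130f-266e-4667-8ca2-89f932a0b254 /root/quarantine/
umount /mnt/gv0-tmp
service cloudstack-agent start
----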

It behaved as if, during the process of adding the pool to libvirt, all of the volumes are examined to gather information about them.  Because this one volume was corrupt, the pool could not be added.  At least that is my theory.
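If that theory is right, the failure should be reproducible outside the agent; a hypothetical check, assuming the pool is already defined in libvirt:

----
# Starting (or refreshing) the pool makes libvirt probe every volume header
# under the target path, so a single unreadable file fails the whole pool:
virsh pool-start c3991ea2-b702-3b1b-bfc5-69cb7d928554
# expected: error: cannot read header '...52a6130f-...': Input/output error
----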

I do still have one problem: the system VMs are stuck in the Starting state, I think due to the timing of the agent restarts.  When I look on the host they are "starting" on, I don't see them with the "virsh list" command.  I am going to give them time in case it's a workload issue, but if they are still starting after an hour or so I will probably change their database status to Stopped and then recreate them.
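If I do end up flipping them in the database, it would be something like the following (a sketch only; the VM names are made up, and I would stop the management server and back up the database first):

----
# Mark the stuck system VMs as Stopped so they can be restarted/recreated.
mysql -u root -p cloud -e "UPDATE vm_instance SET state='Stopped'
  WHERE state='Starting' AND name IN ('s-1-VM','v-2-VM');"
----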

Thanks for the help!

Here is the agent log snippet:
----
tail -f /var/log/cloudstack/agent/agent.log | grep "c3991ea2\-b702\-3b1b\-bfc5\-69cb7d928554"
2016-04-16 10:43:00,245 DEBUG [cloud.agent.Agent] (agentRequest-Handler-1:null) (logid:30562dd3) Request:Seq 46-5281314988022038529:  { Cmd , MgmtId: 345049993464, via: 46, Ver: v1, Flags: 100011, [{"com.cloud.agent.api.ModifyStoragePoolCommand":{"add":true,"pool":{"id":5,"uuid":"c3991ea2-b702-3b1b-bfc5-69cb7d928554","host":"gv0cl1.pod1.aus1.centex.rsitex.com","path":"/gv0cl1","port":24007,"type":"Gluster"},"localPath":"/mnt//c3991ea2-b702-3b1b-bfc5-69cb7d928554","wait":0}}] }
2016-04-16 10:43:00,318 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:30562dd3) Attempting to create storage pool c3991ea2-b702-3b1b-bfc5-69cb7d928554 (Gluster) in libvirt
2016-04-16 10:43:00,322 WARN  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:30562dd3) Storage pool c3991ea2-b702-3b1b-bfc5-69cb7d928554 was not found running in libvirt. Need to create it.
2016-04-16 10:43:00,322 INFO  [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:30562dd3) Didn't find an existing storage pool c3991ea2-b702-3b1b-bfc5-69cb7d928554 by UUID, checking for pools with duplicate paths
2016-04-16 10:43:00,325 DEBUG [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:30562dd3) Attempting to create storage pool c3991ea2-b702-3b1b-bfc5-69cb7d928554
<name>c3991ea2-b702-3b1b-bfc5-69cb7d928554</name>
<uuid>c3991ea2-b702-3b1b-bfc5-69cb7d928554</uuid>
<path>/mnt/c3991ea2-b702-3b1b-bfc5-69cb7d928554</path>
2016-04-16 10:43:00,775 ERROR [kvm.storage.LibvirtStorageAdaptor] (agentRequest-Handler-1:null) (logid:30562dd3) org.libvirt.LibvirtException: cannot read header '/mnt/c3991ea2-b702-3b1b-bfc5-69cb7d928554/52a6130f-266e-4667-8ca2-89f932a0b254': Input/output error
org.libvirt.LibvirtException: cannot read header '/mnt/c3991ea2-b702-3b1b-bfc5-69cb7d928554/52a6130f-266e-4667-8ca2-89f932a0b254': Input/output error

----

Richard Klein  <rk...@rsitex.com> 
RSI 
5426 Guadalupe, Suite 100 
Austin TX 78751 
RSI Help Desk:  (512) 334-3334 
Phone:  (512) 275-0358 
Fax:  (512)  328-3410







Re: Primary storage not mounted on hosts?

Posted by Simon Weller <sw...@ena.com>.
Richard,

The Cloudstack-agent should populate the libvirt pool-list when it starts up.
Have you tried restarting libvirtd and then restarting the Cloudstack-agent?

You may want to turn up debugging on the agent so you get some more detail on what's going on.
You can do this by modifying /etc/cloudstack/agent/log4j-cloud.xml
See this wiki article for more details: https://cwiki.apache.org/confluence/display/CLOUDSTACK/KVM+agent+debug
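Roughly something like this (the exact threshold element in log4j-cloud.xml varies by version, so check the wiki page):

----
service libvirtd restart
service cloudstack-agent restart

# To raise agent logging, edit /etc/cloudstack/agent/log4j-cloud.xml and
# change the file appender's threshold from INFO to DEBUG, for example:
#   <param name="Threshold" value="INFO"/>  ->  <param name="Threshold" value="DEBUG"/>
# then restart the agent again.
----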

- Si
