Posted to users@cloudstack.apache.org by Nik Martin <ni...@nfinausa.com> on 2012/10/02 22:12:23 UTC

Storage failure is not handled well in CS

I have two SANs connected to CS as primary storage.  One is an HD-based 
SAN with a single target and LUN, and the other is an SSD SAN split 
into two volumes, each connected with its own target and LUN.  The HD 
SAN is where all system VMs are stored (or they were before I added the 
SSD SAN, but I have no idea where the system VM volumes are stored 
now).  This morning I had to do a semi-emergency shutdown of the SSD 
SAN, so I put both of its LUNs into maintenance mode in CS.  CS shut 
down the entire cloud, not just the volumes stored on the SSD SAN.  The 
SAN is offline and CS shows it in maintenance mode, but NO VMs will 
start, and the CS management log shows:

onnecting; event = AgentDisconnected; new status = Alert; old update 
count = 959; new update count = 960]
2012-10-02 15:10:40,370 DEBUG [agent.manager.ClusteredAgentManagerImpl] 
(AgentTaskPool-2:null) Notifying other nodes of to disconnect
2012-10-02 15:10:40,370 WARN  [cloud.resource.ResourceManagerImpl] 
(AgentTaskPool-2:null) Unable to connect due to
com.cloud.exception.ConnectionException: Unable to connect to pool 
Pool[204|IscsiLUN]
	at
	at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:603)
	at java.lang.Thread.run(Thread.java:679)
Caused by: com.cloud.exception.StorageUnavailableException: Resource 
[StoragePool:204] is unreachable: Unable establish connection from 
storage head to storage pool 204 due to ModifyStoragePoolCommand add 
XenAPIException:Can not see storage pool: 
cfd3b016-d4d9-3bb9-b1f9-f31374c44185 from on 
host:82cad07f-6fbc-464e-86fe-28bb4af4bbcd 
host:82cad07f-6fbc-464e-86fe-28bb4af4bbcd pool: 
172.16.10.15/iqn.2012-01:com.nfinausa.san2:mirror0/0
	at 
com.cloud.storage.StorageManagerImpl.connectHostToSharedPool(StorageManagerImpl.java:1567)
	at 
com.cloud.storage.listener.StoragePoolMonitor.processConnect(StoragePoolMonitor.java:88)
	... 8 more
2012-10-02 15:10:40,371 DEBUG [cloud.host.Status] (AgentTaskPool-2:null) 
Transition:[Resource state = Enabled, Agent event = AgentDisconnected, 
Host id = 6, name = hv1]
2012-10-02 15:10:40,375 DEBUG [cloud.host.Status] (AgentTaskPool-2:null) 
Agent status update: [id = 6; name = hv1; old status = Alert; event = 
AgentDisconnected; new status = Alert; old update count = 960; new 
update count = 961]


host:82cad07f-6fbc-464e-86fe-28bb4af4bbcd pool: 
172.16.10.15/iqn.2012-01:com.nfinausa.san2:mirror0/1 is the SAN that is 
in maintenance mode, so why is CS still trying to connect?  All my HVs 
are in Alert state because of this.

-- 
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability

Re: Storage failure is not handled well in CS

Posted by Matthew Patton <mp...@inforelay.com>.
On Wed, 03 Oct 2012 09:51:33 -0400, Nik Martin <ni...@nfinausa.com>  
wrote:

> Bump?  This is a serious issue that I need to get resolved.  An entire  
> cloud going down while one SAN is being repaired is a bad thing.

Don't get too upset - it's par for the course for products written by 
Citrix programmers.  Citrix XenServer (the hypervisor) has a similar 
fatal flaw: if the first IP of a multi-IP SAN array is unreachable 
(e.g. a bad port or a controller failure), it won't reconnect using any 
of the other IPs.  It just marks the volume dead, and you literally 
have to DELETE the volume object and redefine it.

The only hypervisor that isn't complete sh*t is ESX. So far I haven't  
found a cloud environment that wasn't riddled with incredibly naive  
assumptions.

-- 
Cloud Services Architect, Senior System Administrator
InfoRelay Online Systems (www.inforelay.com)

Re: Storage failure is not handled well in CS

Posted by Nik Martin <ni...@nfinausa.com>.
On 10/03/2012 02:11 PM, Anthony Xu wrote:
> It is a bug, please file a bug,
>
> You can try following workaround,
> In mysql
> Update storage_pool set removed=now() where id= "primary storage id you put into maintenance mode"
>
>
> Anthony
>
>
Will do.  This is kind of a blocker in my book.

Regards,

Nik


-- 
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability

Re: Storage failure is not handled well in CS

Posted by Nik Martin <ni...@nfinausa.com>.
On 10/03/2012 03:04 PM, Alex Huang wrote:
>
>
>> -----Original Message-----
>> From: Nik Martin [mailto:nik.martin@nfinausa.com]
>> Sent: Wednesday, October 03, 2012 12:27 PM
>> To: cloudstack-users@incubator.apache.org
>> Subject: Re: Storage failure is not handled well in CS
>>
>> https://issues.apache.org/jira/browse/CLOUDSTACK-251
>>
>> On 10/03/2012 02:11 PM, Anthony Xu wrote:
>>> It is a bug, please file a bug,
>>>
>>> You can try following workaround,
>>> In mysql
>>> Update storage_pool set removed=now() where id= "primary storage id
>> you put into maintenance mode"
>
> Martin,
>
> This should not happen.  I've raised the priority on the bug itself.
>
> I want to make sure you understand the implications of the sql Anthony sent.  This will temporarily remove the storage_pool from CloudStack because removed column is not null means CloudStack data access layer won't even retrieve the storage pool.  There may be things that appear broken while this is true.  For example, when you retrieve volumes stored on that storage pool, cloudstack won't be able to retrieve any information on the storage pool because it can't see it.  To get it back in working order, you have to "update storage_pool set removed=null where id = [primary storage pool id]".
>
> You should still put the storage pool into maintenance mode before you run Anthony's sql.
>
> --Alex
>
Alex,

Thank you for the update and feedback.
-- 
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability

RE: Storage failure is not handled well in CS

Posted by Alex Huang <Al...@citrix.com>.

> -----Original Message-----
> From: Nik Martin [mailto:nik.martin@nfinausa.com]
> Sent: Wednesday, October 03, 2012 12:27 PM
> To: cloudstack-users@incubator.apache.org
> Subject: Re: Storage failure is not handled well in CS
> 
> https://issues.apache.org/jira/browse/CLOUDSTACK-251
> 
> On 10/03/2012 02:11 PM, Anthony Xu wrote:
> > It is a bug, please file a bug,
> >
> > You can try following workaround,
> > In mysql
> > Update storage_pool set removed=now() where id= "primary storage id
> you put into maintenance mode"

Martin, 

This should not happen.  I've raised the priority on the bug itself.

I want to make sure you understand the implications of the SQL Anthony sent.  It will temporarily remove the storage pool from CloudStack: because the removed column is non-null, CloudStack's data access layer won't retrieve the storage pool at all.  Some things may appear broken while this is true.  For example, when you retrieve volumes stored on that storage pool, CloudStack won't be able to return any information about the pool, because it can't see it.  To get it back in working order, run "update storage_pool set removed=null where id = [primary storage pool id]".

You should still put the storage pool into maintenance mode before you run Anthony's sql.
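The soft-delete mechanics described above can be sketched with a toy table. This is purely illustrative (an in-memory sqlite stand-in with a made-up, simplified schema, not CloudStack's actual schema or data access code), but it shows why a non-null `removed` column hides the pool and why clearing it brings the pool back:

```python
import sqlite3

# Toy stand-in for CloudStack's storage_pool table (simplified schema;
# the real table has many more columns).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE storage_pool "
           "(id INTEGER PRIMARY KEY, name TEXT, removed TEXT)")
db.execute("INSERT INTO storage_pool VALUES (204, 'IscsiLUN', NULL)")

def visible_pools():
    # Mimics a data access layer that ignores soft-deleted rows.
    return [row[0] for row in
            db.execute("SELECT id FROM storage_pool WHERE removed IS NULL")]

print(visible_pools())  # -> [204]: the pool is visible

# The workaround: set `removed`, and the pool disappears from view.
db.execute("UPDATE storage_pool SET removed = datetime('now') WHERE id = 204")
print(visible_pools())  # -> []: the pool is no longer seen

# The revert: clear `removed` to restore the pool.
db.execute("UPDATE storage_pool SET removed = NULL WHERE id = 204")
print(visible_pools())  # -> [204]: visible again
```

While the row is hidden, anything that joins against storage_pool (volume listings, for example) comes up empty for that pool, which is the "things may appear broken" behavior warned about above.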

--Alex

Re: Storage failure is not handled well in CS

Posted by Nik Martin <ni...@nfinausa.com>.
https://issues.apache.org/jira/browse/CLOUDSTACK-251

On 10/03/2012 02:11 PM, Anthony Xu wrote:
> It is a bug, please file a bug,
>
> You can try following workaround,
> In mysql
> Update storage_pool set removed=now() where id= "primary storage id you put into maintenance mode"
>
>
> Anthony
>
>


-- 
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability

RE: Storage failure is not handled well in CS

Posted by Anthony Xu <Xu...@citrix.com>.
It is a bug; please file one.

You can try the following workaround.  In MySQL:

update storage_pool set removed=now() where id = [primary storage id you put into maintenance mode];


Anthony



Re: Storage failure is not handled well in CS

Posted by Nik Martin <ni...@nfinausa.com>.
On 10/03/2012 12:03 PM, Ahmad Emneina wrote:
> Hey Nik,
>
> It appears the compute host, or cluster, cant connect to the SAN
> referenced below. Have you peered into the compute hosts logs, they should
> be more informative as to why it cant connect the storage. You should also
> have at least one storage pool up to be able to provision against.
>

Ahmad, if you reference my original post to the list: I have two SANs, 
both primary storage.  One is HD-based and one is SSD-based; I use 
storage tags "HD" and "SSD" respectively.  The HD-based SAN is a single 
20TB volume with 1 iSCSI target and 1 LUN.  The SSD SAN is two 5TB 
volumes, each with 1 target and 1 LUN, in an Active-Active 
configuration.  The SSD SAN suffered from a misconfiguration issue, so 
we had to put it into maintenance mode in a hurry and shut it down.  I 
fully expected the volumes and VMs provisioned on the SSD SAN to be 
unavailable.  The problem is, CloudStack continued to try to access 
storage pool id 204, which is target 0 on the SSD SAN.  It shut every 
VM down, put all hypervisors into Alert state, and went into a loop 
trying to connect to a pool that is in maintenance mode.  This creates 
a very bad situation for me and my customers.

Regards,

Nik



-- 
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability

Re: Storage failure is not handled well in CS

Posted by Ahmad Emneina <Ah...@citrix.com>.
Hey Nik,

It appears the compute host, or cluster, can't connect to the SAN 
referenced in your log output.  Have you peered into the compute hosts' 
logs?  They should be more informative as to why the host can't connect 
to the storage.  You should also have at least one storage pool up to 
be able to provision against.



-- 
Æ




Re: Storage failure is not handled well in CS

Posted by Nik Martin <ni...@nfinausa.com>.
Bump?  This is a serious issue that I need to get resolved.  An entire 
cloud going down while one SAN is being repaired is a bad thing.  My 
cloud controller still refuses to start VMs because it cannot connect to 
a SAN that is in maintenance mode and is offline.




-- 
Regards,

Nik

Nik Martin
VP Business Development
Nfina Technologies, Inc.
+1.251.243.0043 x1003
Relentless Reliability