Posted to users@cloudstack.apache.org by Indra Pramana <in...@sg.or.id> on 2016/05/01 19:53:29 UTC

CloudStack agent shuts down VMs upon reconnecting to Management server

Dear all,

We are using CloudStack 4.2.0, KVM hypervisor and Ceph RBD storage. We
have been having a specific problem, possibly since the first day we
started using CloudStack, which we suspect is related to HA.

When a CloudStack agent gets disconnected from the management server for
any reason, CloudStack gradually marks some or all of the VMs on the
disconnected host as "Stopped", even though they are actually still
running on that host. When I try to reconnect the agent, CloudStack seems
to instruct the agent to stop those VMs first, and is busy shutting them
down one by one while the host is in the "Connecting" state, before the
host can reach the "Up" state.

This causes all the VMs on the host (with the disconnected agent) to go
down unnecessarily, even though technically they could stay up while the
agent reconnects to the management server.

Is there a way we can prevent CloudStack from shutting down the VMs
during agent re-connection? Relevant logs from the management server and
the agent are below; it seems HA is the culprit.

Any advice is appreciated.

Excerpts from the management server logs -- in the example below, the
hostname of the affected VM on the disconnected host is "vm-hostname",
and the excerpt is the result of grepping "vm-hostname" from the logs.
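
For reference, an excerpt like this can be produced with a grep along
these lines (the log path is an assumption on my part; it may differ by
distribution):

=====
grep "vm-hostname" /var/log/cloudstack/management/management-server.log
=====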

=====
2016-04-30 23:24:32,680 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(Timer-1:null) Schedule vm for HA:  VM[User|vm-hostname]
2016-04-30 23:24:35,565 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-1:work-11007) HA on VM[User|vm-hostname]
2016-04-30 23:24:35,571 DEBUG [cloud.ha.CheckOnAgentInvestigator]
(HA-Worker-1:work-11007) Unable to reach the agent for
VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with
specified id is not in the right state: Disconnected
2016-04-30 23:24:35,571 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-1:work-11007) SimpleInvestigator found VM[User|vm-hostname]to be
alive? null
2016-04-30 23:24:35,571 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-1:work-11007) XenServerInvestigator found VM[User|vm-hostname]to
be alive? null
2016-04-30 23:24:35,571 DEBUG [cloud.ha.UserVmDomRInvestigator]
(HA-Worker-1:work-11007) testing if VM[User|vm-hostname] is alive
2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator]
(HA-Worker-1:work-11007) VM[User|vm-hostname] could not be pinged,
returning that it is unknown
2016-04-30 23:24:35,581 DEBUG [cloud.ha.UserVmDomRInvestigator]
(HA-Worker-1:work-11007) Returning null since we're unable to determine
state of VM[User|vm-hostname]
2016-04-30 23:24:35,581 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
(HA-Worker-1:work-11007) Not a System Vm, unable to determine state of
VM[User|vm-hostname] returning null
2016-04-30 23:24:35,582 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
(HA-Worker-1:work-11007) Testing if VM[User|vm-hostname] is alive
2016-04-30 23:24:35,586 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
(HA-Worker-1:work-11007) Unable to find a management nic, cannot ping this
system VM, unable to determine state of VM[User|vm-hostname] returning null
2016-04-30 23:24:35,586 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-1:work-11007) null found VM[User|vm-hostname]to be alive? null
2016-04-30 23:24:35,588 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-1:work-11007) KVMInvestigator found VM[User|vm-hostname]to be
alive? null
2016-04-30 23:24:35,592 DEBUG [cloud.ha.KVMFencer] (HA-Worker-1:work-11007)
Unable to fence off VM[User|vm-hostname] on Host[-34-Routing]
2016-04-30 23:24:35,592 DEBUG [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-1:work-11007) We were unable to fence off the VM
VM[User|vm-hostname]
2016-04-30 23:24:35,592 WARN  [apache.cloudstack.alerts]
(HA-Worker-1:work-11007)  alertType:: 8 // dataCenterId:: 6 // podId:: 6 //
clusterId:: null // message:: Unable to restart vm-hostname which was
running on host name: hypervisor-host(id:34), availability zone:
xxxxxxxxxx-Singapore-01, pod: xxxxxxxxxx-Singapore-Pod-01
2016-04-30 23:24:41,028 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(AgentConnectTaskPool-4:null) Both states are Running for
VM[User|vm-hostname]
=====

The above keeps looping until the CloudStack management server decides to
do a force stop, as follows:

=====
2016-05-01 00:30:23,305 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-3:work-11249) HA on VM[User|vm-hostname]
2016-05-01 00:30:23,311 DEBUG [cloud.ha.CheckOnAgentInvestigator]
(HA-Worker-3:work-11249) Unable to reach the agent for
VM[User|vm-hostname]: Resource [Host:34] is unreachable: Host 34: Host with
specified id is not in the right state: Disconnected
2016-05-01 00:30:23,311 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-3:work-11249) SimpleInvestigator found VM[User|vm-hostname]to be
alive? null
2016-05-01 00:30:23,311 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-3:work-11249) XenServerInvestigator found VM[User|vm-hostname]to
be alive? null
2016-05-01 00:30:23,311 DEBUG [cloud.ha.UserVmDomRInvestigator]
(HA-Worker-3:work-11249) testing if VM[User|vm-hostname] is alive
2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator]
(HA-Worker-3:work-11249) VM[User|vm-hostname] could not be pinged,
returning that it is unknown
2016-05-01 00:30:35,499 DEBUG [cloud.ha.UserVmDomRInvestigator]
(HA-Worker-3:work-11249) Returning null since we're unable to determine
state of VM[User|vm-hostname]
2016-05-01 00:30:35,499 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
(HA-Worker-3:work-11249) Not a System Vm, unable to determine state of
VM[User|vm-hostname] returning null
2016-05-01 00:30:35,499 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
(HA-Worker-3:work-11249) Testing if VM[User|vm-hostname] is alive
2016-05-01 00:30:35,505 DEBUG [cloud.ha.ManagementIPSystemVMInvestigator]
(HA-Worker-3:work-11249) Unable to find a management nic, cannot ping this
system VM, unable to determine state of VM[User|vm-hostname] returning null
2016-05-01 00:30:35,505 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-3:work-11249) null found VM[User|vm-hostname]to be alive? null
2016-05-01 00:30:35,558 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-3:work-11249) KVMInvestigator found VM[User|vm-hostname]to be
alive? null
2016-05-01 00:30:35,688 WARN  [cloud.vm.VirtualMachineManagerImpl]
(HA-Worker-3:work-11249) Unable to actually stop VM[User|vm-hostname] but
continue with release because it's a force stop
2016-05-01 00:30:35,693 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(HA-Worker-3:work-11249) VM[User|vm-hostname] is stopped on the host.
Proceeding to release resource held.
2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(HA-Worker-3:work-11249) Successfully released network resources for the vm
VM[User|vm-hostname]
2016-05-01 00:30:35,698 DEBUG [cloud.vm.VirtualMachineManagerImpl]
(HA-Worker-3:work-11249) Successfully released storage resources for the vm
VM[User|vm-hostname]
2016-05-01 00:31:38,426 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-3:work-11183) HA on VM[User|vm-hostname]
2016-05-01 00:31:38,426 INFO  [cloud.ha.HighAvailabilityManagerImpl]
(HA-Worker-3:work-11183) VM VM[User|vm-hostname] has been changed.  Current
State = Stopped Previous State = Running last updated = 113 previous
updated = 111
=====

Below are excerpts from the corresponding agent.log; note that
i-1082-3086-VM is the instance name corresponding to the vm-hostname
example above:

=====
2016-04-30 23:24:36,592 DEBUG [kvm.resource.LibvirtComputingResource]
(Agent-Handler-1:null) Detecting a new state but couldn't find a old state
so adding it to the changes: i-1082-3086-VM
=====

After the CloudStack management server decides to mark the VM as stopped,
the agent tries to shut down the VM when the agent reconnects to the
management server:

=====
2016-05-01 00:32:32,029 DEBUG [cloud.agent.Agent]
(agentRequest-Handler-3:null) Processing command:
com.cloud.agent.api.StopCommand
2016-05-01 00:32:32,063 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-3:null) Executing:
/usr/share/cloudstack-common/scripts/vm/network/security_group.py
destroy_network_rules_for_vm --vmname i-1082-3086-VM --vif vnet11
2016-05-01 00:32:32,195 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-3:null) Execution is successful.
2016-05-01 00:32:32,196 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-3:null) Try to stop the vm at first
=====

and

=====
2016-05-01 00:33:04,835 DEBUG [kvm.resource.LibvirtComputingResource]
(agentRequest-Handler-3:null) successfully shut down vm i-1082-3086-VM
2016-05-01 00:33:04,836 DEBUG [utils.script.Script]
(agentRequest-Handler-3:null) Executing: /bin/bash -c ls
/sys/class/net/breth1-8/brif | grep vnet
2016-05-01 00:33:04,847 DEBUG [utils.script.Script]
(agentRequest-Handler-3:null) Execution is successful.
=====
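
For what it's worth, the state mismatch can be confirmed before
reconnecting the agent by comparing the management server's recorded
state with libvirt's view on the host (a sketch, assuming direct access
to the management database; table and column names as in 4.2):

=====
# On the management server: CloudStack's recorded state for the VM
mysql -u cloud -p -e "SELECT instance_name, state, host_id FROM cloud.vm_instance WHERE instance_name = 'i-1082-3086-VM';"

# On the hypervisor host: libvirt's actual state for the same VM
virsh list --all | grep i-1082-3086-VM
=====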

- Is there a way for us to prevent the above scenario from happening?
- Is disabling HA on the VM the only way to prevent it?
- We understand that disabling HA requires applying a new service offering
to each VM and restarting the VM for the change to take effect (a sketch
of that flow follows this list). Is there a way to disable HA globally
without changing the service offering for each VM?
- Is it possible to avoid the above scenario without disabling HA and
losing the HA features and functionality?
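
On the per-VM route, the flow would presumably be along these lines (a
sketch only -- the offering values are illustrative, the UUIDs are
placeholders, and the VM must be stopped before its offering can be
changed):

=====
# Create an otherwise-identical compute offering with HA disabled
cloudmonkey create serviceoffering name=2vCPU-4GB-noHA \
  displaytext=2vCPU-4GB-noHA cpunumber=2 cpuspeed=2000 memory=4096 \
  offerha=false

# Stop the VM, switch it to the new offering, then start it again
cloudmonkey stop virtualmachine id=<vm-uuid>
cloudmonkey change serviceforvirtualmachine id=<vm-uuid> \
  serviceofferingid=<new-offering-uuid>
cloudmonkey start virtualmachine id=<vm-uuid>
=====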

Any advice is greatly appreciated.

Looking forward to your reply, thank you.

Cheers.

Re: CloudStack agent shuts down VMs upon reconnecting to Management server

Posted by Simon Weller <sw...@ena.com>.
Indra,

Take a look at investigate.retry.interval and restart.retry.interval.
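
For example, both can be raised via the updateConfiguration API (the
values here are purely illustrative; note that most global settings only
take effect after a management server restart):

=====
cloudmonkey update configuration name=investigate.retry.interval value=300
cloudmonkey update configuration name=restart.retry.interval value=1200
=====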

- Si

________________________________________
From: Indra Pramana <in...@sg.or.id>
Sent: Sunday, May 1, 2016 8:42 PM
To: users@cloudstack.apache.org
Subject: Re: CloudStack agent shuts down VMs upon reconnecting to Management server

[quoted message snipped; it appears in full as the next post in this
thread, and the original message is at the top of this thread]

Re: CloudStack agent shuts down VMs upon reconnecting to Management server

Posted by Indra Pramana <in...@sg.or.id>.
Dear all,

I received advice from a helpful person on the IRC channel to increase
the HA timer, which I suppose is the delay before HA workers are started
when a host disconnects. However, I can't seem to find that setting in
CloudStack's global settings. Does anyone know how to set this up? I can
only find these two HA-related settings in the global settings:

ha.tag -- HA tag defining that the host marked with this tag can be used
for HA purposes only

ha.workers -- Number of HA worker threads (default: 5)
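
The global settings can also be searched by keyword via the API, which
may surface HA-related knobs beyond these two; a CloudMonkey sketch:

=====
cloudmonkey list configurations keyword=ha
cloudmonkey list configurations keyword=retry
=====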


I also noted that the default value of ha.workers is 5, and that I can
actually set it to 0 -- would that prevent the HA workers from being
started, and would there be any impact on overall CloudStack operations?
I'm looking at setting this temporarily until I can find a more proper
solution to the problem.
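
Setting it would presumably look like the following (whether 0 is
actually an accepted value I have not verified, and a management server
restart would be needed for the worker pool size to change):

=====
cloudmonkey update configuration name=ha.workers value=0
=====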

Looking forward to your reply, thank you.

Cheers.


On Mon, May 2, 2016 at 1:53 AM, Indra Pramana <in...@sg.or.id> wrote:

> [original message quoted in full; snipped -- see the first post in this
> thread]