Posted to issues@cloudstack.apache.org by "Prachi Damle (JIRA)" <ji...@apache.org> on 2013/12/12 01:44:07 UTC

[jira] [Commented] (CLOUDSTACK-4620) Vm failed to start on the host on which it was running due to not having enough reservedMem when the host was powered on after being shutdown.

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-4620?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13845926#comment-13845926 ] 

Prachi Damle commented on CLOUDSTACK-4620:
------------------------------------------

Root cause analysis:
---------------------------
Scenario Observed:
------------------------
The VM 'tempsnap' fails to find any reserved capacity on the host:

2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533
186b3cbed ]) STATS: Failed to alloc resource from host: 1 reservedCpu: 1500, requested cpu: 500, reservedMem: 0, requested
 mem: 536870912
2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533
186b3cbed ]) Host does not have enough reserved RAM available, cannot allocate to this host.

Even when there is no reserved capacity left, we check whether the host has any free capacity to start the VM.
While performing this check we find that the host has crossed the CPU threshold limit, so no more VMs can be allocated to it. Hence starting this VM errors out. Logs:

2013-09-05 12:52:44,943 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3c
bed ]) Cannot allocate cluster list [1] for vm creation since their allocated percentage crosses the disable capacity thre
shold defined at each cluster/ at global value for capacity Type : 1, skipping these clusters
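
The two-step check described above (reserved capacity first, then free capacity gated by the disable threshold) can be modeled with a small sketch. This is illustrative Python pseudocode, not CloudStack's actual Java code; the function names are hypothetical, and 0.85 stands in for the cluster's configurable disable-capacity threshold. The numbers are taken from the log lines.

```python
def can_use_last_host(reserved_cpu, reserved_mem, req_cpu, req_mem):
    # Step 1: the last host is tried against its *reserved* capacity first.
    # Both CPU and RAM must fit within what is reserved.
    return reserved_cpu >= req_cpu and reserved_mem >= req_mem

def cluster_allowed(used_plus_reserved_cpu, total_cpu, disable_threshold=0.85):
    # Step 2: falling back to free capacity is gated by the cluster's
    # disable-capacity threshold (value here is illustrative).
    return used_plus_reserved_cpu / total_cpu <= disable_threshold

# From the log: reservedCpu=1500, reservedMem=0,
# requested cpu=500, requested mem=536870912, host total CPU=9040.
print(can_use_last_host(1500, 0, 500, 536870912))  # False: reserved RAM is 0
```

With reserved RAM at zero, step 1 fails even though reserved CPU (1500) would cover the request (500), which is exactly the state the log shows.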

However, after some time the same VM starts back fine. This happens because the CapacityChecker thread runs in between and corrects the host's capacity numbers.

Problem: 
-------------
Why does the host cross the CPU threshold limit if no new VMs are being deployed?


When the host is shut down, the SSVM and CPVM keep trying to start back, over and over. 

In the case of the SSVM, every time a new SSVM is created it is allocated using the host's available free capacity. When it fails to start, the SSVM is destroyed and the allocated capacity is freed, so the SSVM does not cause any capacity bug.

But in the case of the CPVM, the same CPVM entry is reused, so CloudStack tries to start it on the last host using the reserved capacity. However, when the start fails, the capacity is not added back to the reserved quota. Thus each CPVM retry subtracts capacity from the reserved quota but never adds it back on failure to start.
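
A minimal sketch of this asymmetry (a toy model with hypothetical names; the real bookkeeping lives in CapacityManagerImpl, and the sizes below are illustrative, not taken from the bug):

```python
class HostCapacity:
    """Toy model of a host's reserved-RAM bookkeeping; illustrative only."""
    def __init__(self, reserved_mem):
        self.reserved_mem = reserved_mem

    def allocate_from_reserved(self, mem):
        # Each CPVM start attempt deducts from the reserved quota...
        self.reserved_mem = max(0, self.reserved_mem - mem)

    def release_on_start_failure(self, mem):
        # ...but this add-back is what the buggy path never performs,
        # so repeated failed starts drain reserved_mem to zero.
        self.reserved_mem += mem

host = HostCapacity(reserved_mem=4 * 2**30)  # 4 GiB reserved by stopped VMs
cpvm_mem = 1 * 2**30                         # CPVM RAM, illustrative
for _ in range(5):                           # five failed start retries
    host.allocate_from_reserved(cpvm_mem)    # deducted, never released back
print(host.reserved_mem)                     # 0: reserved RAM exhausted
```

If release_on_start_failure were called on each failed start, reserved_mem would stay constant across retries; because it is not, the quota leaks on every attempt.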

Now when CS detects that the host is down, all user VMs enter the 'Stopped' state and each VM's capacity is moved into the reserved quota. The CPVM retries keep reducing this quota, and since the RAM requirement of the CPVM is higher than that of a user VM, the reserved RAM is driven to zero faster than the reserved CPU.

When the host comes back, all user VMs try to start again. They first try to use the reserved capacity, but since the reserved RAM is zero, they consume free capacity instead. Thus the user VMs keep increasing the host's 'used' CPU value without reducing the 'reserved' CPU (which was reserved when they entered the Stopped state).

So at some point the (used + reserved) CPU crosses the threshold limit, causing failures when starting any more user VMs.
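
In rough numbers, using the host's 9040 MHz total CPU and the 500 MHz per-VM request from the log (the 0.85 threshold is a stand-in for the cluster's configurable disable-capacity setting, and the stale 1500 MHz reservation is illustrative):

```python
# Illustrative arithmetic for the drift described above.
total_cpu = 9040       # host CPU from the log, in MHz
reserved_cpu = 1500    # CPU still reserved for stopped VMs (never drawn down)
threshold = 0.85       # stand-in for the cluster disable-capacity threshold
vm_cpu = 500           # each user VM requests 500 MHz

used_cpu = 0
starts = 0
# Each start consumes free capacity (reserved RAM is 0, so the reserved CPU
# is never reduced), so used + reserved only grows until the threshold trips.
while (used_cpu + reserved_cpu + vm_cpu) / total_cpu <= threshold:
    used_cpu += vm_cpu
    starts += 1
print(starts)          # 12: the thirteenth VM is refused by the threshold
```

Under these assumed numbers, a dozen user VMs start before (used + reserved) CPU exceeds the threshold and further allocations on the host are blocked, matching the failure pattern in the logs.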

Why is this not a big issue?
-------------
The above situation is corrected when the CapacityChecker thread runs and recalculates the host's reserved capacity. The thread runs every 5 minutes.

Thus on the next try the user VM starts fine, because this thread has corrected the (used + reserved) > threshold situation.
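
What the periodic CapacityChecker does, in sketch form (hypothetical names; the real thread recomputes a host's capacity rows from the VMs actually running or stopped on it, discarding any drift from failed starts):

```python
def recalc_host_capacity(running_vms_cpu, stopped_vms_cpu):
    # Recompute 'used' from actually-running VMs and 'reserved' from
    # actually-stopped VMs, replacing whatever drifted values were stored.
    used_cpu = sum(running_vms_cpu)
    reserved_cpu = sum(stopped_vms_cpu)
    return used_cpu, reserved_cpu

# Before the checker runs, the host carries a stale reservation left over
# from the VMs that already started back. After recalculation, 'reserved'
# reflects only the VMs that are genuinely still stopped.
used, reserved = recalc_host_capacity([500] * 12, [500])
print(used, reserved)  # 6000 500
```

Once the stale reservations are dropped, (used + reserved) falls back under the threshold, which is why the same VM starts cleanly a few minutes later.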

> Vm failed to start on the host on which it was running due to not having enough reservedMem when the host was powered on after being shutdown.
> ----------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-4620
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-4620
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Management Server
>    Affects Versions: 4.2.1
>         Environment: Build from 4.2-forward
>            Reporter: Sangeetha Hariharan
>            Assignee: Prachi Damle
>             Fix For: 4.3.0
>
>         Attachments: hostdown.rar
>
>
> Vm failed to start on the host on which it was running due to not having enough reservedMem when the host was powered on after being shutdown
> Steps to reproduce the problem:
> Advanced zone with 1 cluster having 1 host (Xenserver).
> Had SSVM, CPVM, 2 routers and a few user VMs running on the host.
> Power down the host. 
> After few hours, powered on the host.
> All the Vms running on this host were marked "Stopped".
> Tried to start all the user Vms running in this host.
> 1 of the user Vms fails to start because of not having enough "Reserved RAM"
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533
> 186b3cbed ]) Reserved RAM: 0 , Requested RAM: 536870912
> When I tried to start the same VM again after a few minutes, it started successfully on the same host.
> Seems like there is some issue with releasing the capacity when all the Vms get marked as "Stopped" by VM sync process.
> The VM that failed to start because of capacity, and then eventually succeeded when started after a few minutes, is "temfromsnap".
> Management server logs when the VM fails to start on its last_host_id:
> 2013-09-05 12:52:44,934 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-
> a441-533186b3cbed ]) DeploymentPlanner allocation algorithm: com.cloud.deploy.FirstFitPlanner_EnhancerByCloudStack_b297c61
> b@7e43d432
> 2013-09-05 12:52:44,934 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-
> a441-533186b3cbed ]) Trying to allocate a host and storage pools from dc:1, pod:1,cluster:1, requested cpu: 500, requested
>  ram: 536870912
> 2013-09-05 12:52:44,934 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-
> a441-533186b3cbed ]) Is ROOT volume READY (pool already allocated)?: Yes
> 2013-09-05 12:52:44,934 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-
> a441-533186b3cbed ]) This VM has last host_id specified, trying to choose the same host: 1
> 2013-09-05 12:52:44,938 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533
> 186b3cbed ]) Checking if host: 1 has enough capacity for requested CPU: 500 and requested RAM: 536870912 , cpuOverprovisio
> ningFactor: 1.0
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533
> 186b3cbed ]) Hosts's actual total CPU: 9040 and CPU after applying overprovisioning: 9040
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533
> 186b3cbed ]) We need to allocate to the last host again, so checking if there is enough reserved capacity
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533
> 186b3cbed ]) Reserved CPU: 1500 , Requested CPU: 500
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533
> 186b3cbed ]) Reserved RAM: 0 , Requested RAM: 536870912
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533
> 186b3cbed ]) STATS: Failed to alloc resource from host: 1 reservedCpu: 1500, requested cpu: 500, reservedMem: 0, requested
>  mem: 536870912
> 2013-09-05 12:52:44,940 DEBUG [cloud.capacity.CapacityManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533
> 186b3cbed ]) Host does not have enough reserved RAM available, cannot allocate to this host.
> 2013-09-05 12:52:44,940 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-
> a441-533186b3cbed ]) The last host of this VM does not have enough capacity
> 2013-09-05 12:52:44,940 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-
> a441-533186b3cbed ]) Cannot choose the last host to deploy this VM
> 2013-09-05 12:52:44,940 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3c
> bed ]) Searching resources only under specified Cluster: 1
> 2013-09-05 12:52:44,943 DEBUG [cloud.deploy.FirstFitPlanner] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-a441-533186b3c
> bed ]) Cannot allocate cluster list [1] for vm creation since their allocated percentage crosses the disable capacity thre
> shold defined at each cluster/ at global value for capacity Type : 1, skipping these clusters
> 2013-09-05 12:52:44,948 DEBUG [cloud.deploy.DeploymentPlanningManagerImpl] (Job-Executor-26:job-84 = [ ac245729-bfda-4e77-
> a441-533186b3cbed ]) Deploy avoids pods: [], clusters: [1], hosts: []



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)