Posted to issues@cloudstack.apache.org by "Rohit Yadav (JIRA)" <ji...@apache.org> on 2018/11/12 05:43:00 UTC
[jira] [Commented] (CLOUDSTACK-10400) VPC Router Corruption when working with large number of networks containing instances with public IP addresses

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-10400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16683240#comment-16683240 ] 

Rohit Yadav commented on CLOUDSTACK-10400:
------------------------------------------

[~dubauski@gmail.com] please log this issue on GitHub, as use of Jira is deprecated: [https://github.com/apache/cloudstack/issues]

One of us can then help you investigate/triage the issue. /cc [~paulangus] 

> VPC Router Corruption when working with large number of networks containing instances with public IP addresses 
> ---------------------------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-10400
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-10400
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: API
>    Affects Versions: 4.11.1.0
>            Reporter: Barys Dubauski
>            Priority: Critical
>         Attachments: testCloudStack.jar
>
>
> We are using CloudStack 4.11.1 running with KVM hosts.  To simulate our use case, we created a small program that calls the CloudStack API to
> 1) create a VPC network with 20 guest networks, each containing one virtual machine with a public IP address allocated.
> 2) delete the machines and networks one by one.
>  
> However, we frequently get a timeout error, sometimes during VM deletion, sometimes during guest network deletion, and sometimes even during the static NAT disable step.  Once the timeout occurs, the VPC network / Virtual Router appears to be in an *unstable/corrupted* state.  We need to restart the Virtual Router with a clean-up option (sometimes several times, because the router VM fails to deploy as well).  After that, we can continue deleting the remaining environments.  Here are the high-level steps that we performed:
>  # Create VPC Network
>  # For each of the 20 "environments"
>  ## Create Guest Network
>  ## Add a VM to the network
>  ## Acquire Public IP
>  ## Associate the Public IP with VM
>  # For each of the 20 "environments"
>  ## Disassociate the Public IP
>  ## Delete VM
>  ## Delete Guest network
>  # Delete VPC
>  
> The hang / timeout can occur at any point during environment deletion.  The first few deletions may go through successfully, and then a later one fails.  The failure can happen at any stage, i.e. disassociating the public IP, deleting the VM, or deleting the guest network.  We looked at cloud.log, the agent log and the management server log but couldn’t find any obvious errors.  It seems that the management server sends the deletion request, but the VR does not respond and the system/network becomes stuck in an invalid state.  As a result, the network often gets stuck in the “Shutdown” state.
>  
> Here are some errors in the management server log:
> ============================================
>  2018-11-01 01:15:29,263 DEBUG [o.a.c.f.j.i.AsyncJobManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965) (logid:dbe80d4f) Complete async job-29965, jobStatus: FAILED, resultCode: 530, result: org.apache.cloudstack.api.response.ExceptionResponse/null/{"uuidList":[],"errorcode":530,"errortext":"Failed to delete network"}
>  2018-11-01 01:15:29,245 DEBUG [c.c.a.t.Request] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Seq 4-667095694804259240: Received:  { Ans: , MgmtId: 7474664765770, via: 4(cehv02.core.jazz.net), Ver: v1, Flags: 110, { GroupAnswer } }
>  2018-11-01 01:15:29,245 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Unable to destroy guest network on router VM*[DomainRouter|r-3388-VM]
>  2018-11-01 01:15:29,247 WARN  [c.c.n.r.VpcVirtualNetworkApplianceManagerImpl] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Failed to destroy guest network config Ntwk*[1122|Guest|12] on router VM[DomainRouter|r-3388-VM]
>  2018-11-01 01:15:29,247 WARN  [c.c.n.e.VpcVirtualRouterElement] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Failed to unplug nic in network Ntwk*[1122|Guest|12] for virtual router VM[DomainRouter|r-3388-VM]
>  2018-11-01 01:15:29,247 WARN  [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Unable to complete shutdown of the network elements due to element: VpcVirtualRouter*
>  2018-11-01 01:15:29,255 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) Lock is released for network Ntwk[1122|Guest|12] as a part of network shutdown
>  2018-11-01 01:15:29,256 DEBUG [o.a.c.e.o.NetworkOrchestrator] (API-Job-Executor-119:ctx-c14b2ab4 job-29965 ctx-eb2dda94) (logid:dbe80d4f) *Network is not not in the correct state to be destroyed: Shutdown*
> ============================================
>  
> I'm attaching the simple Java program that performs all of the steps described above and that allowed us to reproduce the bug consistently.
>  
> To use the application:
>  
> java -jar testCloudStack.jar <CloudStack API url: e.g. [http://foo:8080/client/api]> <apiKey> <secretKey> <zoneName>
>  
> Note that the test application works successfully with CloudStack server 4.9.2 but consistently reproduces the bug with CloudStack server 4.11.1.
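
The create/teardown order in the quoted steps can be sketched roughly as below. This is not the attached testCloudStack.jar, whose source is not shown in this message. The `Api` interface is a hypothetical stand-in for a client that sends one command and waits for its async job, and the mapping of each step to a CloudStack API command name (for example "Associate the Public IP with VM" to `enableStaticNat`) is an assumption based on the report's mention of a static NAT disable step.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of the create/teardown order described in the report (illustrative only). */
final class VpcChurnSketch {

    /** Hypothetical client: sends one API command for a target and blocks until its job ends. */
    public interface Api {
        void call(String command, String target) throws Exception;
    }

    // Assumed command names for the per-environment steps in the report.
    private static final String[] CREATE =
        {"createNetwork", "deployVirtualMachine", "associateIpAddress", "enableStaticNat"};
    private static final String[] TEARDOWN =
        {"disableStaticNat", "destroyVirtualMachine", "deleteNetwork"};

    /**
     * Runs the sequence and returns the steps attempted, in order.
     * Stops at the first failing call; that entry is prefixed with "FAILED ".
     */
    public static List<String> run(Api api, int environments) {
        List<String> log = new ArrayList<>();
        if (!step(api, log, "createVPC", "vpc")) return log;
        for (int i = 1; i <= environments; i++) {
            for (String command : CREATE) {
                if (!step(api, log, command, "env-" + i)) return log;
            }
        }
        for (int i = 1; i <= environments; i++) {
            for (String command : TEARDOWN) {
                if (!step(api, log, command, "env-" + i)) return log;
            }
        }
        step(api, log, "deleteVPC", "vpc");
        return log;
    }

    // Issues one call and records whether it succeeded.
    private static boolean step(Api api, List<String> log, String command, String target) {
        try {
            api.call(command, target);
            log.add(command + " " + target);
            return true;
        } catch (Exception e) {
            log.add("FAILED " + command + " " + target);
            return false;
        }
    }
}
```

Recording the last attempted step this way makes it easy to see which stage (static NAT disable, VM deletion, or network deletion) a given run stopped at, which the report says varies from run to run.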



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)