Posted to commits@cloudstack.apache.org by GitBox <gi...@apache.org> on 2019/12/27 20:58:26 UTC

[GitHub] [cloudstack] andrijapanicsb edited a comment on issue #3721: network: cleanup dhcp/dns entries while remove a nic from vm

URL: https://github.com/apache/cloudstack/pull/3721#issuecomment-569343955
 
 
   Alright, here we go (apologies for the long one...):
   
   Working fine and in general LGTM.
   
   The "garbage" is removed fine when detaching the VM from the network while:
   - VM running, VR running
   - VM stopped, VR running
   
   But there are some edge cases, and I'm not sure they can be addressed at all (the same or worse happens when a VM is expunged while the VR is stopped - see near the end of this comment).
   
   **Issue: VM stopped, VR stopped, VM detached from the network, VR started - garbage is left inside the VR.**
   To reproduce:
   - Have a VM attached to an additional (shared) network. 
   - Start VM - DHCP/DNS stuff provisioned fine.
   - Stop VM (nothing deleted from VR - fine!), stop the VR.
   - Detach the VM from that additional network while its VR is stopped - the VR can't be contacted, so the "garbage" can't be removed.
   
   Now do a) or b):
   - a) Start the VR - all the "garbage" is there (recreate.systemvm.enabled=false - default behaviour)
   - a) Start the VM, attach the VM again to the same network - the VM's record is added/updated in /etc/dhcphosts.txt, while the record in /etc/hosts is deleted (a new one is not added!), and the old/garbage lease is deleted from /var/lib/misc/dnsmasq.leases
   - b) Start the VR - all the "garbage" is there (recreate.systemvm.enabled=false)
   - b) Attach the VM again to the same network, start the VM - duplicate VM rows (different IP/MAC) in all 3 files (new records provisioned, old ones not removed). DNS resolution is broken due to the duplicate records in /etc/hosts.
   - b) Detach the network from the VM again - only the new records are cleaned up.
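   As a rough sketch of the broken state in case b), the snippet below fabricates an /etc/hosts-style file with a duplicated entry and detects hostnames that resolve to more than one IP (the file contents, hostnames, and IPs are made up for illustration; on the real VR the file is /etc/hosts):

   ```shell
   # Illustrative only: simulate the duplicate-rows state in a temp file.
   HOSTS_FILE=$(mktemp)
   cat > "$HOSTS_FILE" <<'EOF'
   10.1.1.55 myvm
   10.1.1.73 myvm
   10.1.1.1 r-123-VM
   EOF

   # A hostname appearing with more than one IP is what breaks DNS resolution:
   DUPES=$(awk '{count[$2]++} END {for (h in count) if (count[h] > 1) print h}' "$HOSTS_FILE")
   echo "$DUPES"
   rm -f "$HOSTS_FILE"
   ```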
   
   I'm not sure how/if this can be fixed.
   The proper workaround is to restart the network with the cleanup option.
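   For reference, that cleanup restart can be triggered via CloudMonkey, assuming it is already configured against the management server (the network UUID below is a placeholder):

   ```shell
   restart network id=<network-uuid> cleanup=true
   ```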
   
   **Additionally, if VM is expunged while the VR is stopped, later starting the VR (recreate.systemvm.enabled=false) will result in garbage being left and never removed.**
   
   I believe both issues (detaching a NIC or expunging a VM while the VR is stopped) are somewhat edge cases, but they can happen in e.g. the following scenario:
   - A single VM in the network.
   - Delete the VM withOUT expunging it (the default behaviour for a regular user).
   - With the default value of 1 day for expunge.delay and expunge.interval, the VR will be stopped after network.gc.wait + network.gc.interval (600-1200 seconds), so when the VM is finally expunged 86400+ seconds later, the VR is already down. This leaves garbage in the VR if it is started again, e.g. 2 days after the VM deletion. This can happen in test scenarios and other very small environments.
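   The timing above can be sketched with plain shell arithmetic (the values are the defaults mentioned in this comment; the variable names are illustrative, not CloudStack setting keys):

   ```shell
   # Defaults referenced above: network.gc.wait=600, network.gc.interval=600,
   # expunge.delay=86400 seconds (1 day).
   GC_WAIT=600
   GC_INTERVAL=600
   EXPUNGE_DELAY=86400

   # Worst case, the VR is stopped within gc.wait + gc.interval of the last VM stopping:
   VR_STOP_LATEST=$((GC_WAIT + GC_INTERVAL))

   # By the time the expunge tries to clean the VR's DHCP/DNS entries,
   # the VR has already been down for almost a full day:
   VR_DOWN_FOR=$((EXPUNGE_DELAY - VR_STOP_LATEST))
   echo "$VR_DOWN_FOR seconds"
   ```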
   
   @weizhouapache @rhtyd @PaulAngus @wido @nvazquez @DaanHoogland @onitake @GabrielBrascher @nathanejohnson @kiwiflyer (pinging you on the below **only** - no need to read above unless you are interested in that specifically):
   I'm wondering if it would make sense to make "recreate.systemvm.enabled=**TRUE**" the default value in 4.14 and onwards, since:
   - it doesn't carry over the garbage from the VR's old ROOT disk, since a new ROOT disk is created:
     - no orphaned DHCP/DNS configuration data (issues explained in this comment)
     - no potential log garbage (not that much of the issue afaik)
     - /var/cache/cloud/processed files can be huge on old VRs
   - reusing the old disk doesn't bring any usable (day-to-day) improvement for the cloud operator/normal user:
     - the CPVM and SSVM are not restarted daily, and the cloud operator can wait 2 more minutes for the CPVM/SSVM to be configured from scratch vs. booting an existing/configured VM
     - VRs are also not restarted daily, and a VR is stopped either manually (you planned some actions, so you can wait 2 more minutes) or automatically by network.gc when there are no Running VMs in the network - again, one can wait 2 more minutes for a clean VR when the very first VM is started in that existing network.
   
   Having "recreate.systemvm.enabled=**FALSE**" by default (current behaviour):
   - keeps all the garbage as explained above
   
   Opinions?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services