Posted to commits@cloudstack.apache.org by GitBox <gi...@apache.org> on 2021/07/14 14:39:52 UTC

[GitHub] [cloudstack] Slair1 edited a comment on pull request #3915: Incorporate VR OOB start checks to properly HA the VR

Slair1 edited a comment on pull request #3915:
URL: https://github.com/apache/cloudstack/pull/3915#issuecomment-879948903


   > @ggoodrich-ipp @nvazquez
   > I do not think we can merge this pr for now.
   > 
   > 1. VR has HA, cloudstack will start it on other hosts if host is determined to be DOWN. hence there are two duplicated VRs running (on old and new host). this pr cannot solve the issue.
   > 2. if cloudstack does not start VR on other host, because the host is Up again, the control IP of VR is not changed. this pr is not needed.
   > 3. if VR is started out-of-band (eg virsh start), CheckRouter checks if control IP is reachable. we do not know if iptables rules or services are configured correctly.
   > 
   > @ggoodrich-ipp did you face this issue in a real environment ? or reproduce the issue (not hack the db) in a test environment ?
   @nvazquez 
   
   We did face this issue in a real environment. The scenario occurs when the KVM agent is stopped and CloudStack thinks the host is down, but the host is in fact up and all VMs are still running - only the agent is down. In this scenario (I think the original PR description is accurate):
   
   VM HA runs for the router and, as part of that, its 169.x.x.x control IP is unallocated. It then tries to power on the router on another host, and as part of that process it allocates a NEW 169.x.x.x control IP and writes it to the DB. However, since the router isn't actually down (the host is up, only the agent is down), the VM HA then fails, because the vRouter is still running on the problem host. At this point the DB has already been changed: the control IP is different.
   
   Next, when the host agent comes back online, it sends a power report to the management servers, and the management servers see the router as ON. However, the GUI will not show a control IP for the vRouter, and the DB will hold the NEW control IP that was allocated during the failed VM HA event. This leaves us unable to communicate with the vRouter.
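
The sequence above can be sketched as a small simulation. This is only an illustration of the ordering of events, not CloudStack code: `ControlIpPool`, `simulate_failed_ha`, the `db` dictionary, and the `169.254.0.x` addresses are all hypothetical names and values chosen for the example.

```python
import itertools


class ControlIpPool:
    """Hypothetical stand-in for the control IP allocator (not a CloudStack class)."""

    def __init__(self):
        self._counter = itertools.count(10)
        self.allocated = set()

    def allocate(self):
        # Assumed address range, used only for illustration.
        ip = f"169.254.0.{next(self._counter)}"
        self.allocated.add(ip)
        return ip

    def release(self, ip):
        self.allocated.discard(ip)


def simulate_failed_ha():
    pool = ControlIpPool()

    # Initial state: the router is running and the DB matches the IP
    # actually configured inside the router.
    live_ip = pool.allocate()
    db = {"state": "Running", "control_ip": live_ip}

    # The agent stops, the host is assumed down, and VM HA starts.
    # Step 1: the old control IP is unallocated.
    pool.release(db["control_ip"])
    db["control_ip"] = None

    # Step 2: a NEW control IP is allocated and written to the DB.
    db["control_ip"] = pool.allocate()

    # Step 3: starting on another host fails, because the router is
    # still running on the original host. The DB change is not undone.
    ha_succeeded = False

    # Step 4: the agent reconnects and its power report shows the router ON.
    db["state"] = "Running"

    return live_ip, db, ha_succeeded


live_ip, db, ha_succeeded = simulate_failed_ha()
print("IP configured in the running router:", live_ip)
print("IP recorded in the DB:              ", db["control_ip"])
```

At the end of the run the router is reported as running, but the recorded control IP no longer matches the one the router is actually using, which is the mismatch described above.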

