Posted to issues@cloudstack.apache.org by "ASF GitHub Bot (JIRA)" <ji...@apache.org> on 2015/10/17 17:58:05 UTC

[jira] [Commented] (CLOUDSTACK-8952) The redundant routers are facing a race condition due to several KeepaliveD/ConntrackD restarts

    [ https://issues.apache.org/jira/browse/CLOUDSTACK-8952?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14961950#comment-14961950 ] 

ASF GitHub Bot commented on CLOUDSTACK-8952:
--------------------------------------------

GitHub user wilderrodrigues opened a pull request:

    https://github.com/apache/cloudstack/pull/940

    CLOUDSTACK-8952 - The redundant routers are facing a race condition due to several KeepaliveD/ConntrackD restarts

    This PR fixes the following issues:
    
    * KeepAliveD being restarted for each action performed on the routers
    * ConntrackD configuration being copied for each action performed on the routers, causing several restarts
    * ACS Management Server relying on the JSON file to report which router is Master/Backup
    * Public interface on both routers is in UP state, because several places check whether the interface is UP/DOWN and try to do KeepAliveD's job
    * Removing all the sleeps from test_vpc_redundant.py - those are no longer needed
    * When KeepAliveD calls master.py during the election, update the cmdline.json to set the router in Backup mode: the election will take care of changing it afterwards.
    * Add LB stats_rules to iptables INPUT chain
    * The RVR public interface is set to eth2 instead of eth1 - as in the rVPC. Make sure the check works in both cases
    
    Those fixes make all the routers very stable, with ACL, FW, PF and LB working just fine!

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/ekholabs/cloudstack fix/rvr__keepalived_restart

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/cloudstack/pull/940.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #940
    
----
commit 08b983fe022d309c5f49f776cce7c2b4a3f01cfd
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-14T09:21:53Z

    CLOUDSTACK-8952 - Remove the '--vrrp' search criteria from the CsProcess constructor call
    
       - There is no such process, which makes the CsProcess.find return false and restart keepalived all the time.

commit 5a216056b5a325b8abbe6f7c20f98caf202a27bc
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-14T12:13:24Z

    CLOUDSTACK-8952 - Do not replace the conntrackd config file unless it's needed
    
       - With the new logic, the file will be replaced when the router starts, because the default
         conntrackd config file will be different.

commit b4920aa028e75c64160988113ac268e5ea5ae69e
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-14T12:24:11Z

    CLOUDSTACK-8952 - Do not restart conntrackd unless it's needed
    
       - With keepalived fixed, the restarts should not be needed anymore, so as a first step I am reducing them drastically
       - I am now making a backup of the template file, writing to the template file and comparing it with the existing configuration
       - The template file is recovered after the process
       - I also check if the process is running
       - I fixed a bug in the compare method
       - I am now updating the configuration variable once the file content is flushed to disk
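
The restart decision described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: the function name `needs_restart`, its parameters and the placeholder configuration strings are all hypothetical, and it assumes that a stopped daemon should always be started.

```python
def needs_restart(current_config, desired_config, is_running):
    """Return True when the daemon should be (re)started.

    Hypothetical helper: the real router scripts may decide differently.
    """
    # Assumption: a daemon that is not running must always be started.
    if not is_running:
        return True
    # Otherwise restart only when the rendered config differs from the one on disk.
    return current_config != desired_config


# Placeholder contents; these are not real conntrackd settings.
desired = "placeholder-setting 1\n"

unchanged = needs_restart(desired, desired, is_running=True)
changed = needs_restart("placeholder-setting 0\n", desired, is_running=True)
stopped = needs_restart(desired, desired, is_running=False)
```

With identical content and a running process nothing is restarted, which is the case that previously triggered a restart on every router action.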

commit d762dc8579a3ee40c762559d62affdf44194e853
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-15T10:44:28Z

    CLOUDSTACK-8952 - The public interface was coming UP in the Backup router
    
       - There were too many places trying to put the pub interface UP. I centralised it now.

commit 1886c4a1b33c2cd75bd5e49626943b5526894bc6
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-15T10:44:54Z

    CLOUDSTACK-8952 - Make sure we restart dnsmasq if the configuration file changes
    
       - It was working before because the Routers were restarting about 10 times for each operation
         e.g. adding a VM to a network or acquiring a new IP.
       - Adding stat_rules of internal LB to iptables
         We needed one extra rule in the INPUT chain

commit 2b286ecd730763a472fff2071a8fd7166692e11f
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-15T14:43:29Z

    CLOUDSTACK-8952 - Make sure the calls to CsFile use the new logic of commit/is_changed methods
    
       - We now have to check if the file changed before committing. It doesn't make sense to write to disk if there was no change.
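
A check-before-commit pattern along these lines could look like the sketch below. It is not the real CsFile class: `TrackedFile`, its attributes and the placeholder option line are hypothetical, and only the method names `is_changed` and `commit` are taken from the commit message above.

```python
import os
import tempfile


class TrackedFile:
    """Hypothetical stand-in for a config-file wrapper; not the real CsFile."""

    def __init__(self, path):
        self.path = path
        self.on_disk = self._read()      # content as last seen on disk
        self.lines = list(self.on_disk)  # working copy that callers modify

    def _read(self):
        if not os.path.exists(self.path):
            return []
        with open(self.path) as handle:
            return handle.read().splitlines()

    def is_changed(self):
        """True when the working copy differs from what is on disk."""
        return self.lines != self.on_disk

    def commit(self):
        """Write the working copy only if it changed; report whether it wrote."""
        if not self.is_changed():
            return False
        with open(self.path, "w") as handle:
            handle.write("\n".join(self.lines) + "\n")
        self.on_disk = list(self.lines)
        return True


with tempfile.TemporaryDirectory() as workdir:
    conf = TrackedFile(os.path.join(workdir, "example.conf"))
    conf.lines.append("placeholder-option=1")  # not a real setting
    first_commit = conf.commit()   # content changed, so the file is written
    second_commit = conf.commit()  # nothing changed, so nothing is written
```

Returning a boolean from `commit` lets the caller decide whether a dependent service needs a restart, instead of restarting it unconditionally.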

commit c7671f3cdd4cb1b52ff44b44288cb843098bccde
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-15T16:31:03Z

    CLOUDSTACK-8952 - Restart dnsmasq every time configure.py runs

commit 41f4d8b58a337dc97526f2acb551c854b3432177
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-16T09:55:31Z

    CLOUDSTACK-8952 - Make the check for master more reliable
    
       - Do not use the API call because it will read what is in the database, which might not have been updated yet
         * Check the status in the router directly instead
       - Remove all the sleeps

commit 5b3c99031ffa1e2f73fc839d054cb88f6abd802b
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-17T06:09:52Z

    CLOUDSTACK-8952 - Do not rely on the router state in the json file to report back to ACS
    
       - If we stop/start a router, the state in the file will still say MASTER, when it is actually not
       - Checking the state based on the interface (eth1) state
       - Once master.py is called by keepalived, save the state in the json file to BACKUP just to make sure it's also written there
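
One way to derive the role from the interface state instead of the json file is sketched below. It is an assumption-laden illustration, not the logic in the PR: the helper names `link_state` and `router_role` are hypothetical, the sample lines only imitate the first line of `ip link show eth1` output, and the mapping of anything other than UP to BACKUP is a simplification.

```python
def link_state(ip_link_line):
    """Extract the word following 'state' from one line of `ip link` output."""
    tokens = ip_link_line.split()
    if "state" in tokens:
        index = tokens.index("state")
        if index + 1 < len(tokens):
            return tokens[index + 1]
    return "UNKNOWN"


def router_role(ip_link_line):
    """Assumption: the public interface is UP only on the master router."""
    return "MASTER" if link_state(ip_link_line) == "UP" else "BACKUP"


# Illustrative sample lines, not captured from a real router.
up_line = ("3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 "
           "qdisc pfifo_fast state UP mode DEFAULT group default qlen 1000")
down_line = ("3: eth1: <BROADCAST,MULTICAST> mtu 1500 "
             "qdisc pfifo_fast state DOWN mode DEFAULT group default qlen 1000")

role_when_up = router_role(up_line)
role_when_down = router_role(down_line)
```

Since the PR description notes that the RVR public interface is eth2 rather than eth1, a real check would also need to take the interface name into account.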

commit 2a747ca73538325fb24b3eefb95197bc1f8c6222
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-17T10:09:26Z

    CLOUDSTACK-8952 - Reduce retries from 20 to 5
    
       - We do not need to retry that much

commit 38d03576d61d1ddac8f29b962d9d30bc45d7a39b
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-17T12:47:05Z

    CLOUDSTACK-8952 - Make the tests rely on the interface state rather than the json file

commit fb33cb28aba7bfc829651e8881a9a6afa6a70a76
Author: Wilder Rodrigues <wr...@schubergphilis.com>
Date:   2015-10-17T12:48:08Z

    CLOUDSTACK-8952 - Make the checkrouter.sh compatible with RVR as well

----


> The redundant routers are facing a race condition due to several KeepaliveD/ConntrackD restarts
> -----------------------------------------------------------------------------------------------
>
>                 Key: CLOUDSTACK-8952
>                 URL: https://issues.apache.org/jira/browse/CLOUDSTACK-8952
>             Project: CloudStack
>          Issue Type: Bug
>      Security Level: Public(Anyone can view this level - this is the default.) 
>          Components: Virtual Router
>    Affects Versions: 4.6.0
>            Reporter: Wilder Rodrigues
>            Assignee: Wilder Rodrigues
>            Priority: Critical
>             Fix For: 4.6.0
>
>
> In the CsRedundant.py we have a line doing:
> proc = CsProcess(['/usr/sbin/keepalived', '--vrrp'])
> However, the CsProcess cannot find a process with the string search "--vrrp", which makes it always return false and restart keepalived.
> Due to the restart, the routers start a race condition to become master, which makes network features unavailable.
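
The failure mode described in the issue can be illustrated with a small sketch. It assumes the lookup only succeeds when every search term appears in a process's command line; `find_process` and the sample process list are hypothetical and do not reproduce the real CsProcess implementation.

```python
def find_process(search_terms, command_lines):
    """Return True if any command line contains every search term.

    Hypothetical stand-in for the lookup; the real CsProcess may differ.
    """
    return any(
        all(term in line for term in search_terms)
        for line in command_lines
    )


# Hypothetical process table: keepalived runs without a '--vrrp' argument.
running = ["/usr/sbin/keepalived", "/usr/sbin/conntrackd -d"]

# The old search can never match, so the caller always concludes that
# keepalived is down and restarts it.
old_match = find_process(["/usr/sbin/keepalived", "--vrrp"], running)

# Dropping the extra term lets the running daemon be found.
new_match = find_process(["/usr/sbin/keepalived"], running)
```

Under that assumption the extra term turns every check into a false negative, which matches the repeated restarts reported above.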


