Posted to dev@cloudstack.apache.org by Rene Moser <ma...@renemoser.net> on 2018/02/20 12:46:02 UTC

[4.11] Management to VR connection issues

Hi

We upgraded from 4.9 to 4.11 on VMware 6.5.0 (testing environment).

The VR upgrade went through, but we noticed that the communication
between the management server and the VR is not working properly.

We do not yet fully understand the issue. One thing we noted is that the
network configs do not seem to be bound to the same interfaces after
every reboot. As a result, after one reboot you can connect to the VR by
SSH, and after another reboot you can't anymore.

The network name eth0 switched from NIC id 3 to 4 after a reboot.
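
A minimal sketch (not part of the original mail) of one way to confirm such a switch from inside the VR: record which MAC address sits behind each interface name before a reboot and compare it with the mapping afterwards. It assumes a Linux guest that exposes /sys/class/net; the function names are made up for illustration.

```python
import os


def read_nic_macs(sys_net="/sys/class/net"):
    """Return {interface name: MAC address} as seen by the running kernel."""
    macs = {}
    if not os.path.isdir(sys_net):
        return macs  # not a Linux guest, nothing to report
    for name in sorted(os.listdir(sys_net)):
        try:
            with open(os.path.join(sys_net, name, "address")) as fh:
                macs[name] = fh.read().strip()
        except OSError:
            continue  # entries without an "address" file are skipped
    return macs


def moved_interfaces(before, after):
    """Return {name: (old MAC, new MAC)} for every name whose MAC changed."""
    return {
        name: (before.get(name), after.get(name))
        for name in sorted(set(before) | set(after))
        if before.get(name) != after.get(name)
    }
```

Saving the output of read_nic_macs() before the reboot (for example as JSON) and feeding both snapshots to moved_interfaces() afterwards would show whether eth0 really ended up on a different adapter.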

The VR is kept in "starting" state. As a consequence, we get many
issues related to this: no VM deployments (kept in starting state),
VM expunging failures (cleanup fails), and so on.

Has anyone experienced similar issues?

Regards
René

Re: [4.11] Management to VR connection issues

Posted by Rene Moser <ma...@renemoser.net>.

Hi Paul

On 02/26/2018 11:35 AM, Paul Angus wrote:
> Rene,
> Have you checked the OS getting applied on vCenter?

It's "Other 3.x or later Linux (64-bit)" also see
https://photos.google.com/share/AF1QipPNnFnP8xMIHgYCQ0rZtDsyeVGoJIyHjPqWP8BP-hiNRnd0CxHuc8xn5GetIrpocQ/photo/AF1QipPIeKUnVU4hQ8_AuNT8galxvDyPsMqjkMkIqKPV?key=WmlJRUhMNnh3cXZheFp0WDZuMFZtTmpOQlo2Y2NR

RE: [4.11] Management to VR connection issues

Posted by Paul Angus <pa...@shapeblue.com>.

Rene,
Have you checked the OS getting applied on vCenter?
A lot of the issues went away once I changed the OS when testing over the weekend.

Kind regards,

Paul Angus

paul.angus@shapeblue.com
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS UK
@shapeblue


-----Original Message-----
From: Rene Moser [mailto:mail@renemoser.net] 
Sent: 26 February 2018 10:22
To: users@cloudstack.apache.org; dev@cloudstack.apache.org
Subject: Re: [4.11] Management to VR connection issues

Hi again

We found the main problem.

== cloud-postinit hang

With many iptables rules, cloud-postinit hung for 10 minutes until it was killed by systemd. As a result, the ssh daemon was not started for 10 minutes, because it is configured to start after cloud-postinit.

It seems the issue was already fixed by
https://github.com/apache/cloudstack/commit/ce67726c6d3db6e7db537e76da6217c5d5f4b10e

== VR still needs manual reboot

However, we still notice adapter changes after a reboot: see the before/after screenshots of "ip addr" in https://photos.app.goo.gl/9XsjOJjLqQ9SRjYV2. We still need to manually reboot the VR to get the network actually working.

== VR has too many adapters?

Next, we noticed that there are many network adapters (NICs) for this non-VPC router (see the screenshot of vCenter in https://photos.app.goo.gl/9XsjOJjLqQ9SRjYV2). Adapters 4 and 5 seem unnecessary. Any comments on that?

== VR with 256 MB RAM does not work

The next issue we found is that the VR must have more than 256 MB RAM.
Otherwise systemd complains that the daemon cannot be reloaded, because the RAM disk at /run has too little space.

Feb 23 16:24:36 r-413-VM postinit.sh[1089]: Failed to reload daemon:
Refusing to reload, not enough space available on /run/systemd.
Currently, 8.6M are free, but a safety buffer of 16.0M is enforced.
root@r-413-VM:~# df -h /run/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            16M  7.2M  8.7M  46% /run

Increasing to 512 MB RAM helped:

root@r-413-VM:~# df -h /run/
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            41M  7.8M   34M  19% /run

We are unsure whether this can be tuned at the systemd level; we haven't found a way yet.
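
As a rough illustration of the check behind that log message, here is a small sketch that compares the free space on /run with the 16.0M safety buffer quoted above. The function names are hypothetical, and the exact comparison systemd performs internally is an assumption; only the 16.0M figure comes from the log.

```python
import shutil

# 16.0M safety buffer, taken from the "Refusing to reload" log line above.
SAFETY_BUFFER_BYTES = 16 * 1024 * 1024


def reload_would_be_refused(free_bytes, buffer_bytes=SAFETY_BUFFER_BYTES):
    """Return True if the free space is below the enforced safety buffer."""
    return free_bytes < buffer_bytes


def free_bytes_on(path="/run"):
    """Return the free bytes on the filesystem holding path (like df's Avail)."""
    return shutil.disk_usage(path).free
```

With the numbers from the two df outputs, roughly 8.7M available would be refused and 34M would pass, which matches what we saw with 256 MB and 512 MB RAM.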

== VR API Command timeouts

When executing commands related to the VR (e.g. restart network, start/stop router), the command does not reach the vCenter API and times out. We are not yet sure why.

== VR minor fixes

We also fixed two minor things along the way.

* rsyslogd config syntax issue
* IMHO we should start apache2 also after cloud-postinit

Also see https://github.com/apache/cloudstack/pull/2468

Regards
René

Re: [4.11] Management to VR connection issues

Posted by Rene Moser <ma...@renemoser.net>.

On 02/26/2018 12:41 PM, Rohit Yadav wrote:

> - If waiting for ssh and apache2 as part of post-init solves the issue, this would require a new systemvmtemplate as the systemd scripts cannot be changed or take effect during first boot.

Waiting for ssh was not the issue; it was a result.

The hang of cloud-postinit, caused by p.wait() when there are a ton of
iptables rules, was the issue. But this is already addressed, so it
should be fine.
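
For readers unfamiliar with this kind of hang, here is a sketch of the general pitfall, assuming it was the usual pipe-buffer deadlock (the linked commit is the authority on what was actually changed). When a child process writes more output than the OS pipe buffer holds and the parent calls wait() before reading, both sides block. The helper name and the stand-in command are made up for illustration.

```python
import subprocess
import sys


def run_and_capture(cmd):
    """Run cmd and return (returncode, output) without risking a pipe deadlock."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # communicate() keeps draining the pipe while it waits for the child.
    # Calling proc.wait() here instead can block forever once the child
    # has filled the pipe buffer and nobody is reading from it.
    out, _ = proc.communicate()
    return proc.returncode, out


if __name__ == "__main__":
    # Stand-in for a command with a lot of output, such as a large rule dump.
    rc, out = run_and_capture([sys.executable, "-c", "print('x' * 200000)"])
    print(rc, len(out))
```

The more rules there are, the more output the child produces, which would explain why only routers with many iptables rules were affected.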

A systemctl list-jobs no longer shows any pending jobs ("no pending
jobs"), so the boot has completed.

After that, the VR should be accessible by SSH (3922) from the
management server, right? But it is not.

Did you see the changes after a reboot (please compare the screenshots
of the ip addr output I sent)? After that reboot/network change, SSH
works...

> - I think the additional nics always used to show up for vmware, there is a global setting to configure this (extra nics for vmware, probably because older versions did not support dynamic nic addition on vmware vrs).

On 4.5.2, we only see 4 NICs. In 4.11 we see 5 of them. We were just
wondering if this could result in an issue. What global setting would
that be?

> - For VR timeouts, see logs and check if from management server host you're able to SSH into the VR using the private IP and port 3922. See the troubleshooting wiki: https://cwiki.apache.org/confluence/display/CLOUDSTACK/SSVM%2C+templates%2C+Secondary+storage+troubleshooting

Yes, after a manual reboot of the VR, we can SSH in, as I wrote. Without
a reboot of the VR, we get a "no route to host". So it seems not even an
arp ping is working.
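
A small sketch of how the two cases could be told apart from the management server: it attempts a TCP connect to the VR on port 3922 and reports whether the port is open, refused, timed out, or unreachable ("no route to host"). The function names are hypothetical; only the port number comes from this thread.

```python
import errno
import socket


def classify_connect_error(exc):
    """Map a failed connect attempt to a short, human-readable reason."""
    if isinstance(exc, socket.timeout):
        return "timeout"
    if isinstance(exc, ConnectionRefusedError):
        return "refused"
    if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
        return "no route to host"
    return "error: %s" % exc


def probe_tcp(host, port=3922, timeout=5.0):
    """Try a TCP connect to host:port and report what happened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except OSError as exc:
        return classify_connect_error(exc)
```

Calling probe_tcp() with the VR's private IP before and after the manual reboot should return "no route to host" in the first case and "open" in the second, if the behaviour is as described above.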

> - Can you share/check which processes are consuming the RAM, 256MB ram is usually enough for non-redundant VRs. (share output of top or check using htop?). Make sure to use a latest Linux version (any Debian variant such as Debian 8, 9 or Ubuntu 16.04+ may also work). The issue is vCenter/ESXi 6.5 for some reason, gives lower RAM compared to 6.0 and 5.5 and has poor support for legacy os. I had faced/found this issue while testing redundant VRs which take more RAM usually than normal VRs.

Using the shapeblue VR template (your template ;))

See the man page:
https://manpages.debian.org/stretch/initscripts/tmpfs.5.en.html

Unfortunately, only an fstab entry worked for me; setting
/etc/default/tmpfs didn't.

https://github.com/apache/cloudstack/pull/2468/commits/bd882a8f80763595a89a3b74330500e1965bfda3

Re: [4.11] Management to VR connection issues

Posted by Rohit Yadav <ro...@shapeblue.com>.

Hi Rene,


- I think on the general issue of slow iptables rules application, we need to fix that. Does it help to increase aggregation timeouts?


- If waiting for ssh and apache2 as part of post-init solves the issue, this would require a new systemvmtemplate as the systemd scripts cannot be changed or take effect during first boot.


- I think the additional nics always used to show up for vmware, there is a global setting to configure this (extra nics for vmware, probably because older versions did not support dynamic nic addition on vmware vrs).


- For VR timeouts, see logs and check if from management server host you're able to SSH into the VR using the private IP and port 3922. See the troubleshooting wiki: https://cwiki.apache.org/confluence/display/CLOUDSTACK/SSVM%2C+templates%2C+Secondary+storage+troubleshooting


- Can you share/check which processes are consuming the RAM, 256MB ram is usually enough for non-redundant VRs. (share output of top or check using htop?). Make sure to use a latest Linux version (any Debian variant such as Debian 8, 9 or Ubuntu 16.04+ may also work). The issue is vCenter/ESXi 6.5 for some reason, gives lower RAM compared to 6.0 and 5.5 and has poor support for legacy os. I had faced/found this issue while testing redundant VRs which take more RAM usually than normal VRs.


- Rohit

rohit.yadav@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London WC2N 4HS UK
@shapeblue

Re: [4.11] Management to VR connection issues

Posted by Rohit Yadav <ro...@shapeblue.com>.

Hi Rene,

Paul is correct, for default VMware systemvm I had fixed it here:

<https://github.com/apache/cloudstack/blob/master/engine/schema/src/main/resources/META-INF/db/schema-41000to41100.sql#L403>

https://github.com/apache/cloudstack/blob/4.11/engine/schema/resources/META-INF/db/schema-41000to41100.sql#L403

But the above would have worked only for new installations; for upgraded ones we'll need to fix the release notes to ask users/admins to select 'Other Linux 64-bit'. Can you try that and share if that works for you?

I also checked, we're still using the 6.0 sdk jars. That needs to be fixed as well.

- Rohit

________________________________
From: Paul Angus
Sent: Sunday, February 25, 2018 8:57:55 AM
To: dev@cloudstack.apache.org; users@cloudstack.apache.org
Cc: Rohit Yadav
Subject: RE: [4.11] Management to VR connection issues

Hey Rene.

Can you check the OS type that has been applied to your system VM template?
I found that mine were coming up as 32-bit Debian 5, making them go REALLY slow, and if there are rules applied to the firewall it takes forever to provision. Switching the guest OS fixed it.

If you use Linux (other 64), which is guestos_id 99, they run properly.

I suspect that the VMware 6.5 mappings are failing when they aren’t supported by the 6.0 SDK which we use, but I'll need to get that verified....

I think that we should have an 'ACS SystemVM' guest OS, which we can map to the best performing guest OS for each hypervisor version.

VP Technology
paul.angus@shapeblue.com
www.shapeblue.com

-----Original Message-----
From: Rene Moser [mailto:mail@renemoser.net]
Sent: 22 February 2018 16:27
To: users@cloudstack.apache.org; dev@cloudstack.apache.org
Subject: Re: [4.11] Management to VR connection issues

On 02/20/2018 08:04 PM, Rohit Yadav wrote:
> Hi Rene,
>
>
> Thanks for sharing - I've not seen this in test/production environment yet. Does it help to destroy the VR and check if the issue persists? Also, is this behaviour system-wide for every VR, or VRs of specific networks or topologies such as VPCs? Are these VRs redundant in nature?

We have non-redundant VRs, and we haven't looked at VPC routers yet.

The current analysis shows the following:

1. Started the process to upgrade an existing router.
2. Router gets destroyed and re-deployed with new template 4.11 as expected.
3. Router OS has started, but the ACS router state stays "starting". When we log in by console, we see some actions in the cloud.log. At this point, the router is left in this state and gets destroyed after the job timeout.
4. We reboot manually on the OS level. VR gets rebooted.
5. After the OS has booted, ACS Router state switches to "Running".
6. We can log in by ssh. However, the ACS router still shows "requires upgrade" (but the OS has already booted with template 4.11).
7. When we upgrade, the same process happens again (points 1-3). Feels like a deadlock.

Logs:
https://transfer.sh/DdTtH/management-server.log.gz

We continue our investigations

Regards
René

Re: [4.11] Management to VR connection issues

Posted by Rene Moser <ma...@renemoser.net>.

On 02/20/2018 08:04 PM, Rohit Yadav wrote:
> Hi Rene,
> 
> 
> Thanks for sharing - I've not seen this in test/production environment yet. Does it help to destroy the VR and check if the issue persists? Also, is this behaviour system-wide for every VR, or VRs of specific networks or topologies such as VPCs? Are these VRs redundant in nature?

We have non-redundant VRs, and we haven't looked at VPC routers yet.

The current analysis shows the following:

1. Started the process to upgrade an existing router.
2. Router gets destroyed and re-deployed with the new 4.11 template, as
expected.
3. Router OS has started, but the ACS router state stays in "starting".
When we log in via the console, we see some activity in cloud.log. At
this point, the router is left in this state and gets destroyed after
the job timeout.
4. We reboot manually at the OS level. VR gets rebooted.
5. After the OS has booted, the ACS router state switches to "Running".
6. We can log in by SSH. However, the ACS router still shows
"requires upgrade" (although the OS has already booted with the 4.11
template).
7. When we upgrade, the same process happens again (points 1-3). Feels
like a deadlock.

Logs:
https://transfer.sh/DdTtH/management-server.log.gz

We are continuing our investigation.

Regards
René

Re: [4.11] Management to VR connection issues

Posted by Rohit Yadav <ro...@shapeblue.com>.

Hi Rene,


Thanks for sharing - I've not seen this in test/production environment yet. Does it help to destroy the VR and check if the issue persists? Also, is this behaviour system-wide for every VR, or VRs of specific networks or topologies such as VPCs? Are these VRs redundant in nature?


4.11+ VRs are systemd-enabled and don't reboot after patching, which is a major difference between 4.9 and 4.11 systemvms/VRs. To make this work on VMware, when the NICs come up we use a hack (in place since at least 4.6) to ping the interfaces/gateways:

https://github.com/apache/cloudstack/blob/4.11/systemvm/debian/opt/cloud/bin/setup/common.sh#L335

After the NIC/MAC address configuration changed, 4.9 and earlier VRs used to reboot (i.e. 4.9 and earlier VRs on VMware rebooted twice: once after patching and once more to reconfigure the NIC-to-MAC assignments). 4.11+ VRs don't reboot at all but use udevadm for NIC/MAC/interface configuration:

https://github.com/apache/cloudstack/blob/4.11/systemvm/debian/opt/cloud/bin/setup/router.sh#L62
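
To illustrate the general idea of pinning interface names to MAC addresses with udev, here is a minimal Python sketch that builds rules in the classic 70-persistent-net.rules format. It is not taken from router.sh: the function names and the example MAC addresses are hypothetical, and the real script may generate its rules differently.

```python
def persistent_net_rule(mac, name):
    """Build one udev rule that pins the NIC with this MAC address to a fixed name."""
    # sysfs reports MAC addresses in lowercase, so normalise before matching.
    return (
        'SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*", '
        f'ATTR{{address}}=="{mac.lower()}", NAME="{name}"'
    )


def persistent_net_rules(mac_by_name):
    """Build the text of a rules file from a {interface name: MAC address} mapping."""
    lines = [
        persistent_net_rule(mac, name)
        for name, mac in sorted(mac_by_name.items())
    ]
    return "\n".join(lines) + "\n"


# Hypothetical example values (locally administered MAC addresses).
EXAMPLE_RULES = persistent_net_rules({
    "eth1": "02:00:00:AA:BB:02",
    "eth0": "02:00:00:AA:BB:01",
})
```

With rules like these in place, an interface name should follow the MAC address rather than the order in which the hypervisor presents the NICs. After writing such a file, udev would typically be asked to re-read it with `udevadm control --reload-rules` followed by `udevadm trigger`.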

So you may try two tests and see if either makes a difference with respect to the code mentioned above: (a) increase the timeout/ping retries, and (b) reboot after the udev/MAC address configuration (which would only require rebuilding the systemvm.iso file and scp-ing it to the secondary storage in your test environment).
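
For test (a), the following is a minimal Python sketch of what a ping loop with configurable retries looks like. It is only an illustration: the actual logic lives in the shell script linked above and is not reproduced here. The function names, the default values, and the Linux `ping -c/-W` options are assumptions.

```python
import subprocess
import time


def ping_once(host, timeout_s=1):
    """Return True if a single ICMP echo to host succeeds (assumes Linux ping)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def wait_until_reachable(host, retries=10, delay_s=1.0,
                         probe=ping_once, sleep=time.sleep):
    """Probe host up to `retries` times, pausing `delay_s` between attempts.

    Returns the 1-based attempt number that succeeded, or None if every
    attempt failed. `probe` and `sleep` are injectable to allow testing
    without real network traffic.
    """
    for attempt in range(1, retries + 1):
        if probe(host):
            return attempt
        if attempt < retries:
            sleep(delay_s)
    return None
```

Raising `retries` or `delay_s` gives the NICs more time to come up before the check gives up, which is the knob test (a) is about. For example, `wait_until_reachable("192.0.2.1", retries=30)` would keep probing a (hypothetical) gateway address for roughly a minute.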

Finally, if you can share logs or other details about the test setup and environment, I can help you with some investigations.


- Rohit

<https://cloudstack.apache.org>



________________________________
From: Rene Moser <ma...@renemoser.net>
Sent: Tuesday, February 20, 2018 1:46:02 PM
To: users@cloudstack.apache.org; dev@cloudstack.apache.org
Subject: [4.11] Management to VR connection issues

Hi

We upgraded from 4.9 to 4.11. VMware 6.5.0. (Testing environment).

VR upgrade went through. But we noticed that the communication between
the management server and the VR is not working properly.

We do not yet fully understand the issue. One thing we noted is that the
network configs seem not to be bound to the same interfaces after every
reboot. As a result, after one reboot you may be able to connect to the
VR by SSH, and after another reboot you can't anymore.

The network name eth0 switched from NIC id 3 to 4 after a reboot.

The VR is kept in "starting" state. As a consequence, we get
many related issues: no VM deployments (kept in starting state),
VM expunging failures (cleanup fails), and so on.

Has anyone experienced similar issues?

Regards
René

rohit.yadav@shapeblue.com 
www.shapeblue.com
53 Chandos Place, Covent Garden, London  WC2N 4HS UK
@shapeblue
