Posted to builds@apache.org by Lance Albertson <la...@osuosl.org> on 2019/05/16 15:58:27 UTC

[osuosl-openpower] Ongoing VM network connectivity issues since Pike upgrade

All,

Since the upgrade to Pike we've noticed virtual machines suddenly losing
network connectivity. The issue sometimes fixes itself, and it also clears
when we restart the neutron-linuxbridge-agent service on the hypervisors. We
are doing our best to track down why this is happening and how to fix it.
Since we're not monitoring every host on the cluster, it's difficult for us
to know when it happens, so if you do have a problem with one of your VMs,
please let us know either via IRC in #osuosl on Freenode or via a support
email.

I'll be sending further updates as we have them.

Thanks for your patience!

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [osuosl-openpower] Ongoing VM network connectivity issues since Pike upgrade

Posted by Lance Albertson <la...@osuosl.org>.
All,

I wanted to send you yet another update regarding this networking issue. I
have been unable to find a solution to the problem and have decided to move
forward with the upgrade to Queens to see if the issue is resolved. I will
be sending an email shortly regarding when we'll do this upgrade.

Thanks-

On Tue, May 21, 2019 at 12:16 PM Lance Albertson <la...@osuosl.org> wrote:

> All,
>
> I wanted to send you an update on where we are on this issue. So far
> I've narrowed the problem down to the removal of a VM that uses a private
> network, which causes certain iptables rules on the hypervisor to get
> out of order. It only seems to affect inbound connections to the VM, as
> outbound connections still seem to work. Unfortunately, I haven't been
> able to reproduce the issue easily, which makes it difficult to
> troubleshoot. I've looked through the source code and searched online to
> see if anyone else had run into this, without success.
>
> I've rebooted all of the hypervisors on our x86 cluster and two on our ppc
> cluster (which was needed for the MDS updates). So far we haven't seen any
> issues on the nodes that have been rebooted, but I need to let those run
> for a few days to confirm that the reboot actually helps. These machines
> were also due for a reboot because of the CentOS 7.5 -> 7.6 upgrade, so
> perhaps it's related to that.
>
> At any rate, I've deployed a temporary cronjob on the nodes that haven't
> been rebooted which should "fix" the networking issue. I have it set to run
> every minute so that the downtime should be minimal.
>
> I'll send another update as I have one.
>
> Thanks-
>
> On Thu, May 16, 2019 at 8:58 AM Lance Albertson <la...@osuosl.org> wrote:
>
>> All,
>>
>> Since the upgrade to Pike we've noticed virtual machines suddenly losing
>> network connectivity. The issue sometimes fixes itself, and it also clears
>> when we restart the neutron-linuxbridge-agent service on the hypervisors.
>> We are doing our best to track down why this is happening and how to fix it.
>> Since we're not monitoring every host on the cluster, it's difficult for us
>> to know when it happens so if you do have a problem with one of your VMs,
>> please let us know either via IRC in #osuosl on Freenode, or via a support
>> email.
>>
>> I'll be sending further updates as we have them.
>>
>> Thanks for your patience!
>>
>> --
>> Lance Albertson
>> Director
>> Oregon State University | Open Source Lab
>>
>
>
> --
> Lance Albertson
> Director
> Oregon State University | Open Source Lab
>


-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [osuosl-openpower] Ongoing VM network connectivity issues since Pike upgrade

Posted by Lance Albertson <la...@osuosl.org>.
Sending another update on this to just the OpenPOWER cluster users.

Unfortunately, this problem is still happening, even on the nodes that have
been rebooted. To reboot the nodes, I need to live migrate all of the VMs
onto other nodes. Normally this isn't an issue, but I noticed yesterday that
some instances were failing during the migration. Upon further
investigation, I found that some of these VMs were running on both the
old and new nodes at the same time, which is not a good thing for file
systems. I've already fixed a few VMs that were in this state, but I'm still
working through others that might still be in a bad state. I'm forcing a
filesystem check in rescue mode before booting the systems. I'll let you
know once I'm done going through all the VMs in case I missed anything. In
the meantime, I'm not going to do any more live migrations until I get this
resolved.
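
For illustration only: the thread does not say how the doubly-running VMs were identified. A minimal sketch of one way to spot them, assuming you have already collected the IDs of the instances running on each hypervisor (the host names, instance IDs, and the function name below are hypothetical):

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def find_duplicates(running: Dict[str, Iterable[str]]) -> Dict[str, List[str]]:
    """Map each instance ID seen on more than one host to the sorted list of those hosts.

    `running` maps a hypervisor name to the instance IDs reported as running
    there. How that data is gathered is outside this sketch.
    """
    hosts_by_instance = defaultdict(list)
    for host, instances in running.items():
        # set() guards against the same ID being listed twice for one host.
        for instance in set(instances):
            hosts_by_instance[instance].append(host)
    return {
        instance: sorted(hosts)
        for instance, hosts in hosts_by_instance.items()
        if len(hosts) > 1
    }


# Hypothetical inventory: "vm-b" shows up on both nodes after a failed migration.
inventory = {"node1": ["vm-a", "vm-b"], "node2": ["vm-b", "vm-c"]}
print(find_duplicates(inventory))
```

Any instance this reports would need to be stopped on one side and checked before it is trusted again.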

On the original issue, I unfortunately have not made any progress on
narrowing down the cause. One option is to go ahead with the Queens upgrade
to see whether the problem persists. But I'd feel much better if I got this
fixed before we attempted the upgrade.

I'll continue looking into this this week.

Thanks for your patience.


-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [osuosl-openpower] Ongoing VM network connectivity issues since Pike upgrade

Posted by Lance Albertson <la...@osuosl.org>.
All,

I wanted to send you an update on where we are on this issue. So far
I've narrowed the problem down to the removal of a VM that uses a private
network, which causes certain iptables rules on the hypervisor to get
out of order. It only seems to affect inbound connections to the VM, as
outbound connections still seem to work. Unfortunately, I haven't been
able to reproduce the issue easily, which makes it difficult to
troubleshoot. I've looked through the source code and searched online to
see if anyone else had run into this, without success.
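
For illustration only: one way to check for a reordering like this is to compare two snapshots of the ruleset taken before and after a VM is removed. This is a minimal sketch, assuming the snapshots are text dumps in the `iptables-save` format, where appended rules appear as `-A <chain> ...` lines. The chain and target names in the sample are hypothetical, not the ones Neutron actually generates.

```python
from typing import List


def chain_rules(dump: str, chain: str) -> List[str]:
    """Return the rules appended to `chain` in an iptables-save style dump, in order."""
    prefix = f"-A {chain} "
    stripped = (line.strip() for line in dump.splitlines())
    return [line for line in stripped if line.startswith(prefix)]


def reordered(before: List[str], after: List[str]) -> bool:
    """True if rules present in both snapshots appear in a different relative order.

    Rules that were only added or only removed are ignored, so normal churn
    from creating or deleting a VM does not count as a reordering.
    """
    common = set(before) & set(after)
    return [r for r in before if r in common] != [r for r in after if r in common]


# Hypothetical snapshots: the two FORWARD rules swap places.
BEFORE = "*filter\n-A FORWARD -j example-a\n-A FORWARD -j example-b\nCOMMIT\n"
AFTER = "*filter\n-A FORWARD -j example-b\n-A FORWARD -j example-a\nCOMMIT\n"

print(reordered(chain_rules(BEFORE, "FORWARD"), chain_rules(AFTER, "FORWARD")))
```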

I've rebooted all of the hypervisors on our x86 cluster and two on our ppc
cluster (which was needed for the MDS updates). So far we haven't seen any
issues on the nodes that have been rebooted, but I need to let those run
for a few days to confirm that the reboot actually helps. These machines
were also due for a reboot because of the CentOS 7.5 -> 7.6 upgrade, so
perhaps it's related to that.

At any rate, I've deployed a temporary cronjob on the nodes that haven't
been rebooted, which should "fix" the networking issue. I have it set to run
every minute so that any downtime should be minimal.
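
For illustration only: the cronjob itself is not included in this thread. This is a minimal sketch of a workaround of this kind, assuming it restarts the neutron-linuxbridge-agent service (the restart mentioned earlier in the thread) whenever an inbound TCP probe to a VM fails. The probe target, the script name, and the use of `systemctl` are assumptions.

```python
#!/usr/bin/env python3
import socket
import subprocess
import sys

# Service named in this thread; managing it via systemctl is an assumption.
AGENT_UNIT = "neutron-linuxbridge-agent"


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def restart_agent() -> None:
    """Restart the agent; raises CalledProcessError if the restart fails."""
    subprocess.run(["systemctl", "restart", AGENT_UNIT], check=True)


def main(host: str, port: int) -> int:
    # Only restart when the inbound probe fails, to avoid needless restarts.
    if not port_open(host, port):
        restart_agent()
    return 0


# Only act when invoked as: check_vm_net.py HOST PORT
if __name__ == "__main__" and len(sys.argv) == 3:
    sys.exit(main(sys.argv[1], int(sys.argv[2])))
```

A crontab entry with the schedule `* * * * *` would run such a script once a minute, matching the interval described above.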

I'll send another update as I have one.

Thanks-


-- 
Lance Albertson
Director
Oregon State University | Open Source Lab