Posted to builds@apache.org by Lance Albertson <la...@osuosl.org> on 2022/09/13 18:08:41 UTC

Re: [Hosting] Core network switch reboot

All,

I wanted to pass along more information on where we're at and our current
plans to try and work around this issue.

Without going deep into the history of our core network infrastructure, we
have two core "routers" that are both aging and we're in the process of
replacing them with something newer.

Previously, our uplink was connected through our Cisco 6509. This switch
has several 1G line cards that half of our servers are directly connected
to.

The other core switch is a Cisco Nexus 6001 which has three fabric
extenders which provide 1G connectivity to the other half of our servers.
When we migrated over to the LinkOregon network, we moved the uplink over
to this Nexus 6k as it was much easier to get LR optics for it.

Unfortunately, this Nexus 6k has kernel panicked and rebooted multiple
times over the past several months, causing these outages. Many of our
downlink 10G switches are connected to this Nexus 6k, which means there's a
larger impact when it goes down.

A few years ago a high speed trading company donated a pallet full of
Arista switches to us, and I've been slowly adding them to our infrastructure. Even
though they are EOL, they still work very well and we haven't had any
problems with them. And since I have a lot of them, I can easily replace
one if one goes bad.

My current plan is to set up one of these Arista switches and move all of
the current 10G connections to it. This way, at least we can reduce the
impact if/when this Nexus 6k switch reboots again. In theory, another
reboot should then only affect the servers directly connected to the FEX
switches.
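
As a rough illustration of that reasoning, here is a minimal Python sketch
that models the network as a graph and lists what loses its path to the
uplink when the Nexus 6k is down. The node names and the exact links are
assumptions made for illustration only (in particular, how the Cisco 6509
and the Nexus 6k attach after the move is not stated in this thread).

```python
from collections import deque


def reachable(links, start, failed):
    """Return every node reachable from `start` while `failed` is down."""
    if start == failed:
        return set()
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency.get(node, ()):
            if neighbor != failed and neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def cut_off(links, uplink, failed):
    """Nodes (other than the failed one) that lose their path to the uplink."""
    nodes = {node for link in links for node in link}
    return nodes - reachable(links, uplink, failed) - {failed}


# Hypothetical topology before the change: the uplink lands on the Nexus 6k.
BEFORE = [
    ("linkoregon", "nexus6k"),
    ("nexus6k", "fex"),
    ("nexus6k", "10g-switches"),
    ("nexus6k", "c6509"),  # assumption: the 6509 reaches the uplink via the Nexus
]

# Hypothetical topology after the change: the uplink and 10G links land on the Arista.
AFTER = [
    ("linkoregon", "arista"),
    ("arista", "10g-switches"),
    ("arista", "c6509"),    # assumption
    ("arista", "nexus6k"),  # assumption
    ("nexus6k", "fex"),
]

if __name__ == "__main__":
    print(sorted(cut_off(BEFORE, "linkoregon", "nexus6k")))
    print(sorted(cut_off(AFTER, "linkoregon", "nexus6k")))
```

Under these assumed links, a Nexus 6k reboot cuts off everything behind it
before the change, and only the FEX-attached servers afterwards.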

I reached out to the OSU IT community and they graciously donated two
10G-LR optical modules so that I can put this plan in place without having
to wait to ship modules.

Current plan for today:
- Set up new Arista switch
- Move upstream connectivity to LinkOregon to it
- Move all downstream 10G links to this switch

I will send another email once I've scheduled the actual outages for the
cutover.

Longer term plans
- Work with vendors to replace our aging core network infrastructure with
something that's still supported and we can afford
- Look into getting redundancy put into place so that we don't have this
issue anymore
- Migrate off of the older equipment

If anyone on this list has connections to Arista or any other major edge
networking vendor, please let me know. That will certainly help our
situation in the long term!

I had already started working on a plan to replace these systems but it
seems my time may have run out (at least for the Nexus 6k switch).

Thanks all for your patience and support!

On Mon, Sep 12, 2022 at 11:54 PM Lance Albertson <la...@osuosl.org> wrote:

> Sadly this just happened again about 50 minutes ago. We may need to do
> some emergency firmware patching tomorrow. As a backup, I'm also
> formulating a plan to add another switch to try to minimize the impact of
> this troublesome switch.
>
> Once I gather some additional information tomorrow morning, I'll send an
> update on what we're planning to do.
>
> Thanks again for your patience.
>
> On Mon, Sep 12, 2022 at 3:14 PM Lance Albertson <la...@osuosl.org> wrote:
>
>> This happened again at approximately 10AM PDT. Since we moved our uplink
>> to this switch, everything went down while the switch rebooted.
>>
>> We're still planning on doing an upgrade but don't have a date yet for
>> that. We'll hopefully get that going soon.
>>
>> Thanks for your patience.
>>
>> On Wed, Aug 24, 2022 at 7:40 AM Lance Albertson <la...@osuosl.org> wrote:
>>
>>> Unfortunately this just happened again overnight. We may need to
>>> schedule another outage to perform a software upgrade on this switch so
>>> that this stops happening. We'll send an announcement out once we have
>>> everything in place to do that upgrade.
>>>
>>> Thanks-
>>>
>>> On Wed, May 25, 2022 at 11:22 PM Lance Albertson <la...@osuosl.org>
>>> wrote:
>>>
>>>> All,
>>>>
>>>> It appears that one of our core network switches had a kernel panic and
>>>> rebooted which caused widespread outages throughout our infrastructure. As
>>>> of right now, everything appears to be back to normal but please let me
>>>> know if that isn't the case by sending an email to support@osuosl.org.
>>>>
>>>> Apologies for the outage and we'll be looking into why this switch had
>>>> a kernel panic in the first place.
>>>>
>>>> Thanks-
>>>>
>>>
-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [Hosting] Core network switch reboot

Posted by Lance Albertson <la...@osuosl.org>.

This has been completed and everything seems to be working fine.

Keep in mind that the troublesome switch could reboot again until we figure
out why it's happening. If it does, its impact should at least be smaller
than before.

Thanks!

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [Hosting] Core network switch reboot

Posted by Lance Albertson <la...@osuosl.org>.

I have the "new" switch set up and ready to go. I'm currently planning on
doing the cutover in about 20 minutes (3pm PDT). You will see a set of
outages as I plan to do the following:

1. Move LinkOregon uplink to "new" switch
2. Move oslsw3 uplink to "new" switch
3. Move oslsw1 uplink to "new" switch
4. Move remaining backend 10g switches

If anything goes wrong, I should be able to quickly revert the change.
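
For anyone who wants to script a similar ordered change, here is a minimal
Python sketch of "apply the steps in order, and undo the completed ones in
reverse if one fails". The function name and the no-op callables are
hypothetical placeholders; this is not the tooling used for this cutover,
which was done by hand.

```python
def run_cutover(steps):
    """Apply (name, apply, revert) steps in order.

    If a step raises, the steps already completed are reverted in reverse
    order and False is returned. Returns True when every step succeeds.
    """
    completed = []
    for name, apply, revert in steps:
        try:
            apply()
        except Exception as exc:
            print(f"step failed: {name} ({exc}); reverting")
            for _done_name, done_revert in reversed(completed):
                done_revert()
            return False
        completed.append((name, revert))
    return True


if __name__ == "__main__":
    def noop():
        # Placeholder: a real step would move a link or change configuration.
        pass

    plan = [
        ('Move LinkOregon uplink to "new" switch', noop, noop),
        ('Move oslsw3 uplink to "new" switch', noop, noop),
        ('Move oslsw1 uplink to "new" switch', noop, noop),
        ("Move remaining backend 10g switches", noop, noop),
    ]
    print(run_cutover(plan))
```

Reverting in reverse order keeps the uplink move as the last thing undone,
which mirrors it being the first thing changed.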


-- 
Lance Albertson
Director
Oregon State University | Open Source Lab