Posted to builds@apache.org by Lance Albertson <la...@osuosl.org> on 2021/08/01 15:15:56 UTC

[Hosting] Unplanned Power Event

It seems we had an unplanned power event in our primary data center early
this morning at 3:03 AM PDT (10:03 UTC) that affected one of the two power
feeds. Virtually every system with dual power supplies should have remained
online. The exceptions are some systems located in a row that are fed only
by that power feed, which include:

- power8-aix
- pieta.debian.org
- gcc2-power8
- All Buildbot/RTEMS systems
- gcc113
- gcc114
- gcc115
- gcc116
- gcc117
- gcc118

I believe every system that we monitor should be back online, but there
might be others we aren't monitoring that are still down. If that's the
case, please send an email to support and we'll take a look as soon as
possible.

I'm still waiting to hear back about what happened and why, and will pass
that information along once I learn more.

Thanks for your patience.

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [Hosting] Unplanned Power Event

Posted by Lance Albertson <la...@osuosl.org>.
FYI: It looks like we had another power event that impacted our primary
data center along with our OpenCompute hosts in another data center. I'm
checking to see what might be down, but this time it seems to be far less
widespread. I don't think we had any issues with any of the OSL-managed
services.

Please let me know if you do have any issues.

Thanks-

On Tue, Aug 3, 2021 at 3:26 PM Lance Albertson <la...@osuosl.org> wrote:

> I received an update on the issues we had in the primary data center. It
> appears that there was a battery cell problem on one of the UPSes. Prior
> to the outage, OSU issued a purchase order for replacement batteries and
> is waiting for them to arrive before scheduling the installation. The
> projected arrival date for the batteries is September 10th. When they
> arrive, we will schedule the install as a priority.
> In the meantime, this may happen again. However, I did fix how power was
> configured on a few of the systems we had issues with.
>
> If you have any questions or concerns please let me know.
>
> Thank you!
>
> On Sun, Aug 1, 2021 at 12:28 PM Lance Albertson <la...@osuosl.org> wrote:
>
>> I got word that this outage was more campus-wide and also impacted the
>> OpenCompute hosts. I went through those hosts and made sure they are back
>> online, but let me know if I missed anything.
>>
>> OSU will be sending in a tech in a few days to see why the UPS didn't
>> fail over properly in our primary data center, which is what caused the
>> power event. I'm also going to spot-check power on a few hosts when I go
>> in on Tuesday to ensure it is split properly between the two power feeds.
>> If you had any dual-power hosts that went down, please let me know ASAP
>> so I can add them to the list of hosts to check.
>>
>> Thanks for your patience!
>>
>>
>> --
>> Lance Albertson
>> Director
>> Oregon State University | Open Source Lab
>>
>
>
> --
> Lance Albertson
> Director
> Oregon State University | Open Source Lab
>


-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [Hosting] Unplanned Power Event

Posted by Lance Albertson <la...@osuosl.org>.
I received an update on the issues we had in the primary data center. It
appears that there was a battery cell problem on one of the UPSes. Prior
to the outage, OSU issued a purchase order for replacement batteries and
is waiting for them to arrive before scheduling the installation. The
projected arrival date for the batteries is September 10th. When they
arrive, we will schedule the install as a priority.
In the meantime, this may happen again. However, I did fix how power was
configured on a few of the systems we had issues with.

If you have any questions or concerns please let me know.

Thank you!


-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [Hosting] Unplanned Power Event

Posted by Lance Albertson <la...@osuosl.org>.
I got word that this outage was more campus-wide and also impacted the
OpenCompute hosts. I went through those hosts and made sure they are back
online, but let me know if I missed anything.

OSU will be sending in a tech in a few days to see why the UPS didn't fail
over properly in our primary data center, which is what caused the power
event. I'm also going to spot-check power on a few hosts when I go in on
Tuesday to ensure it is split properly between the two power feeds. If you
had any dual-power hosts that went down, please let me know ASAP so I can
add them to the list of hosts to check.

Thanks for your patience!


-- 
Lance Albertson
Director
Oregon State University | Open Source Lab