Posted to builds@apache.org by Lance Albertson <la...@osuosl.org> on 2018/06/14 17:00:10 UTC

[Hosting] ftp-osl storage upgrade (full rebuild required) - Jun 18, 2018 9:30AM PDT (Jun 18 1630 UTC)

Service(s) affected: ftp.osuosl.org

During the outage, the master syncing node for our FTP cluster (ftp-osl)
will be offline, which means any updates to our software mirrors will be
delayed.

Outage Window:
Start: Mon, Jun 18 9:30AM PDT (Mon Jun 18 1630 UTC)
End: Mon, Jun 18 3:00PM PDT (Mon Jun 18 2200 UTC)
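
For anyone scheduling mirror jobs around this window, here is a small sketch
that converts the PDT times above to UTC and computes the window length. It
assumes a fixed UTC-7 offset for PDT, which holds for this date; the variable
names are only illustrative.

```python
from datetime import datetime, timedelta, timezone

# PDT is UTC-7; a fixed offset is enough for a single date in June 2018.
PDT = timezone(timedelta(hours=-7), "PDT")

start = datetime(2018, 6, 18, 9, 30, tzinfo=PDT)
end = datetime(2018, 6, 18, 15, 0, tzinfo=PDT)

# Convert both ends of the window to UTC and measure its length in hours.
start_utc = start.astimezone(timezone.utc)
end_utc = end.astimezone(timezone.utc)
window_hours = (end - start).total_seconds() / 3600

print(start_utc.strftime("%H%M UTC"))  # 1630 UTC
print(end_utc.strftime("%H%M UTC"))    # 2200 UTC
print(window_hours)                    # 5.5
```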

Reason for outage:

Our FTP cluster is starting to run low on disk space, so we will be adding
hard drives to the system. The system currently has 9.375T of disk space
and we're planning on upgrading it to 18.75T (this takes into account the
RAID6 configuration).

Unfortunately, because of how the disk arrays are configured, we will not
be able to grow the RAID array without a complete rebuild. This means we
have to copy all 8.8TB of data off of the machine and back onto it. Since
this task is large and time-consuming, we've come up with an alternative
so that our master FTP server isn't offline for very long.

We recently built a new Ceph cluster for some new storage needs at the OSL,
and we are going to temporarily use this cluster to serve the ftp-osl
content. I've already copied the content onto a new volume and have tested
it enough to feel it can handle the load. This should make the transition
much easier and quicker than initially planned. This server is already out
of DNS rotation, and we plan to keep it out of rotation until this process
is complete to reduce the I/O load.

So here's the plan thus far, starting on Monday:

1. Stop all services on the system and do one final rsync to the Ceph
volume
2. Reboot the machine, destroy the current RAID, and create a new one with
the new disks
3. Reinstall the OS
4. Bootstrap the machine without FTP components initially and set up the
Ceph volume
5. Deploy the FTP components once the Ceph volume is set up and ready to go
6. Ensure inter-node FTP syncing is working using the Ceph volume
7. Sync data from the Ceph volume back over to the local disks (I'm
guessing this will take 18-24 hours)
8. Once the sync is complete, shut down all services and switch the mount
point over to the local disks
9. Profit!
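
As a rough illustration of the two rsync passes in steps 1 and 7, the
sketch below builds the command lines without running them. The mount
points and the helper name are hypothetical, and the flag set (-a, -H,
--delete, --numeric-ids) is only an assumption about what a full-tree
mirror copy typically uses, not necessarily what will be run here.

```python
import shlex

def build_rsync_cmd(src, dst, dry_run=True):
    """Return an rsync argument list for copying one tree onto another."""
    # -a: archive mode, -H: preserve hard links,
    # --delete: remove files on dst that no longer exist on src,
    # --numeric-ids: keep uid/gid numbers as-is across the OS reinstall.
    cmd = ["rsync", "-aH", "--delete", "--numeric-ids"]
    if dry_run:
        cmd.append("--dry-run")
    # A trailing slash on the source copies the directory's contents.
    cmd += [src.rstrip("/") + "/", dst.rstrip("/") + "/"]
    return cmd

# Hypothetical mount points, for illustration only.
LOCAL = "/data/ftp"
CEPH = "/mnt/ceph/ftp"

final_sync = build_rsync_cmd(LOCAL, CEPH, dry_run=False)  # step 1
copy_back = build_rsync_cmd(CEPH, LOCAL)                  # step 7, dry run first

print(shlex.join(final_sync))
print(shlex.join(copy_back))
```

Defaulting to a dry run is deliberate: with --delete, swapping source and
destination by mistake would remove data, so it is worth reviewing the
output before the real pass.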

I would like to thank IBM for donating the hard drives needed for this
upgrade.

We plan to do the storage upgrades on our two other nodes (ftp-nyc &
ftp-chi) soon; however, we won't be using the Ceph cluster for those since
they are remote. The current plan is to take one machine out for several
days and sync the data back between the nodes. I will send another outage
announcement for those two nodes once we're ready. We still need to ship
the drives to those locations and work with the local data centers to get
them installed.

Projects affected: Any project using our FTP cluster as a master syncing
point

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [Hosting] ftp-osl storage upgrade (full rebuild required) - Jun 18, 2018 9:30AM PDT (Jun 18 1630 UTC)

Posted by Lance Albertson <la...@osuosl.org>.
This has been completed and I've put ftp-osl back into rotation! Thanks for
your patience.

On Fri, Jun 22, 2018 at 12:44 PM, Lance Albertson <la...@osuosl.org> wrote:

> The sync has been completed and I will be switching this over to the local
> drives at 1:30PM PDT (2030 UTC) today. I'm also going to reboot the machine
> so that it's back on the normal CentOS kernel instead of the custom
> mainline kernel we needed for Ceph. This outage should only last about
> 10 minutes while the machine reboots.
>
> This does not affect anything pointed at ftp.osuosl.org, only ftp-osl
> (which is out of rotation).
>
> Thanks-
>
> On Tue, Jun 19, 2018 at 8:57 AM, Lance Albertson <la...@osuosl.org> wrote:
>
>> It's taking longer than I expected to sync the data back to the local
>> disks. This is due to the fact that the system is also rebuilding two RAID6
>> arrays which I forgot to account for. This is also making the system
>> slower than I expected. At this rate it might take a few days to copy all
>> of the data back. Hopefully once the RAID6 arrays have finished rebuilding,
>> the I/O rate will speed up the syncing. Both arrays are currently at 55%
>> and 47% and we've transferred over 993G of 8.8T of data to the local disks.
>>
>> I will send another update once I'm ready to switch the system back over.
>>
>> Thanks-
>>
>> On Mon, Jun 18, 2018 at 3:49 PM, Lance Albertson <la...@osuosl.org>
>> wrote:
>>
>>> I just wanted to send you all an update on where we're at in the process.
>>>
>>> As of right now, ftp-osl is back online and serving its content from
>>> the Ceph volume. I've gone ahead and kicked off a few manual syncs to
>>> catch everything up however if you're using us as a master I recommend you
>>> kick off an update job right now. I'm also currently copying the content to
>>> the local disks which I expect to run through tomorrow sometime.
>>>
>>> The rebuild took a little bit longer than originally planned due to some
>>> issues I ran into building the new RAID array. My original plan didn't work
>>> so I had to go with plan B which took a little longer. Plan B resulted in
>>> creating two separate RAID6 arrays which means I lost about 2T in capacity
>>> from my original plan.
>>>
>>> I'm keeping ftp-osl out of the public rotation for now since its I/O
>>> throughput isn't likely as good as before since it's serving the content
>>> via Ceph.
>>>
>>> I'll send another update tomorrow when I'm ready to switch back over to
>>> local storage. Please let me know if you notice any issues.
>>>
>>> Thanks-
>>>
>>> On Thu, Jun 14, 2018 at 3:52 PM, Lance Albertson <la...@osuosl.org>
>>> wrote:
>>>
>>>> I had a few questions regarding this outage that I wanted to clarify
>>>> for everyone.
>>>>
>>>> 1. There should be no outage during the 5.5 hour outage window for
>>>> anything pointed to ftp.osuosl.org (unless your DNS is directly
>>>> pointing at ftp-osl.osuosl.org)
>>>> 2. During the 18-24hr sync from ceph to local storage, ftp-osl should
>>>> have normal read/write operations. There might be a little bit of I/O
>>>> performance hit during that window but it's hard to tell. There will be a
>>>> short (likely 5 min) outage to read/writes on ftp-osl when I do the final
>>>> switch back to local storage however.
>>>>
>>>> On Thu, Jun 14, 2018 at 10:00 AM, Lance Albertson <la...@osuosl.org>
>>>> wrote:
>>>>
>>>>> Service(s) affected: ftp.osuosl.org
>>>>>
>>>>> During the outage, the master syncing node for our FTP cluster
>>>>> (ftp-osl) will be offline which means any updates to our software mirrors
>>>>> will be delayed.
>>>>>
>>>>> Outage Window:
>>>>> Start: Mon, Jun 18 9:30AM PDT (Mon Jun 18 1630 UTC)
>>>>> End: Mon, Jun 18 3:00PM PDT (Mon Jun 18 2200 UTC)
>>>>>
>>>>> Reason for outage:
>>>>>
>>>>> Our FTP cluster is starting to run low on disk space and we will be
>>>>> adding additional hard drives to the system. Our system currently has
>>>>> 9.375T of disk space and we're planning on upgrading it to 18.75T (this
>>>>> takes into account the RAID6 configuration)
>>>>>
>>>>> Unfortunately, due to the nature of how the disk arrays are
>>>>> configured, we will not be able to grow the RAID array without a complete
>>>>> rebuild. This means we're going to have to re-copy all 8.8TB of data off of
>>>>> the machine and back onto it. Since this task is rather large and time
>>>>> consuming we've come up with a better alternative so that we don't have our
>>>>> master FTP server offline for very long.
>>>>>
>>>>> We have just recently built a new Ceph cluster for some new storage
>>>>> needs at the OSL and we are going to temporarily use this cluster to serve
>>>>> the ftp-osl content. I've already copied the content onto a new volume and
>>>>> have tested it enough to feel it can handle the load. This should make the
>>>>> transition plan much easier and quicker than initially planned. This server is
>>>>> already out of DNS rotation and we are planning on keeping it out of
>>>>> rotation until this process is complete to reduce the I/O load.
>>>>>
>>>>> So here's the plan thus far starting on Monday:
>>>>>
>>>>> 1. Stopping all services on the system and doing one final rsync to
>>>>> the Ceph volume
>>>>> 2. Rebooting machine and destroying the current RAID and creating a
>>>>> new one with the new disks
>>>>> 3. Reinstall the OS
>>>>> 4. Bootstrap machine without FTP components initially, setup ceph
>>>>> volume
>>>>> 5. Deploy FTP components after Ceph volume is setup and ready to go
>>>>> 6. Ensure inter FTP node syncing is working using the Ceph volume
>>>>> 7. Sync data from Ceph volume back over to local disks (I'm guessing
>>>>> this will take 18-24 hours)
>>>>> 8. Once sync is complete, shutdown all services and switch the mount
>>>>> point over to the local disks
>>>>> 9. Profit!
>>>>>
>>>>> I would like to thank IBM for donating the hard drives needed for this
>>>>> upgrade.
>>>>>
>>>>> We will plan on doing the storage upgrades on our two other nodes
>>>>> (ftp-nyc & ftp-chi) soon, however we won't be using the Ceph cluster for
>>>>> this since they are remote. The current plan is to take one machine out for
>>>>> several days and sync the data back between the nodes. I will send another
>>>>> outage announcement for those two nodes once we're ready for that. We still
>>>>> need to ship the drives to the locations and work with the local data
>>>>> centers to get them installed.
>>>>>
>>>>> Projects affected: Any project using our FTP cluster as a master
>>>>> syncing point
>>>>>

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [Hosting] ftp-osl storage upgrade (full rebuild required) - Jun 18, 2018 9:30AM PDT (Jun 18 1630 UTC)

Posted by Lance Albertson <la...@osuosl.org>.
The sync has been completed and I will be switching this over to the local
drives at 1:30PM PDT (2030 UTC) today. I'm also going to reboot the machine
so that it's back on the normal CentOS kernel instead of the custom
mainline kernel we needed for Ceph. This outage should only last about
10 minutes while the machine reboots.

This does not affect anything pointed at ftp.osuosl.org, only ftp-osl
(which is out of rotation).

Thanks-

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [Hosting] ftp-osl storage upgrade (full rebuild required) - Jun 18, 2018 9:30AM PDT (Jun 18 1630 UTC)

Posted by Lance Albertson <la...@osuosl.org>.
It's taking longer than I expected to sync the data back to the local
disks. The system is also rebuilding two RAID6 arrays, which I forgot to
account for, and that is making the system slower than I expected. At this
rate it might take a few days to copy all of the data back. Hopefully the
sync will speed up once the RAID6 arrays have finished rebuilding. The
arrays are currently at 55% and 47%, and we've transferred 993G of 8.8T of
data over to the local disks.
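
To put "a few days" in perspective, here is a back-of-the-envelope
estimate. It assumes the copy started around the time of Monday's update
(Jun 18, 3:49 PM), that this message reflects progress as of Jun 19,
8:57 AM, and that "T" and "G" are binary units (1T = 1024G). None of those
are exact, and the rate should improve once the RAID rebuild finishes, so
treat the result as a pessimistic sketch rather than a forecast.

```python
from datetime import datetime

G_PER_T = 1024  # assumption: binary units, 1T = 1024G

def estimate_remaining_hours(done_g, total_t, started, now):
    """Linear extrapolation of the remaining copy time from progress so far."""
    elapsed_hours = (now - started).total_seconds() / 3600
    rate_g_per_hour = done_g / elapsed_hours
    remaining_g = total_t * G_PER_T - done_g
    return remaining_g / rate_g_per_hour, rate_g_per_hour

# Assumed timestamps; only the 993G and 8.8T figures come from this update.
started = datetime(2018, 6, 18, 15, 49)
now = datetime(2018, 6, 19, 8, 57)

hours_left, rate = estimate_remaining_hours(993, 8.8, started, now)
print(f"{rate:.0f} G/hour, about {hours_left / 24:.1f} days remaining")
```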

I will send another update once I'm ready to switch the system back over.

Thanks-

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [Hosting] ftp-osl storage upgrade (full rebuild required) - Jun 18, 2018 9:30AM PDT (Jun 18 1630 UTC)

Posted by Lance Albertson <la...@osuosl.org>.
I just wanted to send you all an update on where we are in the process.

As of right now, ftp-osl is back online and serving its content from the
Ceph volume. I've gone ahead and kicked off a few manual syncs to catch
everything up; however, if you're using us as a master, I recommend you
kick off an update job right now. I'm also currently copying the content to
the local disks, which I expect to run through sometime tomorrow.

The rebuild took a little longer than originally planned due to some
issues I ran into building the new RAID array. My original plan didn't
work, so I had to go with plan B, which took a little longer. Plan B
resulted in two separate RAID6 arrays, which means I lost about 2T of
capacity compared to my original plan.
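
The capacity loss follows from RAID6 reserving two disks' worth of space
for parity per array, so two arrays pay that cost twice. The sketch below
illustrates the arithmetic. The disk count and per-disk size are
hypothetical values picked only because they reproduce the 18.75T target;
the actual hardware layout isn't spelled out in this thread.

```python
def raid6_usable(num_disks, disk_size_t):
    """Usable capacity of one RAID6 array: two disks' worth goes to parity."""
    if num_disks < 4:
        raise ValueError("RAID6 needs at least 4 disks")
    return (num_disks - 2) * disk_size_t

# Hypothetical layout, for illustration only.
DISK_T = 0.9375
TOTAL_DISKS = 22

single_array = raid6_usable(TOTAL_DISKS, DISK_T)
two_arrays = 2 * raid6_usable(TOTAL_DISKS // 2, DISK_T)

print(single_array)               # 18.75
print(two_arrays)                 # 16.875
print(single_array - two_arrays)  # 1.875
```

Under these assumed numbers the difference is two extra parity disks, or
roughly the 2T mentioned above.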

I'm keeping ftp-osl out of the public rotation for now since its I/O
throughput likely isn't as good as before while it's serving the content
via Ceph.

I'll send another update tomorrow when I'm ready to switch back over to
local storage. Please let me know if you notice any issues.

Thanks-

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab

Re: [Hosting] ftp-osl storage upgrade (full rebuild required) - Jun 18, 2018 9:30AM PDT (Jun 18 1630 UTC)

Posted by Lance Albertson <la...@osuosl.org>.
I had a few questions regarding this outage that I wanted to clarify for
everyone.

1. There should be no outage during the 5.5 hour outage window for anything
pointed to ftp.osuosl.org (unless your DNS is pointing directly at
ftp-osl.osuosl.org).
2. During the 18-24hr sync from Ceph to local storage, ftp-osl should have
normal read/write operations. There might be a small I/O performance hit
during that window, but it's hard to tell. There will, however, be a short
(likely 5 min) outage to reads/writes on ftp-osl when I do the final switch
back to local storage.

-- 
Lance Albertson
Director
Oregon State University | Open Source Lab