Posted to dev@arrow.apache.org by Wes McKinney <we...@gmail.com> on 2020/03/11 22:16:31 UTC

[DISCUSS] Leveraging cloud computing resources for Arrow test workloads

hi folks,

There has periodically been a discussion about employing dedicated
compute resources to serve our testing needs beyond what can be
accomplished in free / public CI services like GitHub Actions,
Appveyor, etc. For example:

* Workloads requiring a CUDA-capable GPU
* Tests requiring a lot of memory
* ARM architecture

While physical machines can be hooked up to some CI/CD services like
GitHub Actions and Buildkite, I believe we should not be 100%
dependent on the availability of such hardware (the recent tornado in
Nashville is a good example of what can go wrong).

At some point it will make sense to be able to provision cloud hosts
(either temporary spot instances or persistent nodes) to meet these
needs. This brings up several questions:

* Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can
donate cloud compute credits to the project
* What kind of devops tooling would be appropriate to provision and
manage the instances, scaling up and down based on need?
* What CI/CD platform would be appropriate to dispatch work to the
cloud nodes (taking into consideration the high costs of sysadmin, and
seeking to minimize nodes sitting unused)?

This will probably take time to work out and there is significant
engineering involved in achieving any solution, but it would be good
to have all the options on the table with a frank analysis of the
pros/cons and costs (both in money and volunteer time) involved.

Thanks,
Wes

Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads

Posted by Micah Kornfield <em...@gmail.com>.

OK, I'll try to do a little bit more investigation to see if I can get some
basic integration set up (I probably won't have bandwidth for at least two
weeks).

On Sun, Mar 15, 2020 at 3:41 AM Antoine Pitrou <an...@python.org> wrote:

>
> On 15/03/2020 at 04:57, Wes McKinney wrote:
> > On Sat, Mar 14, 2020, 10:52 PM Micah Kornfield <em...@gmail.com>
> > wrote:
> >
> >> Hi Antoine,
> >> Could you clarify what you mean by:
> >>
> >>> Given our current resource utilization on Github Actions, it seems that
> >>> even a non-auto-scaling setup could be useful.
> >>
> >>
> >> I could interpret it in a couple of ways ...
> >>
> >
> > I think he means that we would not have difficulty keeping some persistent
> > nodes fully (or at least > 50%) utilized during regular working hours.
>
> Right.  And we have a non-trivial number of "nightly" jobs (depending on
> where you are on Earth) as well :-)
>
> Regards
>
> Antoine.
>

Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads

Posted by Antoine Pitrou <an...@python.org>.

On 15/03/2020 at 04:57, Wes McKinney wrote:
> On Sat, Mar 14, 2020, 10:52 PM Micah Kornfield <em...@gmail.com>
> wrote:
> 
>> Hi Antoine,
>> Could you clarify what you mean by:
>>
>>> Given our current resource utilization on Github Actions, it seems that
>>> even a non-auto-scaling setup could be useful.
>>
>>
>> I could interpret it in a couple of ways ...
>>
> 
> I think he means that we would not have difficulty keeping some persistent
> nodes fully (or at least > 50%) utilized during regular working hours.

Right.  And we have a non-trivial number of "nightly" jobs (depending on
where you are on Earth) as well :-)

Regards

Antoine.

Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads

Posted by Wes McKinney <we...@gmail.com>.

On Sat, Mar 14, 2020, 10:52 PM Micah Kornfield <em...@gmail.com>
wrote:

> Hi Antoine,
> Could you clarify what you mean by:
>
> > Given our current resource utilization on Github Actions, it seems that
> > even a non-auto-scaling setup could be useful.
>
>
> I could interpret it in a couple of ways ...
>

I think he means that we would not have difficulty keeping some persistent
nodes fully (or at least > 50%) utilized during regular working hours.


> Thanks,
> Micah
>
> On Fri, Mar 13, 2020 at 7:36 AM Antoine Pitrou <an...@python.org> wrote:
>
> >
> > On 13/03/2020 at 01:45, Brian Hulette wrote:
> > > * What kind of devops tooling would be appropriate to provision and
> > > manage the instances, scaling up and down based on need?
> > > * What CI/CD platform would be appropriate to dispatch work to the
> > > cloud nodes (taking into consideration the high costs of sysadmin, and
> > > seeking to minimize nodes sitting unused)?
> > >
> > > I looked into solutions for running CI/CD workers on GCP a (very) little
> > > bit and just wanted to share some findings.
> > > Appveyor claims it can auto-scale GCE instances [1] but I don't think it
> > > would go beyond 5 concurrent "self-hosted" jobs [2]. Would that be a
> > > problem?
> > > Buildkite has documentation about running agents on a scalable GKE cluster
> > > [3], but unfortunately no way to auto-scale based on the backlog. We could
> > > maybe roll our own/contribute something based on their AWS scaler [4].
> >
> > Given our current resource utilization on Github Actions, it seems that
> > even a non-auto-scaling setup could be useful.
> >
> > Regards
> >
> > Antoine.
> >
>

Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads

Posted by Micah Kornfield <em...@gmail.com>.

Hi Antoine,
Could you clarify what you mean by:

> Given our current resource utilization on Github Actions, it seems that
> even a non-auto-scaling setup could be useful.


I could interpret it in a couple of ways ...

Thanks,
Micah

On Fri, Mar 13, 2020 at 7:36 AM Antoine Pitrou <an...@python.org> wrote:

>
> On 13/03/2020 at 01:45, Brian Hulette wrote:
> > * What kind of devops tooling would be appropriate to provision and
> > manage the instances, scaling up and down based on need?
> > * What CI/CD platform would be appropriate to dispatch work to the
> > cloud nodes (taking into consideration the high costs of sysadmin, and
> > seeking to minimize nodes sitting unused)?
> >
> > I looked into solutions for running CI/CD workers on GCP a (very) little
> > bit and just wanted to share some findings.
> > Appveyor claims it can auto-scale GCE instances [1] but I don't think it
> > would go beyond 5 concurrent "self-hosted" jobs [2]. Would that be a
> > problem?
> > Buildkite has documentation about running agents on a scalable GKE cluster
> > [3], but unfortunately no way to auto-scale based on the backlog. We could
> > maybe roll our own/contribute something based on their AWS scaler [4].
>
> Given our current resource utilization on Github Actions, it seems that
> even a non-auto-scaling setup could be useful.
>
> Regards
>
> Antoine.
>

Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads

Posted by Antoine Pitrou <an...@python.org>.

On 13/03/2020 at 01:45, Brian Hulette wrote:
> * What kind of devops tooling would be appropriate to provision and
> manage the instances, scaling up and down based on need?
> * What CI/CD platform would be appropriate to dispatch work to the
> cloud nodes (taking into consideration the high costs of sysadmin, and
> seeking to minimize nodes sitting unused)?
> 
> I looked into solutions for running CI/CD workers on GCP a (very) little
> bit and just wanted to share some findings.
> Appveyor claims it can auto-scale GCE instances [1] but I don't think it
> would go beyond 5 concurrent "self-hosted" jobs [2]. Would that be a
> problem?
> Buildkite has documentation about running agents on a scalable GKE cluster
> [3], but unfortunately no way to auto-scale based on the backlog. We could
> maybe roll our own/contribute something based on their AWS scaler [4].

Given our current resource utilization on GitHub Actions, it seems that
even a non-auto-scaling setup could be useful.

Regards

Antoine.

Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads

Posted by Brian Hulette <hu...@gmail.com>.

* What kind of devops tooling would be appropriate to provision and
manage the instances, scaling up and down based on need?
* What CI/CD platform would be appropriate to dispatch work to the
cloud nodes (taking into consideration the high costs of sysadmin, and
seeking to minimize nodes sitting unused)?

I looked into solutions for running CI/CD workers on GCP a (very) little
bit and just wanted to share some findings.
Appveyor claims it can auto-scale GCE instances [1] but I don't think it
would go beyond 5 concurrent "self-hosted" jobs [2]. Would that be a
problem?
Buildkite has documentation about running agents on a scalable GKE cluster
[3], but unfortunately no way to auto-scale based on the backlog. We could
maybe roll our own/contribute something based on their AWS scaler [4].

[1] https://www.appveyor.com/docs/byoc/gce/
[2] https://www.appveyor.com/pricing/
[3] https://buildkite.com/docs/agent/v3/gcloud#running-the-agent-on-google-kubernetes-engine
[4] https://github.com/buildkite/buildkite-agent-scaler
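
To make "auto-scale based on the backlog" a bit more concrete, here is a
minimal sketch of the sizing decision such a scaler would have to make. It
is not taken from the Buildkite scaler [4] or from any CI service's API; the
function name, its parameters, and the default cap of 5 agents (chosen only
to echo the concurrency limit mentioned above) are all assumptions for
illustration. Fetching the job counts and resizing the instance group are
left out entirely.

```python
import math


def desired_agent_count(queued_jobs, running_jobs, jobs_per_agent=1,
                        min_agents=0, max_agents=5):
    """Return how many agents to run for the current backlog.

    Hypothetical helper: every name and default here is illustrative,
    not part of any real CI platform's API.
    """
    if jobs_per_agent < 1:
        raise ValueError("jobs_per_agent must be at least 1")
    if min_agents > max_agents:
        raise ValueError("min_agents must not exceed max_agents")

    # Total demand is what is waiting plus what is already being worked on,
    # so that busy agents are not scaled away underneath their jobs.
    demand = queued_jobs + running_jobs

    # Round up: a partially loaded agent still needs a whole instance.
    needed = math.ceil(demand / jobs_per_agent)

    # Clamp to the configured floor and ceiling.
    return max(min_agents, min(max_agents, needed))
```

With min_agents set above zero this degenerates into the persistent-node
setup; with min_agents=0 the pool can drain completely when nothing is
queued, which is the case that minimizes nodes sitting unused.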

On Wed, Mar 11, 2020 at 7:49 PM Micah Kornfield <em...@gmail.com>
wrote:

> >
> > * Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can
> > donate cloud compute credits to the project
>
> Google has offered a donation of GCP credits based on some estimates I made
> last year when we were facing Travis CI issues. I'm happy to try to do some
> integration work to help make this happen.
>
> For the other questions, I'm happy to do some research, but also happy if
> someone else would like to take up the work here.  I think one blocker in
> the past has been restrictions from Apache Infra, is there any
> documentation on what is and is not supported on that front?
>
> Thanks,
> Micah
>
> On Wed, Mar 11, 2020 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi folks,
> >
> > There has periodically been a discussion about employing dedicated
> > compute resources to serve our testing needs beyond what can be
> > accomplished in free / public CI services like GitHub Actions,
> > Appveyor, etc. For example:
> >
> > * Workloads requiring a CUDA-capable GPU
> > * Tests requiring a lot of memory
> > * ARM architecture
> >
> > While physical machines can be hooked up to some CI/CD services like
> > Github Actions and Buildkite, I believe we should not be 100%
> > dependent on the availability of such hardware (the recent tornado in
> > Nashville is a good example of what can go wrong).
> >
> > At some point it will make sense to be able to provision cloud hosts
> > (either temporary spot instances or persistent nodes) to meet these
> > needs. This brings up several questions:
> >
> > * Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can
> > donate cloud compute credits to the project
> > * What kind of devops tooling would be appropriate to provision and
> > manage the instances, scaling up and down based on need?
> > * What CI/CD platform would be appropriate to dispatch work to the
> > cloud nodes (taking into consideration the high costs of sysadmin, and
> > seeking to minimize nodes sitting unused)?
> >
> > This will probably take time to work out and there is significant
> > engineering involved in achieving any solution, but it would be good
> > to have all the options on the table with a frank analysis of the
> > pros/cons and costs (both in money and volunteer time) involved.
> >
> > Thanks,
> > Wes
> >
>

Re: [DISCUSS] Leveraging cloud computing resources for Arrow test workloads

Posted by Micah Kornfield <em...@gmail.com>.

>
> * Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can
> donate cloud compute credits to the project

Google has offered a donation of GCP credits based on some estimates I made
last year when we were facing Travis CI issues. I'm happy to try to do some
integration work to help make this happen.

For the other questions, I'm happy to do some research, but also happy if
someone else would like to take up the work here.  I think one blocker in
the past has been restrictions from Apache Infra. Is there any
documentation on what is and is not supported on that front?

Thanks,
Micah

On Wed, Mar 11, 2020 at 3:17 PM Wes McKinney <we...@gmail.com> wrote:

> hi folks,
>
> There has periodically been a discussion about employing dedicated
> compute resources to serve our testing needs beyond what can be
> accomplished in free / public CI services like GitHub Actions,
> Appveyor, etc. For example:
>
> * Workloads requiring a CUDA-capable GPU
> * Tests requiring a lot of memory
> * ARM architecture
>
> While physical machines can be hooked up to some CI/CD services like
> Github Actions and Buildkite, I believe we should not be 100%
> dependent on the availability of such hardware (the recent tornado in
> Nashville is a good example of what can go wrong).
>
> At some point it will make sense to be able to provision cloud hosts
> (either temporary spot instances or persistent nodes) to meet these
> needs. This brings up several questions:
>
> * Who's going to pay for it? Perhaps Amazon, Google, or Microsoft can
> donate cloud compute credits to the project
> * What kind of devops tooling would be appropriate to provision and
> manage the instances, scaling up and down based on need?
> * What CI/CD platform would be appropriate to dispatch work to the
> cloud nodes (taking into consideration the high costs of sysadmin, and
> seeking to minimize nodes sitting unused)?
>
> This will probably take time to work out and there is significant
> engineering involved in achieving any solution, but it would be good
> to have all the options on the table with a frank analysis of the
> pros/cons and costs (both in money and volunteer time) involved.
>
> Thanks,
> Wes
>