Posted to dev@nuttx.apache.org by Nathan Hartman <ha...@gmail.com> on 2021/03/30 18:33:47 UTC

All PRs stuck in "Queued -- Waiting to run this check..."

Hi,

Does anyone know why the GitHub PR prechecks don't seem to be running?

In particular, it seems that these four:

* Build / Fetch-Source (pull_request)
* Build Documentation / build-html (pull_request)
* Check / check (pull_request)
* Lint / YAML (pull_request)

are all stuck in "Queued -- Waiting to run this check..."

for all PRs.

Thanks,
Nathan

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by "Matias N." <ma...@imap.cc>.
We're already using ccache, but it is only used within a run (i.e., for arm-01, sim, etc.). We tried persisting it across runs (from different PRs), but it seems there were issues with that.

What I suggest means we would have one cache per board/config. The scripting shouldn't be too difficult, but it requires much more storage. Not sure if we would be able to store it all.
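
A rough sketch of how that could look with actions/cache, keyed per board
and config (the matrix variable names here are only illustrative, not our
actual workflow):

    - name: Restore per-board ccache
      uses: actions/cache@v2
      with:
        # CCACHE_DIR should point at this same directory in the build step
        path: ~/.ccache
        # Hypothetical matrix variables; one cache per board:config combination
        key: ccache-${{ runner.os }}-${{ matrix.board }}-${{ matrix.config }}-${{ github.sha }}
        restore-keys: |
          ccache-${{ runner.os }}-${{ matrix.board }}-${{ matrix.config }}-

The catch is exactly the storage: every board:config combination becomes its
own cache entry counted against the repository's cache quota, so hundreds of
combinations may well not fit.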

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by Nathan Hartman <ha...@gmail.com>.
On Wed, Mar 31, 2021 at 1:13 PM Matias N. <ma...@imap.cc> wrote:
> My reasoning with the per board+config ccache is that it should more or less detect
> something like this without hardcoding any rules: only files that are impacted by any
> changes in the current PR would be rebuilt; any other file would be instantaneously
> obtained from cache. This is mostly what I observe when I use ccache locally: going
> through make menuconfig indeed rebuilds everything, but it is quite a bit faster than
> an initial build with a 100% miss rate.

Are there any customizations needed in the NuttX build system to use ccache?

Nathan

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by "Matias N." <ma...@imap.cc>.

On Wed, Mar 31, 2021, at 11:26, Nathan Hartman wrote:
> I wonder if we need the tests to be a bit smarter in restricting what they
> test.
> 
> For example if the change only affects files in arch/arm/stm32h7 then only
> stm32h7 configs will be tested.

That can be difficult to do. A header might define something that ultimately reaches
hardware-specific code and makes it fail.

My reasoning with the per board+config ccache is that it should more or less detect
something like this without hardcoding any rules: only files that are impacted by any
changes in the current PR would be rebuilt; any other file would be instantaneously
obtained from cache. This is mostly what I observe when I use ccache locally: going
through make menuconfig indeed rebuilds everything, but it is quite a bit faster than
an initial build with a 100% miss rate.

We should also eventually check that we're not adding some cache-breaking definition
(encoding the date of the build in nuttx/config.h, for example).
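
For the __DATE__/__TIME__ kind of breakage, ccache can be told to ignore
time macros; a date written into a generated header such as nuttx/config.h
would still defeat it, though, because the preprocessed input itself changes.
A minimal sketch of the relevant job environment (values are illustrative):

    env:
      CCACHE_DIR: ${{ github.workspace }}/.ccache
      # Don't let __DATE__/__TIME__ in source files force cache misses
      CCACHE_SLOPPINESS: time_macros
      CCACHE_MAXSIZE: 400M

Running "ccache -s" at the end of a job would show the actual hit rate.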

> 
> Also if the PR only affects Documentation then there's no need to build any
> board configs.

That's already being done. Any changes contained under Documentation/ only
rebuild the docs.
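
For reference, the usual way to express that kind of filter in a GitHub
Actions workflow is a paths rule on the trigger, roughly like this (not
necessarily how our workflows spell it):

    on:
      pull_request:
        paths:
          - 'Documentation/**'

with a matching paths-ignore on the board-build workflow so that doc-only
PRs skip the build matrix entirely.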

> 
> Changes in areas like sched will affect all configs so the tests cannot be
> restricted. But it seems to me that most PRs are in arch, so even if some
> PRs still have to run the full set of tests, it will still be a big
> optimization.
> 
> Nathan
> 

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by Nathan Hartman <ha...@gmail.com>.
On Wed, Mar 31, 2021 at 9:05 AM Matias N. <ma...@imap.cc> wrote:

> The situation is a bit better now, since we did not continue to submit as
> many PRs. However, the builds are still lagging (there are ones from 20hr
> ago still running). I see many of the new "cancelling duplicates" jobs
> queued, but they have not yet run.
> I have cancelled a few myself. In case I cancelled a running one by
> mistake, my apologies. I would suggest we give the system a while to
> catch up.
>
> In the meantime, if you are force pushing a PR, please rebase first.
>
> Also, I agree about simplifying the macOS build if it will help while we
> get to a better situation.



I wonder if we need the tests to be a bit smarter in restricting what they
test.

For example if the change only affects files in arch/arm/stm32h7 then only
stm32h7 configs would be tested.
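
A step along these lines could make that decision; the paths and variable
name are only illustrative, and it assumes the checkout has enough history
to diff against the base branch (e.g. fetch-depth: 0):

    - name: Decide which build lists to run
      run: |
        # Files changed by this PR relative to its base branch
        CHANGED=$(git diff --name-only ${{ github.event.pull_request.base.sha }}...HEAD)
        # If anything outside the stm32h7 directories changed, keep the full matrix
        if echo "$CHANGED" | grep -q -v -e '^arch/arm/src/stm32h7/' -e '^boards/arm/stm32h7/'; then
          echo "full_matrix=true" >> $GITHUB_ENV
        else
          echo "full_matrix=false" >> $GITHUB_ENV
        fi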

Also if the PR only affects Documentation then there's no need to build any
board configs.

Changes in areas like sched will affect all configs so the tests cannot be
restricted. But it seems to me that most PRs are in arch, so even if some
PRs still have to run the full set of tests, it will still be a big
optimization.

Nathan

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by "Matias N." <ma...@imap.cc>.
On Wed, Mar 31, 2021, at 09:33, Abdelatif Guettouche wrote:
> > Also, I agree about simplifying the macOS build if it will help while we get to a better situation.
> 
> We can have a special test list for macOS that includes just a few
> configs from the simulator and other chips.

I think that just by dropping one of the three runs we can get a large gain. The "other" list of macOS jobs takes about 143m, while the other two ("sim" and "arm-12") take about 84m. For reference, all Linux jobs take about 30-50m. Linux's "other" is also longer (60m). Maybe we can try to fine-tune that one as well (find faster toolchains, remove some configs, etc.).

Best,
Matias

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by Abdelatif Guettouche <ab...@gmail.com>.
> Also, I agree about simplifying the macOS build if it will help while we get to a better situation.

We can have a special test list for macOS that includes just a few
configs from the simulator and other chips.

> There has been some talk about supporting non-hosted runners, but there are
> some security issues that still need to be worked out.

> The problem is that rolling our own can be quite a pain to maintain IMHO
> (unless someone has access to some powerful high-availability spare machine).
> We would also have to redo all CI handling since it wouldn't be GitHub's.

How about runners hosted and maintained by another organisation?  Is
this something that we can have?
This could include runners with "targets" where we can do functional
tests in hardware instead of the simulator or QEMU.

On Wed, Mar 31, 2021 at 3:05 PM Matias N. <ma...@imap.cc> wrote:
>
> The situation is a bit better now, since we did not continue to submit as many PRs. However, the builds are still lagging (there are ones from 20hr ago still running). I see many of the new "cancelling duplicates" jobs queued, but they have not yet run.
> I have cancelled a few myself. In case I cancelled a running one by mistake, my apologies. I would suggest we give the system a while to catch up.
>
> In the meantime, if you are force pushing a PR, please rebase first.
>
> Also, I agree about simplifying the macOS build if it will help while we get to a better situation.
>
> Best,
> Matias

All PRs stuck in "Queued -- Waiting to run this check..."

Posted by "Matias N." <ma...@imap.cc>.
The situation is a bit better now, since we did not continue to submit as many PRs. However, the builds are still lagging (there are ones from 20hr ago still running). I see many of the new "cancelling duplicates" jobs queued, but they have not yet run.
I have cancelled a few myself. In case I cancelled a running one by mistake, my apologies. I would suggest we give the system a while to catch up.

In the meantime, if you are force pushing a PR, please rebase first.

Also, I agree about simplifying the macOS build if it will help while we get to a better situation.

Best,
Matias

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by Abdelatif Guettouche <ab...@gmail.com>.
> Anyway, I see the "other" includes a very long list of very different platforms. Maybe splitting it into avr, risc-v, xtensa could help?

When Xtensa was merged with "others" it had only 3 configs: nsh,
ostest and smp.  Now it contains around 30, and surely more to come.
I agree that we should reconsider this and split the "others" group.

On Wed, Mar 31, 2021 at 4:16 PM Matias N. <ma...@imap.cc> wrote:
>
> On Tue, Mar 30, 2021, at 21:44, Brennan Ashton wrote:
> > We were sharing cache but ran into some strange issues with collisions and
> > disabled it unfortunately.
>
> Do you remember the exact issue?
>
> What if we had one ccache cache per <board>:<config> run, shared across runs? (Not sure if that would be too much storage.) It would be as if you were running locally under the same conditions and just rebuilding. If the specific board/config combination is not affected by the current PR (no relevant config is changed and none of the changed code is part of that build) it should be fairly fast.
>
> Anyway, I see the "other" includes a very long list of very different platforms. Maybe splitting it into avr, risc-v, xtensa could help?
>
> Best,
> Matias

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by "Matias N." <ma...@imap.cc>.
On Tue, Mar 30, 2021, at 21:44, Brennan Ashton wrote:
> We were sharing cache but ran into some strange issues with collisions and
> disabled it unfortunately.

Do you remember the exact issue?

What if we had one ccache cache per <board>:<config> run, shared across runs? (Not sure if that would be too much storage.) It would be as if you were running locally under the same conditions and just rebuilding. If the specific board/config combination is not affected by the current PR (no relevant config is changed and none of the changed code is part of that build) it should be fairly fast.

Anyway, I see the "other" includes a very long list of very different platforms. Maybe splitting it into avr, risc-v, xtensa could help?

Best,
Matias

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by Brennan Ashton <ba...@brennanashton.com>.
Part of the issue here is also limits across the organization. This has
been discussed for a couple of months now on the Apache build mailing lists,
and GitHub has been part of them, trying to figure out a smart path
forward.

We were sharing cache but ran into some strange issues with collisions and
disabled it unfortunately.

Personally, I think we should limit the macOS runs to configurations that
specifically test the build system, which would be a couple for each
supported arch, including things like C++ builds.

I see Matias already started working on the ticket I opened for killing
previously running builds.
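
For what it's worth, the general shape of that rule in a workflow file,
using the GitHub Actions concurrency setting (assuming it is available to
us), is roughly:

    concurrency:
      group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
      cancel-in-progress: true

so a new push to the same PR cancels the run that is still queued or in
progress for it.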

There has been some talk about supporting non-hosted runners, but there are
some security issues that still need to be worked out. Once again, see the
mailing list for more context on this.

--Brennan

On Tue, Mar 30, 2021, 1:38 PM Matias N. <ma...@imap.cc> wrote:

> Most likely a single very powerful machine could be actually quite faster
> than GH
> since we could parallelize much harder and have all resources to ourselves.
> The problem is that rolling our own can be quite a pain to maintain IMHO
> (unless someone has access to some powerful high-availability spare
> machine).
> We would also have to redo all CI handling since it wouldn't be GitHub's.
>
> I also looked at alternative CI systems but I think we will consume free
> credits easily.
>
> Indeed the macOS build is really slow. But if we start to cut tests we
> lose the assurance given by the automated testing.
>
> Maybe we could try to share the ccache across runs. I personally have
> never found a ccache collision nor any other issue (I have had ccache
> enabled for years). I remember we tried this but I'm not sure whether
> the result was conclusive.
>
> Anyway, cancelling previous flows should get us back to a better place
> where we "only" wait for ~1.5hr for the build to complete.
>
> Best,
> Matias
>
> On Tue, Mar 30, 2021, at 17:17, Alan Carvalho de Assis wrote:
> > We definitely need a better server to support the CI; it doesn't have
> > enough processing power to run the CI when there are more than 5 PRs.
> > It doesn't scale well.
> >
> > Also, I think we could keep only one test for macOS because it is too
> > slow! Normally macOS takes more than 2h to complete.
> >
> > Maybe we could create some distributed CI farm and include
> > low-power hardware (e.g. Raspberry Pi boards) running from our homes
> > to help it, hehehe.
> >
> > Suggestions are welcome!
> >
> > BR,
> >
> > Alan
> >
> > On 3/30/21, Nathan Hartman <hartman.nathan@gmail.com> wrote:
> > > On Tue, Mar 30, 2021 at 3:30 PM Matias N. <matias@imap.cc> wrote:
> > >>
> > >> It appears we overwhelmed CI. There are a couple of running jobs
> (notably
> > >> one is a macOS run which is taking about 2hrs as of now) but they are
> for
> > >> PRs from 12hs ago at least. There are a multitude of queued runs for
> many
> > >> recent PRs. The problem is that new runs (from force pushes) do not
> cancel
> > >> previous runs so they remain queued apparently.
> > >
> > > Ouch!
> > >
> > >> I will see what can be done to have new pushes cancel new pending
> runs. In
> > >> the meantime we may have to manually cancel all queued workflows. Not
> sure
> > >> if there's a mass cancel to be done.
> > >
> > > Thanks for looking into it. Yes, it would be a good thing if new force
> > > pushes could cancel in-progress runs.
> > >
> > > Thanks,
> > > Nathan
> > >
> >
>

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by "Matias N." <ma...@imap.cc>.
Most likely a single very powerful machine could be actually quite faster than GH
since we could parallelize much harder and have all resources to ourselves.
The problem is that rolling our own can be quite a pain to maintain IMHO
(unless someone has access to some powerful high-availability spare machine).
We would also have to redo all CI handling since it wouldn't be GitHub's.

I also looked at alternative CI systems but I think we will consume free credits easily.

Indeed the macOS build is really slow. But if we start to cut tests we lose
the assurance given by the automated testing.

Maybe we could try to share the ccache across runs. I personally have never found
a ccache collision nor any other issue (I have had ccache enabled for years). I remember
we tried this but I'm not sure whether the result was conclusive.

Anyway, cancelling previous flows should get us back to a better place where we "only"
wait for ~1.5hr for the build to complete.

Best,
Matias

On Tue, Mar 30, 2021, at 17:17, Alan Carvalho de Assis wrote:
> We definitely need a better server to support the CI; it doesn't have
> enough processing power to run the CI when there are more than 5 PRs.
> It doesn't scale well.
>
> Also, I think we could keep only one test for macOS because it is too
> slow! Normally macOS takes more than 2h to complete.
> 
> Maybe we could create some distributed CI farm and include
> low-power hardware (e.g. Raspberry Pi boards) running from our homes
> to help it, hehehe.
> 
> Suggestions are welcome!
> 
> BR,
> 
> Alan
> 
> On 3/30/21, Nathan Hartman <hartman.nathan@gmail.com> wrote:
> > On Tue, Mar 30, 2021 at 3:30 PM Matias N. <matias@imap.cc> wrote:
> >>
> >> It appears we overwhelmed CI. There are a couple of running jobs (notably
> >> one is a macOS run which is taking about 2hrs as of now) but they are for
> >> PRs from 12hs ago at least. There are a multitude of queued runs for many
> >> recent PRs. The problem is that new runs (from force pushes) do not cancel
> >> previous runs so they remain queued apparently.
> >
> > Ouch!
> >
> >> I will see what can be done to have new pushes cancel new pending runs. In
> >> the meantime we may have to manually cancel all queued workflows. Not sure
> >> if there's a mass cancel to be done.
> >
> > Thanks for looking into it. Yes, it would be a good thing if new force
> > pushes could cancel in-progress runs.
> >
> > Thanks,
> > Nathan
> >
> 

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by Alan Carvalho de Assis <ac...@gmail.com>.
We definitely need a better server to support the CI; it doesn't have
enough processing power to run the CI when there are more than 5 PRs.
It doesn't scale well.

Also, I think we could keep only one test for macOS because it is too
slow! Normally macOS takes more than 2h to complete.

Maybe we could create some distributed CI farm and include
low-power hardware (e.g. Raspberry Pi boards) running from our homes
to help it, hehehe.

Suggestions are welcome!

BR,

Alan

On 3/30/21, Nathan Hartman <ha...@gmail.com> wrote:
> On Tue, Mar 30, 2021 at 3:30 PM Matias N. <ma...@imap.cc> wrote:
>>
>> It appears we overwhelmed CI. There are a couple of running jobs (notably
>> one is a macOS run which is taking about 2hrs as of now) but they are for
>> PRs from 12hs ago at least. There are a multitude of queued runs for many
>> recent PRs. The problem is that new runs (from force pushes) do not cancel
>> previous runs so they remain queued apparently.
>
> Ouch!
>
>> I will see what can be done to have new pushes cancel new pending runs. In
>> the meantime we may have to manually cancel all queued workflows. Not sure
>> if there's a mass cancel to be done.
>
> Thanks for looking into it. Yes, it would be a good thing if new force
> pushes could cancel in-progress runs.
>
> Thanks,
> Nathan
>

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by Abdelatif Guettouche <ab...@gmail.com>.
Looks like Apache's runners are having issues.  Other projects using
GitHub Actions have stuck queues as well.

On Tue, Mar 30, 2021 at 10:04 PM Nathan Hartman
<ha...@gmail.com> wrote:
>
> On Tue, Mar 30, 2021 at 3:30 PM Matias N. <ma...@imap.cc> wrote:
> >
> > It appears we overwhelmed CI. There are a couple of running jobs (notably one is a macOS run which is taking about 2hrs as of now) but they are for PRs from 12hs ago at least. There are a multitude of queued runs for many recent PRs. The problem is that new runs (from force pushes) do not cancel previous runs so they remain queued apparently.
>
> Ouch!
>
> > I will see what can be done to have new pushes cancel new pending runs. In the meantime we may have to manually cancel all queued workflows. Not sure if there's a mass cancel to be done.
>
> Thanks for looking into it. Yes, it would be a good thing if new force
> pushes could cancel in-progress runs.
>
> Thanks,
> Nathan

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by Nathan Hartman <ha...@gmail.com>.
On Tue, Mar 30, 2021 at 3:30 PM Matias N. <ma...@imap.cc> wrote:
>
> It appears we overwhelmed CI. There are a couple of running jobs (notably one is a macOS run which is taking about 2hrs as of now) but they are for PRs from 12hs ago at least. There are a multitude of queued runs for many recent PRs. The problem is that new runs (from force pushes) do not cancel previous runs so they remain queued apparently.

Ouch!

> I will see what can be done to have new pushes cancel new pending runs. In the meantime we may have to manually cancel all queued workflows. Not sure if there's a mass cancel to be done.

Thanks for looking into it. Yes, it would be a good thing if new force
pushes could cancel in-progress runs.

Thanks,
Nathan

Re: All PRs stuck in "Queued -- Waiting to run this check..."

Posted by "Matias N." <ma...@imap.cc>.
It appears we overwhelmed CI. There are a couple of running jobs (notably one is a macOS run which is taking about 2hrs as of now) but they are for PRs from 12hs ago at least. There are a multitude of queued runs for many recent PRs. The problem is that new runs (from force pushes) do not cancel previous runs so they remain queued apparently.

I will see what can be done to have new pushes cancel new pending runs. In the meantime we may have to manually cancel all queued workflows. Not sure if there's a mass cancel to be done.

Best,
Matias

On Tue, Mar 30, 2021, at 15:33, Nathan Hartman wrote:
> Hi,
> 
> Does anyone know why the GitHub PR prechecks don't seem to be running?
> 
> In particular, it seems that these four:
> 
> * Build / Fetch-Source (pull_request)
> * Build Documentation / build-html (pull_request)
> * Check / check (pull_request)
> * Lint / YAML (pull_request)
> 
> are all stuck in "Queued -- Waiting to run this check..."
> 
> for all PRs.
> 
> Thanks,
> Nathan
>