Posted to dev@airflow.apache.org by Jarek Potiuk <Ja...@polidea.com> on 2020/08/22 15:12:14 UTC

Faster builds on CI + increased stability + easier to reproduce CI problems

Hello everyone,

Just wanted to let you know that last week we merged quite an overhaul of
the CI architecture we have in GitHub Actions.

TL;DR: It should be faster, more stable, and it should be super-easy to
reproduce any CI failure locally.

CI builds should now be quite a bit faster, much more stable and, as a
side effect, easier to diagnose. There are a few PRs left to merge (solving
some teething problems and adding some optimizations), and we might need to
implement one workaround for a missing GitHub API, but it looks pretty good
after a few days of watching.

The gist of the change is that we can now use the new "workflow_run"
feature of GitHub Actions, which allows us to build each image only once
and reuse it for all the jobs. Previously those images were built (using
the latest sources) for every single job.

Some stats for average runs (the gains are much bigger in situations where
Python releases a new patch-level version):

   - Prepare image job: 5 minutes 30 seconds -> 1 minute 7 seconds (~80%
   improvement)
   - Longest job time: 34 minutes -> 29 minutes 30 seconds (~13%
   improvement in longest job)
   - Build time saved per build (!): 27 jobs * 4.5 minutes ~ 2h of machine
   build time saved for each build (!)
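
The figures in the list above can be re-derived with a few lines of Python.
This is only a back-of-the-envelope sketch: the timings and the job count
are the ones quoted above, while the function and variable names are made
up for illustration.

```python
def percent_reduction(before_s: float, after_s: float) -> float:
    """Return how much shorter `after_s` is than `before_s`, in percent."""
    return (before_s - after_s) / before_s * 100.0


# Timings quoted in the list above, converted to seconds.
prepare_before = 5 * 60 + 30   # 5 minutes 30 seconds
prepare_after = 1 * 60 + 7     # 1 minute 7 seconds
longest_before = 34 * 60       # 34 minutes
longest_after = 29 * 60 + 30   # 29 minutes 30 seconds

prepare_gain = percent_reduction(prepare_before, prepare_after)
longest_gain = percent_reduction(longest_before, longest_after)

# 27 jobs each skip roughly 4.5 minutes of image building.
saved_minutes = 27 * 4.5

print(f"Prepare image job: {prepare_gain:.0f}% shorter")
print(f"Longest job: {longest_gain:.0f}% shorter")
print(f"Machine time saved per build: {saved_minutes / 60:.1f} h")
```

The saving in the last item is machine time summed over all jobs, not
wall-clock time, which is why it is so much larger than the gain in the
longest job.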

This change should also improve overall stability. There were a number of
problems where building the image failed - this should now be ~10x less
likely to happen, as we build images only 3 times instead of ~30.
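
The "~10x" estimate can be checked with a simple model. As a rough sketch,
assume that every image build fails independently with the same small
probability p (the value of p below is hypothetical and chosen only for
illustration, it is not a measurement from our CI). The chance that at
least one of n builds fails is then 1 - (1 - p)^n:

```python
def any_build_fails(p: float, builds: int) -> float:
    """Probability that at least one of `builds` independent builds fails."""
    return 1.0 - (1.0 - p) ** builds


# Hypothetical per-build failure probability, for illustration only.
p = 0.005

before = any_build_fails(p, 30)  # image built in roughly 30 jobs
after = any_build_fails(p, 3)    # image built only 3 times

print(f"before: {before:.3%}, after: {after:.3%}, ratio: {before / after:.1f}x")
```

Under this model the ratio is always a little below 30 / 3 = 10 and gets
closer to 10 as p gets smaller, which matches the estimate above.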

As a result, we are better citizens, and it also means we should have far
less queuing time when several PRs start in quick succession.

Also - as a side effect, but an important one - we now have a super-easy
way to reproduce any failure in CI. This is the final setup I had in mind
when I implemented Breeze. Now anyone can just log in to the GitHub
registry and run this:

`breeze --github-image-id <RUN_ID> --backend <BACKEND> --python <X.Y>`

Then you should be dropped into the EXACT same environment that was used
for a particular failed "run" in GitHub Actions, including the Airflow
sources used for it. You do not have to check out the code, etc.

This means that you (or anyone else trying to help) should be able to
re-run most of the failed tests locally and reproduce the failures (and try
to fix them).

Documentation with all the details and the commands you can use is coming in
https://github.com/apache/airflow/pull/10380 - happy to get some reviews.

J.

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>

Re: Faster builds on CI + increased stability + easier to reproduce CI problems

Posted by Maxime Beauchemin <ma...@gmail.com>.
Great work! Investments in CI pay dividends to the whole community.
