Posted to dev@airflow.apache.org by Jarek Potiuk <Ja...@polidea.com> on 2020/10/11 22:37:44 UTC

Much more stable CI tests (hopefully!)

Hello everyone,

I have really high hopes for the CI change that we implemented over the
weekend. Over the last few weeks we experienced a lot of stability problems
with the CI, and our builds were rarely "green", mostly due to intermittent
or unrelated problems. We've implemented some workarounds and split the
tests into a larger number of smaller jobs, which so far has proven to be
much more stable and "greener".

You will see many more test checks than you are used to (up to 120 or so),
but they will be quite a bit faster. Also - if any of the checks fail for a
good reason, you should be able to find information on how to reproduce the
failures locally in the test output, so that you can fix them.

We will be watching and fixing any teething problems over the next few
days, but for now - please rebase to the latest master and try it out.

J.

-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>
[image: Polidea] <https://www.polidea.com/>

Re: Much more stable CI tests (hopefully!)

Posted by Jarek Potiuk <Ja...@polidea.com>.
I re-enabled all the optimizations yesterday. As of this morning, all PRs
that touch only some "areas" of the code should run only the relevant
tests and nothing else, so they should be faster than usual. We are still
fighting with the "job limit" that the whole ASF has, and we have applied
for some credits to build the self-hosted runners, but this should help a
lot.
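
To illustrate the idea of running only the relevant tests, here is a
minimal sketch in Python. It is not the actual Airflow CI logic: the
path patterns, the test type names and the select_test_types helper
are all hypothetical, chosen only to show how changed files could be
mapped to a subset of tests, with a fallback to the full suite.

```python
from fnmatch import fnmatch

# Hypothetical mapping from path patterns to test types. The real
# rules used by the Airflow CI are not given in this thread.
AREA_PATTERNS = {
    "Providers": ["airflow/providers/*", "tests/providers/*"],
    "API": ["airflow/api/*", "tests/api/*"],
    "CLI": ["airflow/cli/*", "tests/cli/*"],
}
ALL_TEST_TYPES = ["Core", *AREA_PATTERNS]


def needs_no_tests(path):
    """Assumed rule: .md files and .rst files outside docs/ need no tests."""
    if path.endswith(".md"):
        return True
    return path.endswith(".rst") and not path.startswith("docs/")


def select_test_types(changed_files):
    """Return the test types to run for a list of changed file paths."""
    relevant = [path for path in changed_files if not needs_no_tests(path)]
    if not relevant:
        return []  # text-only change: no test jobs at all
    selected = set()
    for path in relevant:
        matched = [
            area
            for area, patterns in AREA_PATTERNS.items()
            if any(fnmatch(path, pattern) for pattern in patterns)
        ]
        if not matched:
            # A file outside every known area: be safe, run everything.
            return list(ALL_TEST_TYPES)
        selected.update(matched)
    return sorted(selected)
```

With these assumed rules, select_test_types(["README.md"]) selects no
tests, a change under airflow/cli/ selects only "CLI", and a change to
any file outside the listed areas falls back to the full list.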

An example here: this build by Kaxil, once started, completed in under
1m30s and did not use precious resources from the job queue:
https://github.com/apache/airflow/pull/11782#pullrequestreview-515783657

J.

On Sun, Oct 18, 2020 at 9:43 PM Jarek Potiuk <Ja...@polidea.com>
wrote:

> Just to let everyone know - our quest to stabilise and speed up the CI
> continues.
>
> I just merged one of the final (yeah, final final final ...) optimizations,
> where just a typo correction in .md files or non-doc .rst files should take
> ~1m to complete.
>
> Yep. You read that right. Some simple changes will not trigger the full test
> suite, just a small relevant subset, which should be really fast.
>
> We will observe it and see whether we need to adjust it and fix any other
> teething issues, but I hope this will help in fighting the current job
> limits we have in the whole Apache organisation before we - hopefully -
> get self-hosted runners in place.
>
> J.

Re: Much more stable CI tests (hopefully!)

Posted by Jarek Potiuk <Ja...@polidea.com>.
Just to let everyone know - our quest to stabilise and speed up the CI
continues.

I just merged one of the final (yeah, final final final ...) optimizations,
where just a typo correction in .md files or non-doc .rst files should take
~1m to complete.

Yep. You read that right. Some simple changes will not trigger the full test
suite, just a small relevant subset, which should be really fast.

We will observe it and see whether we need to adjust it and fix any other
teething issues, but I hope this will help in fighting the current job
limits we have in the whole Apache organisation before we - hopefully -
get self-hosted runners in place.

J.

On Tue, Oct 13, 2020 at 9:19 PM Jarek Potiuk <Ja...@polidea.com>
wrote:

> I do expect some small teething problems again, but I hope the big one is
> over, and I will try to address those problems if they arise. Apologies for
> that - this was rather difficult to test at "Apache Organization" scale. We
> are also talking about adding some custom GitHub runners, because we expect
> the situation to deteriorate in the future if we don't.

Re: Much more stable CI tests (hopefully!)

Posted by Jarek Potiuk <Ja...@polidea.com>.
I do expect some small teething problems again, but I hope the big one is
over, and I will try to address those problems if they arise. Apologies for
that - this was rather difficult to test at "Apache Organization" scale. We
are also talking about adding some custom GitHub runners, because we expect
the situation to deteriorate in the future if we don't.

On Tue, Oct 13, 2020 at 9:14 PM Jarek Potiuk <Ja...@polidea.com>
wrote:

> There is bad news and good news :).
>
> * The bad news is that the change did not go well in its original scope.
> It turned out that many small jobs are not a good idea when you have 180
> slots in a queue and a growing number of Apache projects, yours included,
> competing for them. It seems that our jobs got starved a lot, and the
> effect was 2-3 hour waiting queues, which kept growing in the afternoon
> when the US started to wake up.
>
> * The good news is that I just merged a fix for that - instead of many
> small jobs, we now group several test types into single jobs, clean up
> between them, and reuse the machines. I believe this will be even more
> optimized, and it uses the same optimization concepts as before.
>
> I cancelled all the queued builds and asked people to rebase to the latest
> master. If you have not done so yet - please do it now!
>
> J.

Re: Much more stable CI tests (hopefully!)

Posted by Jarek Potiuk <Ja...@polidea.com>.
There is bad news and good news :).

* The bad news is that the change did not go well in its original scope.
It turned out that many small jobs are not a good idea when you have 180
slots in a queue and a growing number of Apache projects, yours included,
competing for them. It seems that our jobs got starved a lot, and the
effect was 2-3 hour waiting queues, which kept growing in the afternoon
when the US started to wake up.

* The good news is that I just merged a fix for that - instead of many small
jobs, we now group several test types into single jobs, clean up between
them, and reuse the machines. I believe this will be even more optimized,
and it uses the same optimization concepts as before.
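
The trade-off between many small jobs and fewer grouped jobs can be
sketched roughly as follows. This is only an illustration, not the real
workflow code: the helper names, the round-robin grouping and the figure
of 20 grouped jobs are assumptions. Only the 180 queue slots and the
"up to 120 or so" checks come from this thread, and treating each check
as one queued job is itself an assumption.

```python
def group_test_types(test_types, jobs):
    """Spread test types round-robin over at most `jobs` groups."""
    groups = [[] for _ in range(min(jobs, len(test_types)))]
    for index, test_type in enumerate(test_types):
        groups[index % len(groups)].append(test_type)
    return groups


def run_grouped_job(group, run_tests, clean_up):
    """Run each test type of one job in turn on the same machine.

    `run_tests` and `clean_up` are caller-supplied callables; cleaning
    up after every test type lets the machine be reused by the next one.
    """
    results = {}
    for test_type in group:
        try:
            results[test_type] = run_tests(test_type)
        finally:
            clean_up()
    return results


def concurrent_builds(queue_slots, jobs_per_build):
    """How many builds fit in the queue at once (jobs_per_build > 0)."""
    return queue_slots // jobs_per_build
```

Under these assumptions, concurrent_builds(180, 120) is 1, so a single
PR could occupy most of the shared queue, while a hypothetical 20
grouped jobs per build gives concurrent_builds(180, 20) == 9. In
practice the slots are shared with other Apache projects, so the real
numbers would be lower in both cases.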

I cancelled all the queued builds and asked people to rebase to the latest
master. If you have not done so yet - please do it now!

J.


On Mon, Oct 12, 2020 at 1:27 AM Daniel Imberman <da...@gmail.com>
wrote:

> Thanks Jarek! This was much needed and should lead to a cleaner dev process

Re: Much more stable CI tests (hopefully!)

Posted by Daniel Imberman <da...@gmail.com>.
Thanks Jarek! This was much needed and should lead to a cleaner dev process
