Posted to builds@apache.org by Jarek Potiuk <Ja...@polidea.com> on 2021/01/08 19:09:46 UTC

GA again unreasonably slow (again)

Hello everyone (Gavin, Sander especially),

Over the last few days the queue for GA (GitHub Actions) has again been
completely blocked. We get 2-4 jobs in parallel at most, and our speed of
merging PRs has dropped to one per 4-5 hours.

We really need to find a way to solve the problem together with our GitHub
account contacts, whom we were supposed to meet, because it will only get
worse.

My colleague is waiting (
https://github.com/apache/airflow/pull/13409#issuecomment-756364484) with
5 PRs for my PR to be merged.

I submitted it (again) yesterday morning only to find out in the evening
that it had failed in the middle. This morning I fixed it (I hope) and
resubmitted it. It is now 8 pm and I am still only about 2/3 of the way
through (30 out of 50 checks green). It usually takes up to 30 minutes to
complete.

There are things that could probably be improved with INFRA involvement:

1) I have heavily optimized our setup. I literally ran out of optimization
ideas yesterday.

2) We secured our funds for self-hosted runners, however

3) We still cannot use self-hosted runners due to
https://docs.github.com/en/free-pro-team@latest/actions/hosting-your-own-runners/about-self-hosted-runners#self-hosted-runner-security-with-public-repositories
.

Ash from Airflow even prepared a PR:
https://github.com/actions/runner/pull/783 that should allow us to mitigate
the security problem with self-hosted runners for public repos, but we have
not seen any response since the 2nd of November.

4) We are working (Ash again) on using the PR to make a fork of the runner
and set it up regardless of approval by the GH Actions team
(https://github.com/ashb/runner/commit/448341ee47c123f0d3d56c0bb1be9d292fc646ee),
but we are not sure how secure and robust it will be, because we basically
have to rebase our changes quickly and automatically on top of every new
release from GitHub. This is madness and it will cost us a lot of
engineering and maintenance time.

5) We (Tobiasz from the Airflow team) even developed this (far from
perfect) dashboard that gathers information about the number of GA
workflows in progress and queued per project, and it clearly shows the
situation is getting worse by the day:

https://pasteboard.co/JIJa5Xg.png

6) I have opened 18 tickets with GitHub support, and pretty much all of them
are either still recurring or we found our own way to mitigate them:

https://pasteboard.co/JIJbIC9.png

7) I am not even mentioning the two critical security issues opened for
GitHub Actions as a result of the Xmas incident (they were raised through
bounty.github.com and are still waiting for acknowledgment today).
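
For readers who want to reproduce the kind of per-project numbers mentioned
in point 5, here is a minimal sketch. It is not the code behind the dashboard
above; it only assumes the public GitHub REST endpoint that lists workflow
runs for a repository, filtered by status. The helper names are made up for
illustration, and it counts workflow runs, not individual jobs.

```python
import json
import urllib.request
from urllib.parse import urlencode

API_ROOT = "https://api.github.com"


def runs_url(owner, repo, status):
    """URL that lists the workflow runs of one repository in one state."""
    # per_page=1 keeps the response small; only `total_count` is needed.
    query = urlencode({"status": status, "per_page": 1})
    return f"{API_ROOT}/repos/{owner}/{repo}/actions/runs?{query}"


def count_runs(payload):
    """The list endpoint reports the number of matching runs in `total_count`."""
    return int(payload.get("total_count", 0))


def fetch_count(owner, repo, status, token=None):
    """One API call per repository and state, so the API quota adds up quickly."""
    request = urllib.request.Request(runs_url(owner, repo, status))
    request.add_header("Accept", "application/vnd.github+json")
    if token:
        request.add_header("Authorization", f"Bearer {token}")
    with urllib.request.urlopen(request) as response:
        return count_runs(json.load(response))


def busiest_first(counts):
    """Sort {repo: {"queued": n, "in_progress": m}} by total load, descending."""
    return sorted(counts.items(),
                  key=lambda item: item[1]["queued"] + item[1]["in_progress"],
                  reverse=True)
```

Calling fetch_count("apache", "airflow", "queued") and
fetch_count("apache", "airflow", "in_progress") for each repository of
interest, and passing the collected counts to busiest_first, gives a rough
ranking of who is occupying the shared queue.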

We are pretty much stuck and there is no viable option, it seems. I have,
again, literally run out of ideas about what we can do. It seems that at least the
"self-hosted security problem" is something that could be addressed without
a heavy investment of either INFRA or GitHub, but we have no leverage on
them.

Is there something we can do via our GitHub account? We were supposed to
have a meeting with them but it got cancelled.

Can we at least organize the meeting and urge them to fix the security
problem for self-hosted runners on public repositories?

This is not a complaint, this is just crying for HELP ... We are terribly
stuck.


J,


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>

Re: GA again unreasonably slow (again)

Posted by Jarek Potiuk <ja...@potiuk.com>.
The preliminary description (still lacking some recent changes and details)
is here:
https://cwiki.apache.org/confluence/display/INFRA/Self-hosted+GitHub+runners
and you can grab Ash, as he mentioned in the comments, if you want to get
some more details on it.

On Mon, Feb 8, 2021 at 11:01 PM Chris Lambertus <cm...@apache.org> wrote:

>
>
> > On Feb 8, 2021, at 1:51 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > This uses https://github.com/actions/runner/pull/783 to not have
> > un-trusted users run code (security is based on the actors of the commit -
> > committers' PRs and direct pushes are allowed to run builds on self-hosted
> > runners) on our hosts, and then a combination of a GitHub Application, AWS
> > Lambda and an AWS Auto-Scaling Group.
>
>
> I’d be interested in additional details on how you’ve implemented Lambda
> and AWS Auto-scaling for this.
>
> -Chris
>

-- 
+48 660 796 129

Re: GA again unreasonably slow (again)

Posted by Chris Lambertus <cm...@apache.org>.

> On Feb 8, 2021, at 1:51 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
> This uses https://github.com/actions/runner/pull/783 to not have
> un-trusted users run code (security is based on the actors of the commit -
> committers' PRs and direct pushes are allowed to run builds on self-hosted
> runners) on our hosts, and then a combination of a GitHub Application, AWS
> Lambda and an AWS Auto-Scaling Group.


I’d be interested in additional details on how you’ve implemented Lambda and AWS Auto-scaling for this.

-Chris




Re: GA again unreasonably slow (again)

Posted by Jarek Potiuk <ja...@potiuk.com>.
This uses https://github.com/actions/runner/pull/783 to not have
un-trusted users run code (security is based on the actors of the commit -
committers' PRs and direct pushes are allowed to run builds on self-hosted
runners) on our hosts, and then a combination of a GitHub Application, AWS
Lambda and an AWS Auto-Scaling Group.
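
To make the actor-based rule concrete, here is a minimal sketch of the idea.
It is not the code from the runner PR linked above; the Event type, the
function names, the example logins and the fallback to a GitHub-hosted runner
are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str   # e.g. "push" or "pull_request"
    actor: str  # login of the user who triggered the event


def may_use_self_hosted(event, committers):
    """Allow self-hosted runners only when the triggering actor is a committer."""
    # GitHub logins are case-insensitive, so compare in lower case.
    # `committers` is expected to hold lower-case logins.
    return event.actor.lower() in committers


def pick_runner(event, committers):
    """Assumed policy: everyone else falls back to a GitHub-hosted runner."""
    if may_use_self_hosted(event, committers):
        return "self-hosted"
    return "ubuntu-latest"


# Hypothetical logins, for illustration only.
COMMITTERS = frozenset({"committer-a", "committer-b"})
print(pick_runner(Event("pull_request", "Committer-A"), COMMITTERS))
print(pick_runner(Event("pull_request", "first-time-contributor"), COMMITTERS))
```

The point of the design is that the decision depends only on who triggered
the build, never on the content of the PR, so code from an untrusted fork
never reaches our own hosts.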

On Mon, 8 Feb 2021 at 09:58, Antoine Pitrou <an...@python.org> wrote:

>
> Hi Jarek,
>
> Thank you for the document.  Could you tell us more about the "custom
> security layer" that you implemented?
>
> Regards
>
> Antoine.
>
>

Re: GA again unreasonably slow (again)

Posted by Antoine Pitrou <an...@python.org>.
Hi Jarek,

Thank you for the document.  Could you tell us more about the "custom
security layer" that you implemented?

Regards

Antoine.


On 08/02/2021 at 01:44, Jarek Potiuk wrote:
> For anyone following this thread - an update on the progress we have made in
> Airflow on building self-hosted infrastructure for GitHub Actions.
> 
> Ash from Airflow is really close to finalizing the work on a nice
> auto-scaling framework for self-hosted workers, and we also checked what
> the best value for money we can get is.
> 
> I've run some analysis on the performance and tested my hypothesis (based
> on earlier experiences) about the significant optimisations we can get.
> 
> I've finished my analysis of potential optimizations we can get on our CI
> with the self-hosted runners that Ash created. I did some performance
> testing and a (very crude) comparison of the "traditional approach"
> (2-CPU instances with local SSDs running the tests) with something I have
> already tested several times on various CI arrangements: running the tests
> on high-memory instances (8 CPUs, 64 GB of memory) and running everything
> (including the Docker engine) in "tmpfs", a huge ramdisk.
> It seems that 1h 20 minutes of test running can be decreased 8x (!) using
> this approach (and parallelising some tests), while at the same time
> decreasing the cost 2x (!). Yep. You heard right. We can have faster builds
> this way and pay less for them. It seems that we will be able to decrease
> the time to run all tests for one combination from 1h 20 minutes to 10
> minutes.
> This is possible because Ash and his team did a great job setting up
> auto-scaling EC2 instance runners on our Amazon EC2 account (we have
> credits from Amazon to run those jobs, and Astronomer has also offered a
> donation to keep it running). It seems that by utilizing it we can not
> only pay less but also get much faster builds.
> 
> If you are interested - my document is here. It is open for comments, and I
> am happy to add you as editors if you want (just send me your Gmail address
> privately). It is rather crude; I had no time to put more effort into it due
> to some significant changes in my company, but it should be easy to compare
> the values and see the actual improvements we can get. There are likely a
> few shortcuts there and some of the numbers are "back-of-the-envelope", and
> we are going to validate them further when we implement all the
> optimisations, but the conclusions should be pretty sound.
> 
> https://docs.google.com/document/d/1ZZeZ4BYMNX7ycGRUKAXv0s6etz1g-90Onn5nRQQHOfE/edit#
> 
> J.
> 
> 
> On Fri, Jan 8, 2021 at 10:02 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> 
>>
>>> We should be able to make an efficient query via the GraphQL API, right? I
>>> found the REST API for Actions to be a little underwhelming.
>>
>>
>> That was the first thing I checked when we started looking at the stats.
>> Unfortunately, the last time I checked (and I even opened an issue about it
>> with GitHub support) there was no GitHub Actions GraphQL API.
>>
>> I got a GH support answer: "Yeah we know GH API does not have
>> GraphQL support yet, sorry". I think it has not changed since.
>>
>>
>>> We have tried to make our builds faster with more caching but it's not
>>> easy: since it's an embedded systems project we need to target a lot of
>>> configurations and most changes impact all builds.
>>>
>>
>> Indeed, I know how much of my time was spent on optimising Airflow's GH
>> usage. I think we eventually decreased the usage 10x or more. But it never
>> helped, for as long as anyone can, even accidentally, block all the slots
>> in almost no time at all. We have no organisation-wide way to block this,
>> and this is the problem.
>>
>> Right now I could:
>> a) mine cryptocurrency using PRs to any Apache project
>> b) block the queue for everyone
>>
>> I do not even have to be an Apache committer to do that. It's enough to
>> open just one well-crafted PR that spins off 180 jobs that run for 6 hours.
>> It's super-flawed.
>>
>>
>>>
>>> We too would like to take advantage of our own runners, but
>>> more for the ability to do Hardware In the Loop testing. We have avoided
>>> it for the reasons already mentioned.
>>>
>>
>> A self-hosted runner seems for now to be the only "reasonable" option, but
>> the security issues with the current runner do not allow us to do it.
>>
>>>
>>> --Brennan
>>>
>>
>>
>> --
>> +48 660 796 129
>>
> 
> 
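
A side note on the "8x faster at half the cost" figures quoted above: the
two numbers together say something about instance pricing. The sketch below
works through the arithmetic under one simplifying assumption that is not in
the thread: cost is the hourly price of a single instance multiplied by its
running time. No actual prices are given in the thread, so none are used.

```python
from fractions import Fraction


def implied_price_ratio(speedup, cost_reduction):
    """How much more expensive per hour the faster instance can be.

    With cost = hourly_price * runtime:
        price_new * (runtime_old / speedup) = price_old * runtime_old / cost_reduction
    which gives price_new / price_old = speedup / cost_reduction.
    """
    return Fraction(speedup) / Fraction(cost_reduction)


baseline_minutes = 80   # "1h 20 minutes" from the message
speedup = 8             # "decreased 8x"
cost_reduction = 2      # "decreasing the cost 2x"

new_minutes = Fraction(baseline_minutes, speedup)
ratio = implied_price_ratio(speedup, cost_reduction)

print(f"runtime: {baseline_minutes} min -> {new_minutes} min")
print(f"the faster instance may cost up to {ratio}x more per hour")
```

In other words, under that assumption the high-memory instance could be up
to four times as expensive per hour as the small one and the total bill would
still be halved. Parallelising tests across instances changes the picture, so
treat this as a back-of-the-envelope check only.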

Re: GA again unreasonably slow (again)

Posted by Jarek Potiuk <ja...@potiuk.com>.
The reasoning for selective checks is here:
https://github.com/apache/airflow/blob/master/PULL_REQUEST_WORKFLOW.rst
(correct link)
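
For anyone unfamiliar with the idea, "selective checks" boils down to
deciding from the list of changed files which groups of tests need to run at
all. The sketch below only illustrates that idea; the path patterns and check
names are invented and are not Airflow's actual rules, which are described in
the document linked above.

```python
from fnmatch import fnmatchcase

# Hypothetical mapping from path patterns to the checks they require.
RULES = [
    ("docs/*", {"docs"}),
    ("chart/*", {"helm"}),
    ("*.py", {"unit"}),
]
ALL_CHECKS = {"docs", "helm", "unit", "integration"}


def select_checks(changed_files):
    """Return the set of check groups needed for a list of changed paths."""
    selected = set()
    for path in changed_files:
        matched = False
        for pattern, checks in RULES:
            if fnmatchcase(path, pattern):
                selected |= checks
                matched = True
        if not matched:
            # A file we know nothing about: be safe and run everything.
            return set(ALL_CHECKS)
    return selected


print(sorted(select_checks(["docs/index.rst"])))
print(sorted(select_checks(["setup.cfg"])))
```

The important design choice is the fallback: anything the rules do not
recognise triggers the full matrix, so the optimisation can only skip work
when it is confident the skipped checks are irrelevant.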

On Tue, Feb 9, 2021 at 7:05 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> | The real hard problem is knowing when a change requires full regression
> and integration testing of all possible platforms.
>
> And here I absolutely agree too. Even more than that - I am a firm
> practitioner of that. This is what we already implemented in Airflow (the
> whole reasoning how and why it is implemented is here:
> https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#)
> (we still have a few optimisations left). We call it "selective checks".
>
> And this is what I already proposed that the Pulsar team implement too -
> just take a look at chapter 4) in their document:
> https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#
>
>
> In Airflow it helped a lot at some point. We got ~70% less load on the
> queue - mainly thanks to selective checks.
>
> Unfortunately - with the shared queue of the ASF, it only helped for some
> two weeks - precisely because we were the only ones who had done that,
> so while being gentle to others, we have not got the love back. But - this
> is not a complaint, I think it is a natural thing when you have shared
> resources that you do not pay for. People will not optimise as it is
> a huge investment for them (not only the cost of doing it, but also
> increased complexity). It's been mentioned several times that Airflow's CI
> is over-engineered, but I think it is simply heavily optimized (which
> brings necessary complexity).
>
> Again - there is no way (and it would not even be fair TBH) to enforce
> this optimisation of their processes for the benefit of others, if they
> have no incentives. This is simply a consequence of the model of a "free
> shared motorway". No matter how hard you try - you will eventually end up
> with traffic jams.
>
> J.
>
>
> On Tue, Feb 9, 2021 at 6:40 PM Dave Fisher <wa...@comcast.net> wrote:
>
>> The real hard problem is knowing when a change requires full regression
>> and integration testing of all possible platforms.
>>
>> I think projects are allowing lazy engineering if those making changes
>> don’t know the level of testing needed for their changes.
>>
>> Now with easy lightweight branches all being fully tested ....
>>
>> This is my 10,000 meter view.
>>
>> But then I’m old school and on my first job the mainframe printout
>> included how much the run I made was costing my boss in $.
>>
>> Best Regards,
>> Dave
>>
>> Sent from my iPhone
>>
>> > On Feb 9, 2021, at 9:20 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >
>> > Absolutely agree Matt. Throwing more hardware at "all of the projects" is
>> > definitely not going to help - I have been saying that from the beginning -
>> > it is like building free motorways - the more you build, the more traffic
>> > flows and the traffic jams remain. That's why I think a reasonable
>> > self-hosted solution that every project owns (including getting the
>> > credits for that) is the only viable solution IMHO - only then do you
>> > really start optimising stuff, because you own both the problem and the
>> > solution (and you do not uncontrollably impact other projects).
>> >
>> > We've just opened up the self-hosted solution in Airflow today -
>> > announcement from Ash here:
>> > https://lists.apache.org/thread.html/r2e398f86479e4cbfca13c22e4499fb0becdbba20dd9d6d47e1ed30bd%40%3Cdev.airflow.apache.org%3E
>> > and we will be working out any "teething problems".
>> >
>> > Once we are past that, we are on our way to achieving the goal from the
>> > first paragraph - i.e. to be able to control both the problem and the
>> > solution on a per-project basis. And once we get some learnings - I am
>> > sure we will share our solution and findings more widely with other
>> > projects, so that they can apply similar solutions. This especially
>> > concerns the missing "security piece", which was a "blocker" so far, but
>> > also the auto-scaling and tmpfs-optimisation results (which are a nice
>> > side-effect if we can eventually get the 10x improvement in feedback
>> > time, as it seems we can).
>> >
>> > We love data @Airflow, so we will gather some stats that everyone will be
>> > able to analyse to see how much they can gain - not only from removing
>> > the queue bottleneck but also from improving the most important (in my
>> > opinion) metric for CI, which is feedback time. I personally think that
>> > in CI there are only two important metrics: reliability and feedback
>> > time. Nothing else (including cost) matters. But if we get all three
>> > improved, that would be something that we will be happy for other
>> > projects to benefit from as well.
>> >
>> > J.
>> >
>> >
>> >
>> >> On Tue, Feb 9, 2021 at 3:16 PM Matt Sicker <bo...@gmail.com> wrote:
>> >>
>> >> To be honest, this sounds exactly like the usual CI problem on every
>> >> platform. As your project scales up, CI becomes a Hard Problem. I don’t
>> >> think throwing hardware at it indefinitely works, though your research
>> >> here is finding most of the useful things.
>> >>
>> >>> On Tue, Feb 9, 2021 at 02:21 Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>
>> >>> The report shows only the top contenders. And yes - we know it is flawed -
>> >>> because it shows workflows, not jobs (if you read the disclaimers - we
>> >>> simply do not have enough API call quota to get detailed information for
>> >>> all projects).
>> >>>
>> >>> So this is anecdotal. I also get no queue when I submit a PR at 11 pm.
>> >>> Actually, the whole Airflow committer team had to switch to the "night
>> >>> shift" because of that. And the most "traffic-heavy" projects - Spark,
>> >>> Pulsar, Superset, Beam, Airflow - I think some of the top "traffic"
>> >>> projects experience the same issues and queues of several hours when they
>> >>> run during the EMEA day/US morning. And we all try to help each other.
>> >>> For example, yesterday I helped the Pulsar team implement the most
>> >>> aggressive way of cancelling their workflows,
>> >>> https://github.com/apache/pulsar/pull/9503
>> >>> (you can find a pretty good explanation there of why and how it was
>> >>> implemented this way). We are also working together with the Pulsar team
>> >>> to optimize their workflow - there is a document
>> >>> https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit
>> >>> where several people are adding their suggestions (including myself,
>> >>> based on Airflow experiences).
>> >>>
>> >>> And with yetus' 12 (!) workflow runs over the last 2 months (!)
>> >>> https://pasteboard.co/JNwGLiR.png - indeed there is a high chance you
>> >>> have not experienced it, especially as you are the only person committing
>> >>> there. This is hardly representative of other projects that have 100s of
>> >>> committers and 100s of PRs a day. I am not sure if you are aware of
>> >>> that, but those are the most valuable projects for the ASF - as those are
>> >>> the ones that actually build community (following the "community over
>> >>> code" motto). If you have 3 PRs in 3 months and there are 200 other
>> >>> projects using GA, I think yetus is not going to show up in any
>> >>> meaningful statistics.
>> >>>
>> >>> I am not sure that a project with 3 PRs in 2 months is the best basis for
>> >>> drawing conclusions about the overall Apache organisation. I think
>> >>> drawing conclusions from the experiences of 5 actually active projects
>> >>> with sometimes even 100 PRs a day is probably better justified (yep -
>> >>> there are such projects).
>> >>> So I would probably agree it has little influence on projects that have
>> >>> no traffic, but it has an enormous influence on projects that actually
>> >>> have traffic. You have several teams of people now scrambling to somehow
>> >>> manage their CI, as it is unbearable now. Is this serious? I'd say so.
>> >>>
>> >>>        When you see Airflow backed up, maybe you should try submitting a
>> >>>        PR to another project yourself to see what happens.
>> >>>
>> >>> I am already spending a TON of my private time trying to help others in
>> >>> the community. I would really appreciate a little help from your side. So
>> >>> maybe you could submit 2-3 PRs yourself any time Monday - Friday, 12pm
>> >>> CET -> 8pm CET - this is when the bottlenecks regularly happen. Please
>> >>> let everyone know your findings.
>> >>>
>> >>> J,
>> >>>
>> >>>
>> >>> On Tue, Feb 9, 2021 at 8:35 AM Allen Wittenauer
>> >>> <aw...@effectivemachines.com.invalid> wrote:
>> >>>
>> >>>>
>> >>>>
>> >>>>> On Feb 8, 2021, at 5:00 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
>> >>>>>
>> >>>>>> I'm not convinced this is true. I have yet to see any of my PRs for
>> >>>>> "non-big" projects getting queued while Spark, Airflow, others are.
>> >>> Thus
>> >>>>> why I think there are only a handful of projects that are getting
>> >> upset
>> >>>>> about this but the rest of us are like "meh whatever."
>> >>>>>
>> >>>>> Do you have any data on that? Or is it just anecdotal evidence?
>> >>>>
>> >>>>        Totally anecdotal.  Like when I literally ran a Yetus PR
>> during
>> >>>> the builds meeting as you were complaining about Airflow having an X
>> >> deep
>> >>>> queue. My PR ran fine, no pause.
>> >>>>
>> >>>>> You can see some analysis and actually even charts here:
>> >>>>>
>> >>>
>> https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
>> >>>>
>> >>>>        Yes, and I don't even see Yetus showing up.  I wonder how many
>> >>>> other projects are getting dropped from the dataset....
>> >>>>
>> >>>>> Maybe you have a very tiny "PR traffic" and it is mostly in the time
>> >>> zone
>> >>>>> that is not affected?
>> >>>>
>> >>>>        True, it has very tiny PR traffic right now.  (Sep/Oct/Nov was
>> >>>> different though)  But if it was one big FIFO queue, our PR jobs
>> would
>> >>> also
>> >>>> get queued.  They aren't even when I go look at one of the other
>> >> projects
>> >>>> that does have queued jobs.
>> >>>>
>> >>>>        When you see Airflow backed up, maybe you should try
>> >> submitting a
>> >>>> PR to another project yourself to see what happens.
>> >>>>
>> >>>>        All I'm saying is: right now, that document feels like it is
>> >>>> _greatly_ overstating the problem and now that you point it out,
>> >> clearly
>> >>>> dropping data.  It is problem, to be sure, but not all GitHub Actions
>> >>>> projects are suffering.  (I wouldn't be surprised if smaller projects
>> >> are
>> >>>> actually fast tracked through the build queue in order to avoid a
>> >> tyranny
>> >>>> of the majority/resource starvation problem... which would be ironic
>> >>> given
>> >>>> how much of an issue that is at the ASF.)
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> +48 660 796 129
>> >>>
>> >>
>> >
>> >
>> > --
>> > +48 660 796 129
>>
>
>
> --
> +48 660 796 129
>


-- 
+48 660 796 129

Re: GA again unreasonably slow (again)

Posted by Jarek Potiuk <ja...@potiuk.com>.
| The real hard problem is knowing when a change requires full regression
and integration testing of all possible platforms.

And here I absolutely agree too. Even more than that - I actively practise
it. This is what we already implemented in Airflow (the whole reasoning
behind how and why it is implemented is here:
https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#)
(we still have a few optimisations left). We call it "selective checks".

And this is what I already proposed that the Pulsar team implement too - just
take a look at chapter 4) in their document:
https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit#


In Airflow it helped a lot at some point. We got ~70% less load on the
queue - mainly thanks to selective checks.
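
To make the idea concrete, here is a minimal sketch of what a selective
check can look like. It is not Airflow's actual implementation: the path
patterns, the group names and the select_test_groups helper are all
hypothetical, and the only point is that a change maps to the smallest set
of test groups that covers it, falling back to the full run when in doubt.

```python
from fnmatch import fnmatchcase

# Hypothetical rules: path pattern -> test groups that a change there needs.
# These patterns and group names are assumptions made for illustration only.
RULES = [
    ("docs/*", {"docs"}),
    ("airflow/providers/*", {"providers"}),
    ("airflow/www/*", {"www"}),
    ("*.md", set()),  # plain documentation files need no tests at all
]
ALL_GROUPS = {"core", "providers", "www", "docs", "integration"}


def select_test_groups(changed_files):
    """Return the set of test groups to run for the given changed paths."""
    selected = set()
    for path in changed_files:
        for pattern, groups in RULES:
            # fnmatchcase's "*" also matches "/" so "docs/*" covers subfolders.
            if fnmatchcase(path, pattern):
                selected |= groups
                break
        else:
            # No rule matched: we cannot reason about this file, so be safe
            # and request the full test matrix.
            return set(ALL_GROUPS)
    return selected
```

The fallback is the important design choice here: an unknown file always
triggers everything, so the optimisation can only skip work it understands.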

Unfortunately - with the shared queue of the ASF, it only helped for some
two weeks - precisely because we were the only ones who had done that,
so while being gentle to others, we have not got the love back. But - this
is not a complaint, I think it is a natural thing when you have shared
resources that you do not pay for. People will not optimise as it is a
huge investment for them (not only the cost of doing it, but also
increased complexity). It's been mentioned several times that Airflow's CI
is over-engineered, but I think it is simply heavily optimized (which
brings necessary complexity).

Again - there is no way (and it would not even be fair TBH) to enforce this
optimisation of their processes for the benefit of others, if they have no
incentives. This is simply a consequence of the model of a "free shared
motorway". No matter how hard you try - you will eventually end up with
traffic jams.

J.




-- 
+48 660 796 129

Re: GA again unreasonably slow (again)

Posted by Dave Fisher <wa...@comcast.net>.
The real hard problem is knowing when a change requires full regression and integration testing of all possible platforms.

I think projects are allowing lazy engineering if those making changes don’t know the level of testing needed for their changes.

Now with easy lightweight branches all being fully tested ....

This is my 10,000 meter view.

But then I’m old school and on my first job the mainframe printout included how much the run I made was costing my boss in $.

Best Regards,
Dave

Sent from my iPhone


Re: GA again unreasonably slow (again)

Posted by Jarek Potiuk <ja...@potiuk.com>.
Absolutely agree Matt. Throwing more hardware at "all of the projects" is
definitely not going to help - I have been saying that from the beginning -
it is like building free motorways - the more you build, the more traffic
flows and the traffic jams remain. That's why I think a reasonable
self-hosted solution that every project owns (including getting the credits
for that) is the only viable solution IMHO - only then do you really start
optimising stuff because you own both - the problem and the solution
(and you do not uncontrollably impact other projects).

We've just opened up the self-hosted solution in Airflow today -
announcement from Ash here:
https://lists.apache.org/thread.html/r2e398f86479e4cbfca13c22e4499fb0becdbba20dd9d6d47e1ed30bd%40%3Cdev.airflow.apache.org%3E
and we will be working out any "teething problems".

Once we are past that, we are on our way to achieving the goal from the
first paragraph - i.e. being able to control both problem and solution on a
per-project basis. And once we get some learnings - I am sure we will share
our solution and findings more widely with other projects, so that they can
apply similar solutions. This especially concerns the missing "security
piece" which was a "blocker" so far, but also auto-scaling and
tmpfs-optimisation results (a nice side-effect, if we can eventually get the
10x improvement in feedback time, as it seems we can).

We love data @Airflow so we will gather some stats that everyone will be
able to analyse to see how much they can gain - not only from removing the
queue bottleneck but also from improving the most important (in my opinion)
metric for CI - which is feedback time. I personally think that in CI there
are only two important metrics: reliability and feedback time.
Nothing else (including cost) matters. But if we get all three improved,
that would be something that we will be happy for other projects to also
benefit from.
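
As a rough sketch of what such stats could measure: feedback time here is
taken to be the time from submitting a run until it completes, i.e. queue
wait plus execution. The record layout, the field names and the sample
timestamps below are all made up for illustration; they are not real
Airflow measurements.

```python
from datetime import datetime
from statistics import median


def minutes_between(start_iso, end_iso):
    """Minutes elapsed between two ISO-8601 timestamps."""
    delta = datetime.fromisoformat(end_iso) - datetime.fromisoformat(start_iso)
    return delta.total_seconds() / 60


def summarize(runs):
    """Median queue wait and median feedback time, both in minutes."""
    queue = [minutes_between(r["created"], r["started"]) for r in runs]
    feedback = [minutes_between(r["created"], r["completed"]) for r in runs]
    return {
        "median_queue_min": median(queue),
        "median_feedback_min": median(feedback),
    }


# Hypothetical sample runs (invented values, not measured data).
SAMPLE_RUNS = [
    {"created": "2021-02-09T12:00:00", "started": "2021-02-09T14:00:00",
     "completed": "2021-02-09T14:30:00"},
    {"created": "2021-02-09T23:00:00", "started": "2021-02-09T23:01:00",
     "completed": "2021-02-09T23:31:00"},
    {"created": "2021-02-10T13:00:00", "started": "2021-02-10T16:00:00",
     "completed": "2021-02-10T16:30:00"},
]
```

Reporting the queue wait separately from the total makes it visible how much
of the feedback time is the shared queue rather than the tests themselves.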

J.



On Tue, Feb 9, 2021 at 3:16 PM Matt Sicker <bo...@gmail.com> wrote:

> To be honest, this sounds exactly like the usual CI problem on every
> platform. As your project scales up, CI becomes a Hard Problem. I don’t
> think throwing hardware at it indefinitely works, though your research here
> is finding most of the useful things.
>
> On Tue, Feb 9, 2021 at 02:21 Jarek Potiuk <ja...@potiuk.com> wrote:
>
> > The report shows only top contenders. And yes - we know it is flawed -
> > because it shows workflows not jobs (if you read the disclaimers - we
> > simply do not have enough API call quota to get detailed information for
> > all projects).
> >
> > So this is anecdotal. I also get no queue when I submit a PR at 11 pm.
> > Actually the whole Airflow committer team had to switch to the "night
> > shift" because of that. And I think some of the top "traffic" projects
> > (Spark, Pulsar, Superset, Beam, Airflow) experience the same issues, with
> > queues of several hours when they run during the EMEA day/US morning. And
> > we all try to help each other (for example, yesterday I helped the Pulsar
> > team implement the most aggressive way of cancelling their workflows:
> > https://github.com/apache/pulsar/pull/9503 - you can find a pretty good
> > explanation there of why and how it was implemented this way). We are also
> > working together with the Pulsar team to optimize their workflow - there
> > is a document
> > https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit
> > where several people are adding their suggestions (including myself, based
> > on Airflow experiences).
> >
> > And with yetus' 12 (!) workflow runs over the last 2 months (!)
> > https://pasteboard.co/JNwGLiR.png - indeed there is a high chance you have
> > not experienced it, especially since you are the only person committing
> > there. This is hardly representative of other projects that have 100s of
> > committers and 100s of PRs a day. I am not sure if you are aware of
> > that, but those are the most valuable projects for the ASF - as those are
> > the ones that actually build community (following the "community over
> > code" motto). If you have 3 PRs in 3 months and there are 200 other
> > projects using GA, I think yetus is not going to show up in any meaningful
> > statistics.
> >
> > I am not sure if drawing a conclusion from a project that has 3 PRs in 2
> > months is the best way of drawing conclusions for the overall Apache
> > organisation. I think drawing a conclusion from the experiences of 5
> > actually active projects with sometimes even 100 PRs a day is probably
> > better justified (yep - there are such projects).
> > So I would probably agree it has little influence on projects that have
> > no traffic. But it has enormous influence on projects that actually have
> > traffic. You have several teams of people scrambling now to somehow manage
> > their CI as it is unbearable now. Is this serious? I'd say so.
> >
> >         When you see Airflow backed up, maybe you should try submitting a
> > PR to another project yourself to see what happens.
> >
> > I am already spending a TON of my private time trying to help others in
> > the community. I would really appreciate a little help from your side. So
> > maybe you could just submit 2-3 PRs yourself any time Monday - Friday
> > 12pm CET -> 8pm CET - this is when bottlenecks regularly happen. Please
> > let everyone know your findings.
> >
> > J,
> >
> >
> > On Tue, Feb 9, 2021 at 8:35 AM Allen Wittenauer
> > <aw...@effectivemachines.com.invalid> wrote:
> >
> > >
> > >
> > > > On Feb 8, 2021, at 5:00 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> > > >
> > > >> I'm not convinced this is true. I have yet to see any of my PRs for
> > > >> "non-big" projects getting queued while Spark, Airflow, others are.
> > > >> Thus why I think there are only a handful of projects that are getting
> > > >> upset about this but the rest of us are like "meh whatever."
> > > >
> > > > Do you have any data on that? Or is it just anecdotal evidence?
> > >
> > >         Totally anecdotal.  Like when I literally ran a Yetus PR during
> > > the builds meeting as you were complaining about Airflow having an X
> > > deep queue. My PR ran fine, no pause.
> > >
> > > > You can see some analysis and actually even charts here:
> > > >
> > > > https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> > >
> > >         Yes, and I don't even see Yetus showing up.  I wonder how many
> > > other projects are getting dropped from the dataset....
> > >
> > > > Maybe you have a very tiny "PR traffic" and it is mostly in the time
> > > > zone that is not affected?
> > >
> > >         True, it has very tiny PR traffic right now.  (Sep/Oct/Nov was
> > > different though)  But if it was one big FIFO queue, our PR jobs would
> > > also get queued.  They aren't even when I go look at one of the other
> > > projects that does have queued jobs.
> > >
> > >         When you see Airflow backed up, maybe you should try
> > > submitting a PR to another project yourself to see what happens.
> > >
> > >         All I'm saying is: right now, that document feels like it is
> > > _greatly_ overstating the problem and now that you point it out,
> > > clearly dropping data.  It is a problem, to be sure, but not all GitHub
> > > Actions projects are suffering.  (I wouldn't be surprised if smaller
> > > projects are actually fast tracked through the build queue in order to
> > > avoid a tyranny of the majority/resource starvation problem... which
> > > would be ironic given how much of an issue that is at the ASF.)
> >
> >
> >
> > --
> > +48 660 796 129
> >
>


-- 
+48 660 796 129

Re: GA again unreasonably slow (again)

Posted by Matt Sicker <bo...@gmail.com>.
To be honest, this sounds exactly like the usual CI problem on every
platform. As your project scales up, CI becomes a Hard Problem. I don’t
think throwing hardware at it indefinitely works, though your research here
is finding most of the useful things.

On Tue, Feb 9, 2021 at 02:21 Jarek Potiuk <ja...@potiuk.com> wrote:

> The report shows only top contenders. And yes - we know it is flawed -
> because it shows workflows not jobs (if you read the disclaimers - we
> simply do not have enough API call quota to get detailed information for all
> projects).
>
> So this is anecdotal. I also get no queue when I submit PR at 11 pm.
> Actually whole Airflow committer team had to switch to the "night shift"
> because of that. And the most "traffic-heavy" projects - Spark, Pulsar,
> Superset, Beam, Airflow -  I think some of the top "traffic" projects
> experience the same issues and several hours queue when they run during the
> EMEA day/US morning.  And we all together try to help each other (for
> example I helped yesterday the Pulsar team to implement most aggressive way
> of cancelling their workflows https://github.com/apache/pulsar/pull/9503
> (you can find pretty good explanation why and how it was implemented this
> way), also we are working together with the Pulsar team to optimize their
> workflow - there is a document
>
> https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit
> where several people are adding their suggestions (including myself based
> on Airflow experiences).
>
> And with Yetus' 12 (!) workflow runs over the last 2 months (!)
> https://pasteboard.co/JNwGLiR.png - indeed you have a high chance you have
> not experienced it, especially that you are the only person committing
> there. This is hardly representative for other projects that have 100s of
> committers and 100s of PRs a day. I am not sure if you are aware of
> that, but those are the most valuable projects for the ASF - as those are
> the ones that actually build community (following the "community over code"
> motto). If you have 3 PRs in 3 months and there are 200 other projects
> using GA, I think yetus is not going to show up in any meaningful
> statistics.
>
> I am not sure if drawing a conclusion from a project that has 3 PRs in 2
> months is the best way of drawing conclusions for the overall Apache
> organisation. I think drawing a conclusion from experiences of 5 actually
> active projects with sometimes even 100 PRs a day is probably better
> justified (yep - there are such projects).
> So I would probably agree it has little influence on projects that have no
> traffic. But enormous influence on projects that actually have traffic. You
> have several teams of people scrambling now to  somehow manage their CI as
> it is unbearable now. Is this serious ? I'd say so.
>
>         When you see Airflow backed up, maybe you should try submitting a
> PR to another project yourself to see what happens.
>
> I am already spending a TON of my private time trying to help others in the
> community. I would really appreciate a little help from your side. So maybe
> you just submit 2-3 PRs yourself any time Monday - Friday 12pm CET -> 8pm
> CET - this is where regularly bottlenecks happen. Please let everyone know
> your findings
>
> J,
>
>
> On Tue, Feb 9, 2021 at 8:35 AM Allen Wittenauer
> <aw...@effectivemachines.com.invalid> wrote:
>
> >
> >
> > > On Feb 8, 2021, at 5:00 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > >> I'm not convinced this is true. I have yet to see any of my PRs for
> > > "non-big" projects getting queued while Spark, Airflow, others are.
> Thus
> > > why I think there are only a handful of projects that are getting upset
> > > about this but the rest of us are like "meh whatever."
> > >
> > > Do you have any data on that? Or is it just anecdotal evidence?
> >
> >         Totally anecdotal.  Like when I literally ran a Yetus PR during
> > the builds meeting as you were complaining about Airflow having an X deep
> > queue. My PR ran fine, no pause.
> >
> > > You can see some analysis and actually even charts here:
> > >
> https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
> >
> >         Yes, and I don't even see Yetus showing up.  I wonder how many
> > other projects are getting dropped from the dataset....
> >
> > > Maybe you have a very tiny "PR traffic" and it is mostly in the time
> zone
> > > that is not affected?
> >
> >         True, it has very tiny PR traffic right now.  (Sep/Oct/Nov was
> > different though)  But if it was one big FIFO queue, our PR jobs would
> also
> > get queued.  They aren't even when I go look at one of the other projects
> > that does have queued jobs.
> >
> >         When you see Airflow backed up, maybe you should try submitting a
> > PR to another project yourself to see what happens.
> >
> >         All I'm saying is: right now, that document feels like it is
> > _greatly_ overstating the problem and now that you point it out, clearly
> > dropping data.  It is problem, to be sure, but not all GitHub Actions
> > projects are suffering.  (I wouldn't be surprised if smaller projects are
> > actually fast tracked through the build queue in order to avoid a tyranny
> > of the majority/resource starvation problem... which would be ironic
> given
> > how much of an issue that is at the ASF.)
>
>
>
> --
> +48 660 796 129
>

Re: GA workflow cancellation

Posted by Jarek Potiuk <ja...@potiuk.com>.
I updated the docs slightly:

* https://github.com/potiuk/cancel-workflow-runs#the-action-target-workflow
-> description about the workflow the action acts on
*
https://github.com/potiuk/cancel-workflow-runs#tackling-the-high-queue-strain-situation
-> I added "Tackling the high queue strain situation" chapter - where I
explain how `allDuplicates` mode helps to fight it.

J


On Tue, Feb 9, 2021 at 7:53 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> > Er... so "source" workflow is the triggering workflow, and "target"
> workflow is the cancelling workflow?  Or is it something else?  I'm
> afraid all those explanations are a bit confusing to me :-)
>
> Precisely. Unfortunately GA's features for the "special" workflows are as
> powerful as they are complex, and even they did not introduce a good
> terminology. The whole `workflow_run` concept is fantastic from being
> security conscious, but it is also damn difficult to wrap your head around
> it. It took me quite some time to master it and even now I have trouble
> with explaining it, sorry.
> But yes - the terminology I introduced is: PR workflow ("source workflow")
> with READ-ONLY GITHUB_TOKEN is the one that triggers the "workflow_run"
> Cancel workflow ("target workflow") (with WRITE GITHUB_TOKEN).
>
> > I would gladly try to make a PR, but my Javascript competence is
> approximately zero, unfortunately. :-S
>
> I am not at all Javascript dev. If anything I am nowadays mostly a Python
> and Bash developer (often also a Gmail writer ;) ) , but I do have a good
> deal of C/C++/Java/Groovy/Bash and a bit of Javascript under my belt.
> Actually, the action is written in TypeScript not Javascript (Typescript is
> transpiled to javascript) and developing in Typescript if you have a good
> IDE is easy :). This might be easier than you think.
>
> J.
>
> On Tue, Feb 9, 2021 at 7:34 PM Antoine Pitrou <an...@python.org> wrote:
>
>>
>> Le 09/02/2021 à 19:28, Jarek Potiuk a écrit :
>> > Yes. I noticed this misleading line and I will update the docs shortly.
>> It
>> > should be """"Cancels duplicate runs from all running workflow *runs of
>> the
>> > workflow the action acts on*."""
>> >
>> > The action is written in the way that it always acts on a single
>> workflow.
>> >
>> > By default this is indeed the same workflow as the one run by action.
>> But
>> > there are few ways it can be a different workflow:
>> >
>> > 1) In most cases (especially when you want to enable canceling for Pull
>> > Requests from forks), the action should be run "workflow_run" event. And
>> > this even has two workflows "the source one" (for example pull_request)
>> and
>> > "the target one" (this is the workflow that the cancel action should be
>> > part of). If you specify "${{ github.event.workflow_run.id }}" as
>> > sourceRunId - then the action will act on the "Source" workflow instead
>> of
>> > the default "target one".
>>
>> Er... so "source" workflow is the triggering workflow, and "target"
>> workflow is the cancelling workflow?  Or is it something else?  I'm
>> afraid all those explanations are a bit confusing to me :-)
>>
>> > 2) You can explicitly specify the "workflowFileName" as I did in Pulsar
>> -
>> > then the action will act on the specified workflow.
>> >
>> > It would be possible to add a feature to the action to run on "array of
>> > workflowFileNames" - It is possible, and not very complex, but I decided
>> > not to do it, as it is just a slight improvement over the copy&pasting
>> the
>> > action to workflow file several times (As I did in pulsar).
>>
>> The "array of workflowFileNames" would be quite welcome for us, because
>> right now the amount of copy / pasting is a bit unwieldy, and it's very
>> easy to make an error:
>> https://github.com/apache/arrow/pull/9455/files
>>
>> I would gladly try to make a PR, but my Javascript competence is
>> approximately zero, unfortunately. :-S
>>
>> Best regards
>>
>> Antoine.
>>
>
>
> --
> +48 660 796 129
>


-- 
+48 660 796 129

Re: GA workflow cancellation

Posted by Jarek Potiuk <ja...@potiuk.com>.
> Er... so "source" workflow is the triggering workflow, and "target"
workflow is the cancelling workflow?  Or is it something else?  I'm
afraid all those explanations are a bit confusing to me :-)

Precisely. Unfortunately GA's features for the "special" workflows are as
powerful as they are complex, and even GitHub did not introduce good
terminology for them. The whole `workflow_run` concept is fantastic from a
security standpoint, but it is also damn difficult to wrap your head around.
It took me quite some time to master it and even now I have trouble
explaining it, sorry.
But yes - the terminology I introduced is: the PR workflow ("source workflow"),
which runs with a READ-ONLY GITHUB_TOKEN, is the one that triggers the
"workflow_run" Cancel workflow ("target workflow"), which runs with a WRITE
GITHUB_TOKEN.

> I would gladly try to make a PR, but my Javascript competence is
approximately zero, unfortunately. :-S

I am not a JavaScript dev at all. If anything I am nowadays mostly a Python
and Bash developer (often also a Gmail writer ;) ), but I do have a good
deal of C/C++/Java/Groovy/Bash and a bit of JavaScript under my belt.
Actually, the action is written in TypeScript, not JavaScript (TypeScript is
transpiled to JavaScript), and developing in TypeScript is easy if you have a
good IDE :). This might be easier than you think.

J.

On Tue, Feb 9, 2021 at 7:34 PM Antoine Pitrou <an...@python.org> wrote:

>
> Le 09/02/2021 à 19:28, Jarek Potiuk a écrit :
> > Yes. I noticed this misleading line and I will update the docs shortly.
> It
> > should be """"Cancels duplicate runs from all running workflow *runs of
> the
> > workflow the action acts on*."""
> >
> > The action is written in the way that it always acts on a single
> workflow.
> >
> > By default this is indeed the same workflow as the one run by action. But
> > there are few ways it can be a different workflow:
> >
> > 1) In most cases (especially when you want to enable canceling for Pull
> > Requests from forks), the action should be run "workflow_run" event. And
> > this even has two workflows "the source one" (for example pull_request)
> and
> > "the target one" (this is the workflow that the cancel action should be
> > part of). If you specify "${{ github.event.workflow_run.id }}" as
> > sourceRunId - then the action will act on the "Source" workflow instead
> of
> > the default "target one".
>
> Er... so "source" workflow is the triggering workflow, and "target"
> workflow is the cancelling workflow?  Or is it something else?  I'm
> afraid all those explanations are a bit confusing to me :-)
>
> > 2) You can explicitly specify the "workflowFileName" as I did in Pulsar -
> > then the action will act on the specified workflow.
> >
> > It would be possible to add a feature to the action to run on "array of
> > workflowFileNames" - It is possible, and not very complex, but I decided
> > not to do it, as it is just a slight improvement over the copy&pasting
> the
> > action to workflow file several times (As I did in pulsar).
>
> The "array of workflowFileNames" would be quite welcome for us, because
> right now the amount of copy / pasting is a bit unwieldy, and it's very
> easy to make an error:
> https://github.com/apache/arrow/pull/9455/files
>
> I would gladly try to make a PR, but my Javascript competence is
> approximately zero, unfortunately. :-S
>
> Best regards
>
> Antoine.
>


-- 
+48 660 796 129

Re: GA workflow cancellation

Posted by Antoine Pitrou <an...@python.org>.
Le 09/02/2021 à 19:28, Jarek Potiuk a écrit :
> Yes. I noticed this misleading line and I will update the docs shortly. It
> should be """"Cancels duplicate runs from all running workflow *runs of the
> workflow the action acts on*."""
> 
> The action is written in the way that it always acts on a single workflow.
> 
> By default this is indeed the same workflow as the one run by action. But
> there are few ways it can be a different workflow:
> 
> 1) In most cases (especially when you want to enable canceling for Pull
> Requests from forks), the action should be run "workflow_run" event. And
> this even has two workflows "the source one" (for example pull_request) and
> "the target one" (this is the workflow that the cancel action should be
> part of). If you specify "${{ github.event.workflow_run.id }}" as
> sourceRunId - then the action will act on the "Source" workflow instead of
> the default "target one".

Er... so "source" workflow is the triggering workflow, and "target"
workflow is the cancelling workflow?  Or is it something else?  I'm
afraid all those explanations are a bit confusing to me :-)

> 2) You can explicitly specify the "workflowFileName" as I did in Pulsar -
> then the action will act on the specified workflow.
> 
> It would be possible to add a feature to the action to run on "array of
> workflowFileNames" - It is possible, and not very complex, but I decided
> not to do it, as it is just a slight improvement over the copy&pasting the
> action to workflow file several times (As I did in pulsar).

The "array of workflowFileNames" would be quite welcome for us, because
right now the amount of copy / pasting is a bit unwieldy, and it's very
easy to make an error:
https://github.com/apache/arrow/pull/9455/files

I would gladly try to make a PR, but my Javascript competence is
approximately zero, unfortunately. :-S

Best regards

Antoine.

Re: GA workflow cancellation

Posted by Jarek Potiuk <ja...@potiuk.com>.
Yes. I noticed this misleading line and I will update the docs shortly. It
should be """Cancels duplicate runs from all running workflow *runs of the
workflow the action acts on*."""

The action is written in such a way that it always acts on a single workflow.

By default this is indeed the same workflow as the one running the action. But
there are a few ways it can be a different workflow:

1) In most cases (especially when you want to enable cancelling for Pull
Requests from forks), the action should be run on the "workflow_run" event.
This event involves two workflows: "the source one" (for example pull_request)
and "the target one" (this is the workflow that the cancel action should be
part of). If you specify "${{ github.event.workflow_run.id }}" as
sourceRunId - then the action will act on the "source" workflow instead of
the default "target" one.
2) You can explicitly specify the "workflowFileName" as I did in Pulsar -
then the action will act on the specified workflow.
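
To make the "acts on a single workflow" idea concrete, here is a minimal
Python sketch of how duplicate runs of one workflow could be selected for
cancellation. It is only an illustration, not the action's actual
implementation: the `Run` record, its field names and the grouping key
(repository, branch, event) are assumptions made for the example.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """Hypothetical record describing one workflow run."""
    run_id: int          # assumed to grow with time, so max() is the newest run
    workflow_file: str   # e.g. "ci.yml"
    head_repo: str       # repository (possibly a fork) the run came from
    head_branch: str
    event: str           # e.g. "pull_request" or "push"


def duplicates_to_cancel(runs, workflow_file):
    """Return ids of runs of one workflow that are superseded by a newer run."""
    groups = defaultdict(list)
    for run in runs:
        if run.workflow_file != workflow_file:
            continue  # only the workflow the action acts on is considered
        key = (run.head_repo, run.head_branch, run.event)
        groups[key].append(run.run_id)

    to_cancel = []
    for ids in groups.values():
        newest = max(ids)
        # Everything older than the newest run in the group is a duplicate.
        to_cancel.extend(i for i in ids if i != newest)
    return sorted(to_cancel)
```

With three runs of "ci.yml" from the same fork and branch, the two older ones
would be returned for cancellation, while runs of any other workflow file are
ignored.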

It would be possible to add a feature to the action to run on an "array of
workflowFileNames". It is not very complex, but I decided
not to do it, as it is just a slight improvement over copy&pasting the
action into the workflow file several times (as I did in Pulsar). The "workflow
per action" approach, however, has the benefit that you can clearly see in the
GitHub Actions output the logs separated per workflow - which makes it
easier to debug and reason about.
However, similar results could be achieved by using log grouping
https://docs.github.com/en/actions/reference/workflow-commands-for-github-actions#grouping-log-lines
which I did not realize back then. So if someone would like to make a PR and
add such a feature - I am happy to review and merge it! You could also
potentially query for all the workflows using the API, but this is rather
dangerous IMHO - for example in Pulsar I had to exclude one of the
workflows, because it was only triggered by creating/updating an issue and
my action failed when trying to query it. Probably this could be detected
and fixed, but there are many types of workflows in GitHub Actions, so
spelling them out explicitly is the better idea IMHO.
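
A rough sketch of what such an "array of workflowFileNames" loop with log
grouping could look like, written in Python only for illustration (the action
itself is TypeScript). The `::group::` and `::endgroup::` lines are the
documented workflow commands for grouping log lines; the function name and
the `cancel_one` callback are hypothetical.

```python
def run_for_each_workflow(workflow_file_names, cancel_one):
    """Call cancel_one(name) for each explicitly listed workflow file.

    The output of each call is wrapped in a GitHub Actions log group, so the
    logs stay separated per workflow even though one step handles them all.
    """
    results = {}
    for name in workflow_file_names:
        print(f"::group::Cancelling duplicates for {name}")
        try:
            results[name] = cancel_one(name)
        finally:
            # Always close the group, even if cancel_one raises.
            print("::endgroup::")
    return results
```

The list of workflow files is still spelled out explicitly by the caller,
which keeps the behaviour predictable for workflow types that cannot be
queried in the same way.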

Regarding "queuing the cancelling workflow": indeed, cancel workflows will
themselves be queued - this is why the "allDuplicates" cancel mode was
introduced - I explained exactly how it works and how the "high strain"
situation is handled in this comment:
https://github.com/apache/pulsar/pull/9503#issuecomment-774644408  - I am
moving part of the explanation to the documentation of the action soon.

J.


On Tue, Feb 9, 2021 at 11:43 AM Antoine Pitrou <an...@python.org> wrote:

>
> Hi Jarek,
>
> Le 09/02/2021 à 09:21, Jarek Potiuk a écrit :
> > And we all together try to help each other (for
> > example I helped yesterday the Pulsar team to implement most aggressive
> way
> > of cancelling their workflows https://github.com/apache/pulsar/pull/9503
> > (you can find pretty good explanation why and how it was implemented this
> > way),
>
> Thanks for doing this.  I'm trying to experiment with your action and
> there's something a bit confusing in the documentation.  It says for the
> "allDuplicates" action:
>
> """Cancels duplicate runs from all running workflows."""
>
> However, it doesn't seem to cancel duplicates from all running
> workflows, only from the triggering workflow.  Is that true?
>
> Would there be a way to really cancel duplicates across all workflows,
> not only the triggering workflow, without listing all the workflow files
> explicitly like you did for Pulsar?
>
>
> Ah, another concern: since the cancelling workflow is itself a workflow,
> it may be subject to the same queueing as other workflows... meaning, if
> the build queue is clogged, it will run too late to have the maximum
> effect?
>
> Thank you
>
> Regards
>
> Antoine.
>


-- 
+48 660 796 129

Re: GA workflow cancellation

Posted by Antoine Pitrou <an...@python.org>.
Hi Jarek,

Le 09/02/2021 à 09:21, Jarek Potiuk a écrit :
> And we all together try to help each other (for
> example I helped yesterday the Pulsar team to implement most aggressive way
> of cancelling their workflows https://github.com/apache/pulsar/pull/9503
> (you can find pretty good explanation why and how it was implemented this
> way),

Thanks for doing this.  I'm trying to experiment with your action and
there's something a bit confusing in the documentation.  It says for the
"allDuplicates" action:

"""Cancels duplicate runs from all running workflows."""

However, it doesn't seem to cancel duplicates from all running
workflows, only from the triggering workflow.  Is that true?

Would there be a way to really cancel duplicates across all workflows,
not only the triggering workflow, without listing all the workflow files
explicitly like you did for Pulsar?


Ah, another concern: since the cancelling workflow is itself a workflow,
it may be subject to the same queueing as other workflows... meaning, if
the build queue is clogged, it will run too late to have the maximum effect?

Thank you

Regards

Antoine.

Re: GA again unreasonably slow (again)

Posted by Jarek Potiuk <ja...@potiuk.com>.
The report shows only top contenders. And yes - we know it is flawed -
because it shows workflows not jobs (if you read the disclaimers - we
simply do not have enough API call quota to get detailed information for all
projects).

So this is anecdotal. I also get no queue when I submit a PR at 11 pm.
Actually the whole Airflow committer team had to switch to the "night shift"
because of that. And the most "traffic-heavy" projects - Spark, Pulsar,
Superset, Beam, Airflow - I think some of the top "traffic" projects
experience the same issues and several-hour queues when they run during the
EMEA day/US morning.  And we all try to help each other. For
example, yesterday I helped the Pulsar team implement the most aggressive way
of cancelling their workflows: https://github.com/apache/pulsar/pull/9503
(you can find a pretty good explanation there of why and how it was
implemented this way). We are also working together with the Pulsar team to
optimize their workflow - there is a document
https://docs.google.com/document/d/1FNEWD3COdnNGMiryO9qBUW_83qtzAhqjDI5wwmPD-YE/edit
where several people are adding their suggestions (including myself, based
on Airflow experiences).

And with Yetus' 12 (!) workflow runs over the last 2 months (!)
https://pasteboard.co/JNwGLiR.png - indeed there is a high chance you have
not experienced it, especially since you are the only person committing
there. This is hardly representative of other projects that have 100s of
committers and 100s of PRs a day. I am not sure if you are aware of
that, but those are the most valuable projects for the ASF - as those are
the ones that actually build community (following the "community over code"
motto). If you have 3 PRs in 3 months and there are 200 other projects
using GA, I think Yetus is not going to show up in any meaningful
statistics.

I am not sure if drawing a conclusion from a project that has 3 PRs in 2
months is the best way of drawing conclusions for the overall Apache
organisation. I think drawing a conclusion from experiences of 5 actually
active projects with sometimes even 100 PRs a day is probably better
justified (yep - there are such projects).
So I would probably agree it has little influence on projects that have no
traffic. But enormous influence on projects that actually have traffic. You
have several teams of people scrambling now to  somehow manage their CI as
it is unbearable now. Is this serious ? I'd say so.

> When you see Airflow backed up, maybe you should try submitting a
> PR to another project yourself to see what happens.

I am already spending a TON of my private time trying to help others in the
community. I would really appreciate a little help from your side. So maybe
you could submit 2-3 PRs yourself any time Monday - Friday 12pm CET -> 8pm
CET - this is when bottlenecks regularly happen. Please let everyone know
your findings.

J,


On Tue, Feb 9, 2021 at 8:35 AM Allen Wittenauer
<aw...@effectivemachines.com.invalid> wrote:

>
>
> > On Feb 8, 2021, at 5:00 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> >> I'm not convinced this is true. I have yet to see any of my PRs for
> > "non-big" projects getting queued while Spark, Airflow, others are.  Thus
> > why I think there are only a handful of projects that are getting upset
> > about this but the rest of us are like "meh whatever."
> >
> > Do you have any data on that? Or is it just anecdotal evidence?
>
>         Totally anecdotal.  Like when I literally ran a Yetus PR during
> the builds meeting as you were complaining about Airflow having an X deep
> queue. My PR ran fine, no pause.
>
> > You can see some analysis and actually even charts here:
> > https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status
>
>         Yes, and I don't even see Yetus showing up.  I wonder how many
> other projects are getting dropped from the dataset....
>
> > Maybe you have a very tiny "PR traffic" and it is mostly in the time zone
> > that is not affected?
>
>         True, it has very tiny PR traffic right now.  (Sep/Oct/Nov was
> different though)  But if it was one big FIFO queue, our PR jobs would also
> get queued.  They aren't even when I go look at one of the other projects
> that does have queued jobs.
>
>         When you see Airflow backed up, maybe you should try submitting a
> PR to another project yourself to see what happens.
>
>         All I'm saying is: right now, that document feels like it is
> _greatly_ overstating the problem and now that you point it out, clearly
> dropping data.  It is problem, to be sure, but not all GitHub Actions
> projects are suffering.  (I wouldn't be surprised if smaller projects are
> actually fast tracked through the build queue in order to avoid a tyranny
> of the majority/resource starvation problem... which would be ironic given
> how much of an issue that is at the ASF.)



-- 
+48 660 796 129

Re: GA again unreasonably slow (again)

Posted by Allen Wittenauer <aw...@effectivemachines.com.INVALID>.

> On Feb 8, 2021, at 5:00 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
>> I'm not convinced this is true. I have yet to see any of my PRs for
> "non-big" projects getting queued while Spark, Airflow, others are.  Thus
> why I think there are only a handful of projects that are getting upset
> about this but the rest of us are like "meh whatever."
> 
> Do you have any data on that? Or is it just anecdotal evidence?

	Totally anecdotal.  Like when I literally ran a Yetus PR during the builds meeting as you were complaining about Airflow having an X deep queue. My PR ran fine, no pause.

> You can see some analysis and actually even charts here:
> https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status

	Yes, and I don't even see Yetus showing up.  I wonder how many other projects are getting dropped from the dataset....

> Maybe you have a very tiny "PR traffic" and it is mostly in the time zone
> that is not affected?

	True, it has very tiny PR traffic right now.  (Sep/Oct/Nov was different though)  But if it was one big FIFO queue, our PR jobs would also get queued.  They aren't even when I go look at one of the other projects that does have queued jobs.

	When you see Airflow backed up, maybe you should try submitting a PR to another project yourself to see what happens.

	All I'm saying is: right now, that document feels like it is _greatly_ overstating the problem and now that you point it out, clearly dropping data.  It is problem, to be sure, but not all GitHub Actions projects are suffering.  (I wouldn't be surprised if smaller projects are actually fast tracked through the build queue in order to avoid a tyranny of the majority/resource starvation problem... which would be ironic given how much of an issue that is at the ASF.)

Re: GA again unreasonably slow (again)

Posted by Jarek Potiuk <ja...@potiuk.com>.
>  I'm not convinced this is true. I have yet to see any of my PRs for
"non-big" projects getting queued while Spark, Airflow, others are.  Thus
why I think there are only a handful of projects that are getting upset
about this but the rest of us are like "meh whatever."

Do you have any data on that? Or is it just anecdotal evidence?

As opposed to anecdotal evidence, we built a system to gather statistics on
that - you can see some charts resulting from it, showing that we have
regularly started to experience ~300 / ~400 workflows queued during the
EMEA day.
You can see some analysis and actually even charts here:
https://cwiki.apache.org/confluence/display/BUILDS/GitHub+Actions+status

Maybe you have a very tiny "PR traffic" and it is mostly in the time zone
that is not affected?

What project are you talking about, Allen? What time zone are you in and how
often are your PRs running in different time zones?
Could you please provide some statistics on that? It seems surprising that
you have different experiences and I would love to get to the bottom of it.

Can you help with understanding why you are not experiencing it? That might
be helpful for others who do.


J



On Tue, Feb 9, 2021 at 1:33 AM Allen Wittenauer
<aw...@effectivemachines.com.invalid> wrote:

>
> > On Feb 7, 2021, at 4:44 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > If you are interested - my document is here. Open for comments - happy to
> > add you as editors if you want (just send me your gmail address in priv).
> > It is rather crude, I had no time to put a bit more effort into it due to
> > some significant changes in my company, but it should be easy to compare
> > the values and see the actual improvements we can get. There are likely a
> > few shortcuts there and some of the numbers are "back-of-the-envelope"
> and
> > we are going to validate them even more when we implement all the
> > optimisations, but the conclusions should be pretty sound.
> >
> >
> https://docs.google.com/document/d/1ZZeZ4BYMNX7ycGRUKAXv0s6etz1g-90Onn5nRQQHOfE/edit#
>
>
>
> "For Apache projects, starting December 2020 we are experiencing a high
> strain of GitHub Actions jobs. All Apache projects are sharing 180 jobs and
> as more projects are using GitHub Actions the job queue becomes a serious
> bottleneck. "
>
>         I'm not convinced this is true. I have yet to see any of my PRs
> for "non-big" projects getting queued while Spark, Airflow, others are.
> Thus why I think there are only a handful of projects that are getting
> upset about this but the rest of us are like "meh whatever."
>
>
>
>

-- 
+48 660 796 129

Re: GA again unreasonably slow (again)

Posted by Allen Wittenauer <aw...@effectivemachines.com.INVALID>.
> On Feb 7, 2021, at 4:44 PM, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
> If you are interested - my document is here. Open for comments - happy to
> add you as editors if you want (just send me your gmail address in priv).
> It is rather crude, I had no time to put a bit more effort into it due to
> some significant changes in my company, but it should be easy to compare
> the values and see the actual improvements we can get. There are likely a
> few shortcuts there and some of the numbers are "back-of-the-envelope" and
> we are going to validate them even more when we implement all the
> optimisations, but the conclusions should be pretty sound.
> 
> https://docs.google.com/document/d/1ZZeZ4BYMNX7ycGRUKAXv0s6etz1g-90Onn5nRQQHOfE/edit#



"For Apache projects, starting December 2020 we are experiencing a high strain of GitHub Actions jobs. All Apache projects are sharing 180 jobs and as more projects are using GitHub Actions the job queue becomes a serious bottleneck. "

	I'm not convinced this is true. I have yet to see any of my PRs for "non-big" projects getting queued while Spark, Airflow, others are.  Thus why I think there are only a handful of projects that are getting upset about this but the rest of us are like "meh whatever."




Re: GA again unreasonably slow (again)

Posted by Jarek Potiuk <ja...@potiuk.com>.
For anyone following this thread - an update on the progress we have made in
Airflow on building self-hosted infrastructure for GitHub Actions.

Ash from Airflow is really close to finalizing the work on a nice
auto-scaling framework for self-hosted workers, and we also checked what
the best value for money we can get is.

I've run some analysis on the performance and tested my hypothesis (based
on earlier experiences) of significant  optimisations we can get.

I've finished my analysis of potential optimizations we can get on our CI
with the self-hosted runners that Ash created. I did some performance
testing and a (very crude) comparison of the "traditional approach" (2-CPU
instances with local SSDs) with something I had already tested
several times on various CI arrangements: running the tests on high-memory
instances (8 CPU, 64 GB memory) and running everything (including the docker
engine) in "tmpfs" - a huge ramdisk.
It seems that 1h 20 minutes of test running can be decreased 8x (!) using this
approach (and parallelising some tests), while at the same time decreasing the
cost 2x (!). Yep. You heard right. We can have faster builds this way and
pay less for them: we should be able to decrease the time to run
all tests for one combination from 1h 20 minutes to 10 minutes.
This is possible because Ash and his team did a great job of setting up
auto-scaling EC2 instance runners on our Amazon EC2 account (we have
credits from Amazon to run those jobs - Astronomer also offered a donation to
keep it running). By utilizing it we can not only pay less but
also get much faster builds.
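
As a back-of-the-envelope check of those two numbers, here is a small Python
sketch. It assumes the simplest possible cost model (total cost = hourly price
x hours, one instance per test run), which is an assumption made for this
example and not something taken from the analysis document; the function names
are hypothetical.

```python
def new_duration_minutes(baseline_minutes, speedup):
    """Wall-clock time after applying the claimed speedup."""
    return baseline_minutes / speedup


def break_even_price_ratio(cost_ratio, speedup):
    """How much more expensive per hour the faster instance may be.

    With cost = hourly_price * hours:
        cost_ratio = price_ratio * (1 / speedup)
    so  price_ratio = cost_ratio * speedup.
    """
    return cost_ratio * speedup


minutes = new_duration_minutes(80, 8)          # 1h 20 minutes, 8x faster
price_ratio = break_even_price_ratio(0.5, 8)   # total cost halved
print(minutes, price_ratio)
```

Under this simplified model, a run that is 8x faster and costs half as much in
total means the larger instance can cost up to 4x more per hour than the
small one.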

If you are interested - my document is linked below. It is open for comments,
and I am happy to add you as editors if you want (just send me your Gmail
address privately). It is rather crude - I had no time to put more effort
into it due to some significant changes in my company - but it should be
easy to compare the values and see the actual improvements we can get. There
are likely a few shortcuts there, some of the numbers are
"back-of-the-envelope", and we are going to validate them further when we
implement all the optimisations, but the conclusions should be pretty sound.

https://docs.google.com/document/d/1ZZeZ4BYMNX7ycGRUKAXv0s6etz1g-90Onn5nRQQHOfE/edit#

J.


On Fri, Jan 8, 2021 at 10:02 PM Jarek Potiuk <ja...@potiuk.com> wrote:

>
>> We should be able to make an efficient query via GraphQL API right? I found
>> the REST API for actions to be a little underwhelming.
>
>
> That was the first thing I checked when we started looking at the stats.
> Unfortunately, last time I checked (and I even opened an issue about it
> with GitHub support) there was no GraphQL API for GitHub Actions.
>
> I got a GH support answer "Yeah we know GH API does not have
> GraphQL support yet, sorry". I think it has not changed since.
>
>
>> We have tried to make our builds faster with more caching but it's not
>> easy: since it's an embedded systems project, we need to target a lot of
>> configurations and most changes impact all builds.
>>
>
> Indeed, I know how much of my time was spent on optimising Airflow's GH
> usage. I think we eventually decreased the usage 10x or more. But it never
> helped, as long as anyone could, even accidentally, block all the slots in
> almost no time at all. We have no organisation-wide way to prevent this,
> and this is the problem.
>
> Right now I could:
> a) mine cryptocurrency using PRs to any Apache project
> b) block the queue for everyone
>
> I do not even have to be an Apache committer to do that. It is enough to
> open one well-crafted PR that spins off 180 jobs that run for 6 hours. It's
> super-flawed.
>
>
>>
>> We too would like to take advantage of our own runners, but more for the
>> ability to do Hardware In the Loop testing; we have avoided it for the
>> reasons already mentioned.
>>
>
> Self-hosted runners seem to be the only "reasonable" option for now, but
> the security issues with the current runner do not allow us to use them.
>
>>
>> --Brennan
>>
>
>
> --
> +48 660 796 129
>


-- 
+48 660 796 129

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Matt Sicker <bo...@gmail.com>.
It must have been many, many years since you last looked at Jenkins.
They've supported pipelines (stored in your code repository) since at
least 2015 or so.

https://www.jenkins.io/doc/book/pipeline/

On Fri, 8 Jan 2021 at 15:17, Antoine Pitrou <an...@python.org> wrote:
>
>
> Hi,
>
> On Fri, 8 Jan 2021 12:49:03 -0800
> Chris Lambertus <cm...@public.gmane.org> wrote:
> >
> > Have you considered the internal and fully supported ASF Jenkins
> > and/or Buildbot infrastructure? Infra has little control over the free
> > open source offerings, but we have significantly more resources we can
> > bring to bear on our own CI systems.
> >
> > What are the gaps in the ASF CI systems that are pushing people onto
> > less viable platforms such as GA?
>
> While being a PMC and core developer for Apache Arrow, I'm going to
> give a personal opinion here:
>
> - Jenkins: I think many people have had bad experiences with it in the
>   past (I remember trying to use Jenkins and the obnoxious Web-based
>   configuration threw me off)
>
> - at Apache Arrow, we once used our own Buildbot instance, but there
>   are problems with it. The main problem is that you can't edit the
>   configuration on the fly to add / change builds: especially, you
>   can't do so on a PR and have that validated before merging. That's
>   because any build configuration change requires restarting the
>   buildmaster.
>
> What platforms like Github Actions, Travis-CI, etc. give you (apart
> from free execution resources) is that the CI configuration is
> versioned alongside your source code. This is a very valuable model,
> especially when you're in a fast-changing stage.
>
> Regards
>
> Antoine.

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Jarek Potiuk <ja...@potiuk.com>.
Happy to close the thread.

However, is there any way we can track the progress of what's happening?

I am happy to help and do whatever is needed to move it forward, but the
number of "I do not know"s makes it difficult, especially at a time like
this, when it hits us hard and we have no other choice but to cry for
help in desperation.

J

On Sun, Jan 10, 2021 at 7:21 PM Gavin McDonald <gm...@apache.org> wrote:

> Hi All,
>
> 1. Infra has internal contacts. Infra is in contact with GH.
> 2. Github is a Sponsor, kindly donating their services
>     for free, let's not abuse that fact.
> 3. This is a PUBLIC mailing list and quite frankly some of
>      this discussion should not be taking place here, please
>      move to the private committers only users@infra.a.o
>      We can post a summary back to this thread when appropriate.
> 4. I have read every single word in this and other recent threads,
>     my silence here does not mean nothing is happening.
>
> Thanks all for your opinions, let's close this thread down here.
>
>
>
> On Sun, Jan 10, 2021 at 6:23 PM Dave Fisher <wa...@apache.org> wrote:
>
> > Jarek,
> >
> > I would suggest you have a direct chat with Greg Stein.
> >
> > Best Regards,
> > Dave
> >
> > Sent from my iPhone
> >
> > > On Jan 10, 2021, at 9:08 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
> > >
> > > On Sun, Jan 10, 2021 at 5:28 PM Matt Sicker <bo...@gmail.com> wrote:
> > >
> > > >> If we can get GA to handle our use case properly, that would be
> > > >> awesome. Being in the security engineering domain, though, I’m
> > > >> generally pessimistic about security, so please excuse my cynicism.
> > >>
> > >
> > > I totally understand. I am a bit more optimistic, especially since we
> > > could potentially throw some weight around - for example, a number of
> > > serious Apache projects taking a common, coordinated position of either
> > > "We can praise GitHub for their cooperation" or "They are not secure
> > > and not willing to improve". That is publicity they would either love
> > > (the former) or hate (the latter).
> > >
> > > I am happy to help in any way I can - represent INFRA in talks,
> > > describe the problems and propose solutions, word the "carrot" and
> > > "stick" options and even prepare what they could look like. I could
> > > take part in discussions with GitHub, maybe even escalate this to
> > > Microsoft if they do not show they are cooperating - but I have no
> > > mandate for doing so. I have no power to throw the weight of the ASF
> > > into the discussion. I would love to do that and lead it, if only I
> > > had this kind of power at least delegated to me and were provided with
> > > the means of contacting GitHub and representing the ASF (but I doubt
> > > anyone would give me that power; it is a bit risky, as with big power
> > > I would have no big responsibility).
> > >
> > > Tough call - I am not sure how else I can help INFRA/ASF to help me and
> > > others.
> > >
> > > J.
> > >
> > >
> > >>
> > >>> On Sun, Jan 10, 2021 at 03:43 Jarek Potiuk <Jarek.Potiuk@polidea.com
> >
> > >>> wrote:
> > >>>
> > >>> I have a feeling (though I cannot know for sure)
> > >>> that you are underestimating the power of an organization like ASF in
> > >>> actually 'stating' their requirements and 'expectations' towards
> > GitHub.
> > >>>
> > >>> I am now an engineer, but I used to be CTO, CEO, Head of IT, Head of
> > >>> Technology
> > >>> and I know that a lot can be achieved by proper communication,
> stating
> > >> your
> > >>> expectations clearly and follow-up and pushing when you are dealing
> > with
> > >>> partners like that - and engineering excellence or security
> perfection
> > is
> > >>> not the only
> > >>> the thing that matters. Usability, maintenance, streamlining
> > development
> > >>> matter and if you
> > >>> have "good enough security", they are more important for users.
> > >>>
> > >>> I know if you look at it from an "infrastructure security Jenkins"
> > point
> > >> of
> > >>> view - the Jenkins
> > >>> you manage is superior when it comes to security.
> > >>> This is perfectly clear, and I have no intention to question that or
> > >>> disagree with you.
> > >>> And yes - in this aspect I fully agree with you.
> > >>>
> > >>> But there are other aspects which I see (and try to explain).
> > >>> While I deeply care about security (as probably you could see from my
> > >>> earlier
> > >>> communication). Just limiting the discussion to "who is more secure"
> > is a
> > >>> terrible,
> > >>> terrible oversimplification.
> > >>>
> > >>> I encourage you to exercise empathy and see it from the side I was
> > >>> explaining -
> > >>> maintenance, features, integration, streamlining development. Those
> are
> > >>> important
> > >>> things for developers. Less important for security engineers of
> course,
> > >> but
> > >>> if
> > >>> we can satisfy security, those are the things that matter.
> > >>>
> > >>> I think currently we have mitigations for all the security problems
> we
> > >>> found at the project
> > >>> level. Also (as I mentioned before) we will have good leverage - via
> > >> social
> > >>> media pressure
> > >>> to push GA into solving those that are 'systemic' problems we found.
> > They
> > >>> are not
> > >>> necessary for our project to solve, but it would simplify your life
> as
> > >> you
> > >>> take care of so
> > >>> many projects. So the security bounties that I opened are not for me
> -
> > >> they
> > >>> are for the
> > >>> ASF as a whole and for the security team of ASF. I exercised a lot of
> > >>> empathy to your
> > >>> team that rather than only solving my problem, I also spend time and
> > >> effort
> > >>> to push
> > >>> GA into solving it for all ASF projects and in the way that ASF infra
> > >>> security will be satisfied.
> > >>> I did not have to do that. Yet I try to think about your needs there.
> > >>>
> > >>> And to be honest I expect something in return. Empathy and
> > understanding
> > >>> other needs
> > >>> I have - performance, usability, streamlining development, minimum
> > >>> engineering effort
> > >>> to solve our problems is the least I can ask for. Help in dealing
> > >>> with GitHub and
> > >>> exercising ASF powers would be great.
> > >>>
> > >>> Maybe with GitHub, the problem is that organizations like ASF do not
> > >>> exercise
> > >>> their leverage and do not clearly state what is essential for them
> > while
> > >>> working with
> > >>> partners like them?
> > >>>
> > > >>> Did the ASF explicitly contact GA and firmly state that solving the
> > > >>> problem of self-hosted runners is an absolute top priority for
> > > >>> resolving our performance issues?
> > > >>>
> > > >>> I do not know.
> > > >>>
> > > >>> Did anyone from the ASF contact GA and firmly state that the two
> > > >>> bounties I created are essential for the security team to be able to
> > > >>> provide security for the organization?
> > > >>>
> > > >>> I do not know.
> > > >>>
> > > >>> Did the ASF push GA in any way in this direction, stating that the
> > > >>> ASF is considering alternatives? (The "stick" in this discussion)
> > > >>>
> > > >>> I do not know.
> > > >>>
> > > >>> Did the ASF propose to GA that we can endorse their service, write
> > > >>> blogs, and ask the 100s of projects that will use GA to endorse their
> > > >>> service publicly once they start addressing our firmly stated needs
> > > >>> and expectations? (The "carrot" in this discussion)
> > >>>
> > >>> I do not know.
> > >>>
> > >>> This is what I would do if I were at INFRA. I am not. I am not even
> an
> > >> ASF
> > >>> member to
> > >>> have more insight and visibility into it.
> > >>>
> > >>> The only thing I can do is to ask for help and see if the ASF Infra
> is
> > >>> willing to help in
> > >>> the situation by exercising the powers that I do not have.
> > >>>
> > > >>> For me, this is really a test of whether the ASF has the power to
> > > >>> negotiate with such partners. If not - maybe it's time to think that
> > > >>> everything (including GitHub repos) should be self-hosted by INFRA,
> > > >>> because if you are dealing with partners like that you should have
> > > >>> some negotiating power; otherwise, you put yourself in a losing
> > > >>> position.
> > >>>
> > > >>> But again - I do not know much. This is what I would do if I had the
> > > >>> powers. On my side, I think I've shown that I go above and beyond
> > > >>> what you might expect from a PMC member of one of the ASF projects,
> > > >>> and asking for help from the organization that I - so far - proudly
> > > >>> belong to is the only thing I have left. I have run out of ammo.
> > >>>
> > >>> So again - please help!
> > >>>
> > >>> J.
> > >>>
> > >>>
> > >>>> On Sat, Jan 9, 2021 at 11:00 PM Matt Sicker <bo...@gmail.com>
> wrote:
> > >>>
> > >>>> I work on the Jenkins security team. We don’t have embarrassing
> > >> security
> > >>>> failures like this anymore, but part of that is due to the added
> > >>> complexity
> > >>>> of a secure configuration. By the time GA meets your security
> > >> standards,
> > >>>> it’ll likely either require non-trivial changes to your CI scripts,
> or
> > >>>> it’ll break various use cases that you otherwise considered to be
> > >>> usability
> > >>>> enhancements. It’s really getting annoying how every complaint you
> > have
> > >>>> about every non-Jenkins system isn’t a problem in Jenkins. We have
> > more
> > >>>> expertise available to customize things in such a way that works for
> > >>>> non-proprietary SaaS that most services are optimized for (which is
> > why
> > >>>> their security models tend to fall short once a large organization
> > like
> > >>>> Apache tries using something).
> > >>>>
> > >>>> Many of the features you’re asking from GA are likely non-trivial
> > >>>> architecture changes they’ll have to make to accommodate the
> > >> non-trivial
> > >>>> use cases we have. Or maybe it isn’t and they’re just incompetent?
> > >>>>
> > >>>> On Sat, Jan 9, 2021 at 05:58 Jarek Potiuk <ja...@potiuk.com> wrote:
> > >>>>
> > >>>>>>
> > >>>>>>>
> > >>>>>>> The multiple threads about how shitty those are in practice for
> > >>> your
> > >>>>>>> needs seem to indicate otherwise. Security and easy learning
> > >> curves
> > >>>>>>> don't seem to get along too well, do they?
> > >>>>>>
> > >>>>>
> > > >>>>> The usability, integration level (especially GitHub Actions) and
> > > >>>>> maintenance effort needed - this is far, far superior. If only we
> > > >>>>> could solve one simple problem - securely running the self-hosted
> > > >>>>> runners for GA - all our problems would be solved INSTANTLY.
> > >>>>>
> > >>>>> Security issues happen everywhere, at least if they happen in such
> > >>>> services
> > >>>>> you can
> > >>>>> mitigate (we just did it in Airflow- we mitigated all the security
> > >>> issues
> > >>>>> we found),
> > >>>>> open bounty requests (I did - I opened two bounty requests) and
> then
> > >>>>> escalate.
> > >>>>> If I do not hear about my 2 security bounties from GitHub shortly,
> > >>>>> I am going to start a hell of a social media campaign about it
> > >>>>> using all the means I can. I tried to responsibly disclose it but I
> > >> am
> > >>>>> going to write a nice
> > >>>>> blog post about "How to exploit Github Actions" and I am going to
> > >> tell
> > >>>> them
> > >>>>> that before
> > >>>>> I publish it and give them a chance to fix it.
> > >>>>>
> > >>>>> So you have many ways to influence the security of public services
> > >> like
> > >>>>> that. I think it's
> > >>>>> much better than when you have to manage security yourselves.
> > >>>>>
> > >>>>>
> > >>>>>>>
> > >>>>>>> That would all be possible in Jenkins, some of it would be fairly
> > >>>>>>> simple to integrate, others would indeed be non-trivial rewrites.
> > >>>>>>>
> > >>>>>>
> > >>>>>
> > > >>>>> Yep. The non-trivial ones are what I am afraid of. It took me a
> > > >>>>> year to perfect and optimise a number of steps in our CI, and the
> > > >>>>> problem was that it worked really well until it stopped, because
> > > >>>>> of an uncontrolled increase in usage from other projects and no
> > > >>>>> secure way to add the extra resources needed to run those (even
> > > >>>>> though we have the funds - we now have 8000 USD secured from our
> > > >>>>> stakeholders, with an outlook for more). But if you add the
> > > >>>>> engineering effort needed to migrate, the engineering time for
> > > >>>>> that costs FAR more than just enabling compute resources that use
> > > >>>>> all the engineering effort you have already spent. It is a
> > > >>>>> no-brainer which way is simpler, cheaper and can be done faster.
> > > >>>>> We just need to have a secure way of doing it.
> > >>>>>
> > >>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> You can have your own Jenkins controller for your PMC. This is
> > >>> vastly
> > >>>>>>> simpler for you to administer than a super time-shared
> > >> environment
> > >>>>>>> like GA. CI systems seem to be a dime a dozen nowadays, yet not a
> > >>>>>>> single one seems to implement job scheduling in a sufficiently
> > >>>>>>> customizable way that scales.
> > >>>>>>
> > >>>>>
> > >>>>> I do not want nor need to administer my CI. And I've done that many
> > >>> times
> > >>>>> in
> > >>>>> the past - Jenkins, Bamboo, GitLab - you name it.
> > >>>>> Heck - I built and maintained my first custom CI framework for my
> > >>> company
> > >>>>> some 20 years ago when the "CI" was just being coined.
> > >>>>> With CI as a service, I do not want to do it anymore. At all. CI
> for
> > >> me
> > >>>>> should just 'be there'.
> > > >>>>> A great CI is one whose existence you are not aware of until your
> > > >>>>> tests fail - and even then you just want to see the logs of your
> > > >>>>> failed tests and figure out the reason. This is what you want from
> > > >>>>> a CI system. I do not want to learn how
> > >>>>> to manage Jenkins, install plugins, configure that etc. This is not
> > >> my
> > >>>> job,
> > >>>>> nor any
> > >>>>> one in our project. This requires far more than just setting it
> > >>>>> up - it is making sure that it is secure, that it runs 24/7, that
> it
> > >>> gets
> > >>>>> updated etc. etc.
> > >>>>> This is far more complex than 'just use CI'. We have enough trouble
> > >>> with
> > >>>>> setting up
> > >>>>> and maintaining runners (once we get them securely connected).
> > >>>>>
> > >>>>> I know it looks differently from the infrastructure person point of
> > >>> view
> > >>>> -
> > >>>>> running
> > >>>>> Jenkins is pretty much core part of what you do. But for project
> > >>>>> developers, CI
> > >>>>> should just 'work'. This is what I get from GitHub Actions. It just
> > >>>>> 'works'. I have
> > >>>>> to spend 0 effort to maintain it. Sometimes when it does not work,
> it
> > >>>>> pains, but
> > >>>>> then it's their problem to fix - and they have to fix it eventually
> > >>>> because
> > >>>>> they get
> > >>>>> pressure from all their customers. In case I run my own jenkins
> > >> install
> > >>>> and
> > >>>>> administer it - all those problems fall on us. I do not want that.
> > >> This
> > >>>>> moves us
> > >>>>> away from doing what we should - develop our product.
> > >>>>>
> > >>>>>
> > >>>>>>>
> > >>>>>>> Definitely valid points. Any CI migrations are non-trivial,
> > >>>> especially
> > >>>>>>> once you've set up nice workflows. Perhaps there are some
> > >>>> alternatives
> > >>>>>>> that can help bridge the gap if GA still can't meet your needs.
> > >>> I've
> > >>>>>>> seen prow [1] used in various projects in the Kubernetes
> > >>> communities,
> > >>>>>>> and I'm sure there must be plenty of others.
> > >>>>>>
> > >>>>>
> > >>>>> GA meets all my needs. Except one that I am asking ASF to help
> with -
> > >>>>> make GitHub focus on making a secure way of working with
> self-hosted
> > >>>>> runners. That's it . We even (In November) opened a PR to Github
> > >>> Actions
> > >>>>> Runner to enable it:  https://github.com/actions/runner/pull/783
> > >>>>> But we have not heard anything since.
> > >>>>>
> > >>>>> This is what I ask INFRA to help with - put pressure on GitHub to
> > >> make
> > >>> it
> > >>>>> happen. I need nothing more - no money, no Jenkins, nothing like
> > >> that.
> > >>>>> I just want to be able to spend the money we managed to secure.
> > >>>>>
> > >>>>> J.
> > >>>>>
> > >>>>>
> > >>>>> --
> > >>>>> +48 660 796 129
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>>
> > >>> Jarek Potiuk
> > >>> Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >>>
> > >>> M: +48 660 796 129 <+48660796129>
> > >>> [image: Polidea] <https://www.polidea.com/>
> > >>>
> > >>
> > >
> > >
> > > --
> > > +48 660 796 129
> >
> >
>
> --
>
> *Gavin McDonald*
> Systems Administrator
> ASF Infrastructure Team
>


-- 
+48 660 796 129

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Gavin McDonald <gm...@apache.org>.
Hi All,

1. Infra has internal contacts. Infra is in contact with GH.
2. Github is a Sponsor, kindly donating their services
    for free, let's not abuse that fact.
3. This is a PUBLIC mailing list and quite frankly some of
     this discussion should not be taking place here, please
     move to the private committers only users@infra.a.o
     We can post a summary back to this thread when appropriate.
4. I have read every single word in this and other recent threads,
    my silence here does not mean nothing is happening.

Thanks all for your opinions, let's close this thread down here.



On Sun, Jan 10, 2021 at 6:23 PM Dave Fisher <wa...@apache.org> wrote:

> Jarek,
>
> I would suggest you have a direct chat with Greg Stein.
>
> Best Regards,
> Dave
>
> Sent from my iPhone
>
> > On Jan 10, 2021, at 9:08 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > On Sun, Jan 10, 2021 at 5:28 PM Matt Sicker <bo...@gmail.com> wrote:
> >
> >> If we can get GA to handle our use case properly, that would be awesome.
> >> Being in the security engineering domain, though, I’m generally
> >> pessimistic about security, so please excuse my cynicism.
> >>
> >
> > I totally understand. I am a bit more optimistic, especially since we
> > could potentially throw some weight around - for example, a number of
> > serious Apache projects taking a common, coordinated position of either
> > "We can praise GitHub for their cooperation" or "They are not secure
> > and not willing to improve". That is publicity they would either love
> > (the former) or hate (the latter).
> >
> > I am happy to help in any way I can - represent INFRA in talks,
> > describe the problems and propose solutions, word the "carrot" and
> > "stick" options and even prepare what they could look like. I could
> > take part in discussions with GitHub, maybe even escalate this to
> > Microsoft if they do not show they are cooperating - but I have no
> > mandate for doing so. I have no power to throw the weight of the ASF
> > into the discussion. I would love to do that and lead it, if only I
> > had this kind of power at least delegated to me and were provided with
> > the means of contacting GitHub and representing the ASF (but I doubt
> > anyone would give me that power; it is a bit risky, as with big power
> > I would have no big responsibility).
> >
> > Tough call - I am not sure how else I can help INFRA/ASF to help me and
> > others.
> >
> > J.
> >
> >
> >>
> >>> On Sun, Jan 10, 2021 at 03:43 Jarek Potiuk <Ja...@polidea.com>
> >>> wrote:
> >>>
> >>> I have a feeling (though I cannot know for sure)
> >>> that you are underestimating the power of an organization like ASF in
> >>> actually 'stating' their requirements and 'expectations' towards
> GitHub.
> >>>
> >>> I am now an engineer, but I used to be CTO, CEO, Head of IT, Head of
> >>> Technology
> >>> and I know that a lot can be achieved by proper communication, stating
> >> your
> >>> expectations clearly and follow-up and pushing when you are dealing
> with
> >>> partners like that - and engineering excellence or security perfection
> is
> >>> not the only
> >>> the thing that matters. Usability, maintenance, streamlining
> development
> >>> matter and if you
> >>> have "good enough security", they are more important for users.
> >>>
> >>> I know if you look at it from an "infrastructure security Jenkins"
> point
> >> of
> >>> view - the Jenkins
> >>> you manage is superior when it comes to security.
> >>> This is perfectly clear, and I have no intention to question that or
> >>> disagree with you.
> >>> And yes - in this aspect I fully agree with you.
> >>>
> >>> But there are other aspects which I see (and try to explain).
> >>> While I deeply care about security (as probably you could see from my
> >>> earlier
> >>> communication). Just limiting the discussion to "who is more secure"
> is a
> >>> terrible,
> >>> terrible oversimplification.
> >>>
> >>> I encourage you to exercise empathy and see it from the side I was
> >>> explaining -
> >>> maintenance, features, integration, streamlining development. Those are
> >>> important
> >>> things for developers. Less important for security engineers of course,
> >> but
> >>> if
> >>> we can satisfy security, those are the things that matter.
> >>>
> >>> I think currently we have mitigations for all the security problems we
> >>> found at the project
> >>> level. Also (as I mentioned before) we will have good leverage - via
> >> social
> >>> media pressure
> >>> to push GA into solving those that are 'systemic' problems we found.
> They
> >>> are not
> >>> necessary for our project to solve, but it would simplify your life as
> >> you
> >>> take care of so
> >>> many projects. So the security bounties that I opened are not for me -
> >> they
> >>> are for the
> >>> ASF as a whole and for the security team of ASF. I exercised a lot of
> >>> empathy to your
> >>> team that rather than only solving my problem, I also spend time and
> >> effort
> >>> to push
> >>> GA into solving it for all ASF projects and in the way that ASF infra
> >>> security will be satisfied.
> >>> I did not have to do that. Yet I try to think about your needs there.
> >>>
> >>> And to be honest I expect something in return. Empathy and
> understanding
> >>> other needs
> >>> I have - performance, usability, streamlining development, minimum
> >>> engineering effort
> >>> to solve our problems is the least I can ask for. Help in dealing
> >>> with GitHub and
> >>> exercising ASF powers would be great.
> >>>
> >>> Maybe with GitHub, the problem is that organizations like ASF do not
> >>> exercise
> >>> their leverage and do not clearly state what is essential for them
> while
> >>> working with
> >>> partners like them?
> >>>
> >>> Did the ASF explicitly contact GA and firmly state that solving the
> >>> problem of self-hosted runners is an absolute top priority for
> >>> resolving our performance issues?
> >>>
> >>> I do not know.
> >>>
> >>> Did anyone from the ASF contact GA and firmly state that the two
> >>> bounties I created are essential for the security team to be able to
> >>> provide security for the organization?
> >>>
> >>> I do not know.
> >>>
> >>> Did the ASF push GA in any way in this direction, stating that the
> >>> ASF is considering alternatives? (The "stick" in this discussion)
> >>>
> >>> I do not know.
> >>>
> >>> Did the ASF propose to GA that we can endorse their service, write
> >>> blogs, and ask the 100s of projects that will use GA to endorse their
> >>> service publicly once they start addressing our firmly stated needs
> >>> and expectations? (The "carrot" in this discussion)
> >>>
> >>> I do not know.
> >>>
> >>> This is what I would do if I were at INFRA. I am not. I am not even an
> >> ASF
> >>> member to
> >>> have more insight and visibility into it.
> >>>
> >>> The only thing I can do is to ask for help and see if the ASF Infra is
> >>> willing to help in
> >>> the situation by exercising the powers that I do not have.
> >>>
> >>> For me, this is really a test of whether the ASF has the power to
> >>> negotiate with such partners. If not - maybe it's time to think that
> >>> everything (including GitHub repos) should be self-hosted by INFRA,
> >>> because if you are dealing with partners like that you should have
> >>> some negotiating power; otherwise, you put yourself in a losing
> >>> position.
> >>>
> >>> But again - I do not know much. This is what I would do if I had the
> >>> powers. On my side, I think I've shown that I go above and beyond
> >>> what you might expect from a PMC member of one of the ASF projects,
> >>> and asking for help from the organization that I - so far - proudly
> >>> belong to is the only thing I have left. I have run out of ammo.
> >>>
> >>> So again - please help!
> >>>
> >>> J.
> >>>
> >>>
> >>>> On Sat, Jan 9, 2021 at 11:00 PM Matt Sicker <bo...@gmail.com> wrote:
> >>>
> >>>> I work on the Jenkins security team. We don’t have embarrassing
> >> security
> >>>> failures like this anymore, but part of that is due to the added
> >>> complexity
> >>>> of a secure configuration. By the time GA meets your security
> >> standards,
> >>>> it’ll likely either require non-trivial changes to your CI scripts, or
> >>>> it’ll break various use cases that you otherwise considered to be
> >>> usability
> >>>> enhancements. It’s really getting annoying how every complaint you
> have
> >>>> about every non-Jenkins system isn’t a problem in Jenkins. We have
> more
> >>>> expertise available to customize things in such a way that works for
> >>>> non-proprietary SaaS that most services are optimized for (which is
> why
> >>>> their security models tend to fall short once a large organization
> like
> >>>> Apache tries using something).
> >>>>
> >>>> Many of the features you’re asking from GA are likely non-trivial
> >>>> architecture changes they’ll have to make to accommodate the
> >> non-trivial
> >>>> use cases we have. Or maybe it isn’t and they’re just incompetent?
> >>>>
> >>>> On Sat, Jan 9, 2021 at 05:58 Jarek Potiuk <ja...@potiuk.com> wrote:
> >>>>
> >>>>>>
> >>>>>>>
> >>>>>>> The multiple threads about how shitty those are in practice for
> >>> your
> >>>>>>> needs seem to indicate otherwise. Security and easy learning
> >> curves
> >>>>>>> don't seem to get along too well, do they?
> >>>>>>
> >>>>>
> >>>>> The usability, integration level (especially GitHub Actions) and
> >>>>> maintenance effort needed - this is far, far superior. If only we
> >>>>> could solve one simple problem - securely running the self-hosted
> >>>>> runners for GA - all our problems would be solved INSTANTLY.
> >>>>>
> >>>>> Security issues happen everywhere, at least if they happen in such
> >>>> services
> >>>>> you can
> >>>>> mitigate (we just did it in Airflow- we mitigated all the security
> >>> issues
> >>>>> we found),
> >>>>> open bounty requests (I did - I opened two bounty requests) and then
> >>>>> escalate.
> >>>>> If I do not hear about my 2 security bounties from GitHub shortly,
> >>>>> I am going to start a hell of a social media campaign about it
> >>>>> using all the means I can. I tried to responsibly disclose it but I
> >> am
> >>>>> going to write a nice
> >>>>> blog post about "How to exploit Github Actions" and I am going to
> >> tell
> >>>> them
> >>>>> that before
> >>>>> I publish it and give them a chance to fix it.
> >>>>>
> >>>>> So you have many ways to influence the security of public services
> >> like
> >>>>> that. I think it's
> >>>>> much better than when you have to manage security yourselves.
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>> That would all be possible in Jenkins, some of it would be fairly
> >>>>>>> simple to integrate, others would indeed be non-trivial rewrites.
> >>>>>>>
> >>>>>>
> >>>>>
> >>>>> Yep. The non-trivial ones I am afraid of. It took me a year to
> >> perfect
> >>>> and
> >>>>> optimise
> >>>>> a number of steps in our CI and the problem was - it worked really
> >> well
> >>>>> until it stopped
> >>>>> because of uncontrolled increase of usage from other projects and no
> >>>> secure
> >>>>> way
> >>>>> to add extra resources needed (even if we have all the funds - we now
> >>>> have
> >>>>> 8000 USD
> >>>>> secured from our stakeholders - with outlook for more) to run those.
> >>> But
> >>>> if
> >>>>> you add the
> >>>>> engineering effort needed to migrate, the engineering time for that
> >>> costs
> >>>>> FAR more than
> >>>>> just that - enabling compute resources to use all the engineering
> >>> efforts
> >>>>> you've already
> >>>>> spent. This is no brainer which way is simpler, cheaper and can be
> >> done
> >>>>> faster.
> >>>>> We just need to have a secure way of doing it.
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> You can have your own Jenkins controller for your PMC. This is
> >>> vastly
> >>>>>>> simpler for you to administer than a super time-shared
> >> environment
> >>>>>>> like GA. CI systems seem to be a dime a dozen nowadays, yet not a
> >>>>>>> single one seems to implement job scheduling in a sufficiently
> >>>>>>> customizable way that scales.
> >>>>>>
> >>>>>
> >>>>> I do not want nor need to administer my CI. And I've done that many
> >>> times
> >>>>> in
> >>>>> the past - Jenkins, Bamboo, GitLab - you name it.
> >>>>> Heck - I built and maintained my first custom CI framework for my
> >>> company
> >>>>> some 20 years ago when the "CI" was just being coined.
> >>>>> With CI as a service, I do not want to do it anymore. At all. CI for
> >> me
> >>>>> should just 'be there'.
> >>>>> Great CI is one that you are not aware of its existence until your
> >> test
> >>>>> fail - and
> >>>>> even then you just want to see the logs of your failed tests and
> >> figure
> >>>>> out the reason
> >>>>> This is what you want from CI system. I do not want to learn how
> >>>>> to manage Jenkins, install plugins, configure that etc. This is not
> >> my
> >>>> job,
> >>>>> nor any
> >>>>> one in our project. This requires far more than just setting it
> >>>>> up - it is making sure that it is secure, that it runs 24/7, that it
> >>> gets
> >>>>> updated etc. etc.
> >>>>> This is far more complex than 'just use CI'. We have enough trouble
> >>> with
> >>>>> setting up
> >>>>> and maintaining runners (once we get them securely connected).
> >>>>>
> >>>>> I know it looks differently from the infrastructure person point of
> >>> view
> >>>> -
> >>>>> running
> >>>>> Jenkins is pretty much core part of what you do. But for project
> >>>>> developers, CI
> >>>>> should just 'work'. This is what I get from GitHub Actions. It just
> >>>>> 'works'. I have
> >>>>> to spend 0 effort to maintain it. Sometimes when it does not work, it
> >>>>> pains, but
> >>>>> then it's their problem to fix - and they have to fix it eventually
> >>>> because
> >>>>> they get
> >>>>> pressure from all their customers. In case I run my own jenkins
> >> install
> >>>> and
> >>>>> administer it - all those problems fall on us. I do not want that.
> >> This
> >>>>> moves us
> >>>>> away from doing what we should - develop our product.
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>> Definitely valid points. Any CI migrations are non-trivial,
> >>>> especially
> >>>>>>> once you've set up nice workflows. Perhaps there are some
> >>>> alternatives
> >>>>>>> that can help bridge the gap if GA still can't meet your needs.
> >>> I've
> >>>>>>> seen prow [1] used in various projects in the Kubernetes
> >>> communities,
> >>>>>>> and I'm sure there must be plenty of others.
> >>>>>>
> >>>>>
> >>>>> GA meets all my needs. Except one that I am asking ASF to help with -
> >>>>> make GitHub focus on making a secure way of working with self-hosted
> >>>>> runners. That's it . We even (In November) opened a PR to Github
> >>> Actions
> >>>>> Runner to enable it:  https://github.com/actions/runner/pull/783
> >>>>> But we have not heard anything since.
> >>>>>
> >>>>> This is what I ask INFRA to help with - put pressure on GitHub to
> >> make
> >>> it
> >>>>> happen. I need nothing more - no money, no Jenkins, nothing like
> >> that.
> >>>>> I just want to be able to spend the money we managed to secure.
> >>>>>
> >>>>> J.
> >>>>>
> >>>>>
> >>>>> --
> >>>>> +48 660 796 129
> >>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>>
> >>> Jarek Potiuk
> >>> Polidea <https://www.polidea.com/> | Principal Software Engineer
> >>>
> >>> M: +48 660 796 129 <+48660796129>
> >>> [image: Polidea] <https://www.polidea.com/>
> >>>
> >>
> >
> >
> > --
> > +48 660 796 129
>
>

-- 

*Gavin McDonald*
Systems Administrator
ASF Infrastructure Team

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Dave Fisher <wa...@apache.org>.
Jarek,

I would suggest you have a direct chat with Greg Stein.

Best Regards,
Dave

> On Jan 10, 2021, at 9:08 AM, Jarek Potiuk <ja...@potiuk.com> wrote:
> 
> On Sun, Jan 10, 2021 at 5:28 PM Matt Sicker <bo...@gmail.com> wrote:
> 
>> If we can get GA to handle our use case properly, that would be awesome.
>> Being in the security engineering domain, though, I’m generally pessimistic
>> about security, so please excuse my cynicism.
>> 
> 
> I totally understand. I am a bit more optimistic, especially since we
> could potentially throw some weight behind this - for example, a number
> of serious Apache projects taking a common, coordinated position of
> either "We can praise GitHub for their cooperation" or "They are not
> secure and not willing to improve". That is publicity they would either
> love (the former) or hate (the latter).
> 
> I am happy to help in any way I can - represent INFRA in talks, describe the
> problems and propose solutions, word the "carrot" and "stick" options and
> even prepare what they could look like. I could take part in discussions
> with GitHub, and maybe even escalate this to Microsoft if they do not show
> that they are cooperating - but I have no mandate for doing so.
> I have no power to throw the weight of the ASF into the discussion. I would
> love to do that and lead it, if only this kind of power were delegated to
> me together with the means of contacting GitHub and representing the ASF
> (but I doubt anyone would give me that power; it is a bit risky, as with
> big power I would have no big responsibility).
> 
> Tough call - I am not sure how else I can help INFRA/ASF to help me and
> others.
> 
> J.
> 
> 
>> 
>>> On Sun, Jan 10, 2021 at 03:43 Jarek Potiuk <Ja...@polidea.com>
>>> wrote:
>>> 
>>> I have a feeling (though I cannot know for sure)
>>> that you are underestimating the power of an organization like ASF in
>>> actually 'stating' their requirements and 'expectations' towards GitHub.
>>> 
>>> I am now an engineer, but I used to be a CTO, CEO, Head of IT and Head of
>>> Technology, and I know that a lot can be achieved by proper communication:
>>> stating your expectations clearly, following up, and pushing when you are
>>> dealing with partners like that. Engineering excellence or security
>>> perfection is not the only thing that matters. Usability, maintenance and
>>> streamlined development matter too, and if you have "good enough
>>> security", they are more important for users.
>>> 
>>> I know if you look at it from an "infrastructure security Jenkins" point
>> of
>>> view - the Jenkins
>>> you manage is superior when it comes to security.
>>> This is perfectly clear, and I have no intention to question that or
>>> disagree with you.
>>> And yes - in this aspect I fully agree with you.
>>> 
>>> But there are other aspects which I see (and try to explain).
>>> I deeply care about security (as you could probably see from my earlier
>>> communication), but limiting the discussion to just "who is more secure"
>>> is a terrible, terrible oversimplification.
>>> 
>>> I encourage you to exercise empathy and see it from the side I was
>>> explaining -
>>> maintenance, features, integration, streamlining development. Those are
>>> important
>>> things for developers. Less important for security engineers of course,
>> but
>>> if
>>> we can satisfy security, those are the things that matter.
>>> 
>>> I think currently we have mitigations for all the security problems we
>>> found at the project
>>> level. Also (as I mentioned before) we will have good leverage - via
>> social
>>> media pressure
>>> to push GA into solving those that are 'systemic' problems we found. They
>>> are not
>>> necessary for our project to solve, but it would simplify your life as
>> you
>>> take care of so
>>> many projects. So the security bounties that I opened are not for me -
>> they
>>> are for the
>>> ASF as a whole and for the security team of ASF. I exercised a lot of
>>> empathy to your
>>> team that rather than only solving my problem, I also spend time and
>> effort
>>> to push
>>> GA into solving it for all ASF projects and in the way that ASF infra
>>> security will be satisfied.
>>> I did not have to do that. Yet I try to think about your needs there.
>>> 
>>> And to be honest I expect something in return. Empathy and understanding
>>> other needs
>>> I have - performance, usability, streamlining development, minimum
>>> engineering effort
>>> to solve our problems is the least I can ask for. Help in dealing
>>> with GitHub and
>>> exercising ASF powers would be great.
>>> 
>>> Maybe with GitHub, the problem is that organizations like ASF do not
>>> exercise
>>> their leverage and do not clearly state what is essential for them while
>>> working with
>>> partners like them?
>>> 
>>> Did the ASF explicitly contact GA and firmly state that solving the
>>> problem of self-hosted runners is an absolute top priority for solving
>>> our performance issues?
>>> 
>>> I do not know.
>>> 
>>> Did anyone from the ASF contact GA and firmly state that the two bounties
>>> I created are essential for the security team to be able to provide
>>> security for the organization?
>>> 
>>> I do not know.
>>> 
>>> Did the ASF push GA in any way in this direction  stating that
>>> ASF is considering alternatives? (The "stick" in this discussion)
>>> 
>>> I do not know.
>>> 
>>> Did the ASF propose GA that we can endorse their service, write blogs,
>> and
>>> ask
>>> the 100s of projects that will use GA to endorse their service publicly
>>> once they
>>> start addressing our firmly stated needs and expectations? (The "carrot"
>> in
>>> this
>>> discussion)
>>> 
>>> I do not know.
>>> 
>>> This is what I would do if I were at INFRA. I am not. I am not even an
>> ASF
>>> member to
>>> have more insight and visibility into it.
>>> 
>>> The only thing I can do is to ask for help and see if the ASF Infra is
>>> willing to help in
>>> the situation by exercising the powers that I do not have.
>>> 
>>> For me, this is really a test of whether the ASF has the power to
>>> negotiate with such partners. If not - maybe it's time to consider that
>>> everything (including GitHub repos) should be self-hosted by INFRA,
>>> because if you are dealing with partners like that you should have some
>>> negotiating power; otherwise, you put yourself in a losing position.
>>> 
>>> But again - I do not know much. This is what I would do if I had the
>>> powers.
>>> On my side, I think I've shown that I go above and beyond what you might
>>> expect from a PMC member of one of the ASF projects, and asking for help
>>> from the organization that I - so far - proudly belong to is the only
>>> thing I have left. I have run out of ammo.
>>> 
>>> So again - please help!
>>> 
>>> J.
>>> 
>>> 
>>>> On Sat, Jan 9, 2021 at 11:00 PM Matt Sicker <bo...@gmail.com> wrote:
>>> 
>>>> I work on the Jenkins security team. We don’t have embarrassing
>> security
>>>> failures like this anymore, but part of that is due to the added
>>> complexity
>>>> of a secure configuration. By the time GA meets your security
>> standards,
>>>> it’ll likely either require non-trivial changes to your CI scripts, or
>>>> it’ll break various use cases that you otherwise considered to be
>>> usability
>>>> enhancements. It’s really getting annoying how every complaint you have
>>>> about every non-Jenkins system isn’t a problem in Jenkins. We have more
>>>> expertise available to customize things in such a way that works for
>>>> non-proprietary SaaS that most services are optimized for (which is why
>>>> their security models tend to fall short once a large organization like
>>>> Apache tries using something).
>>>> 
>>>> Many of the features you’re asking from GA are likely non-trivial
>>>> architecture changes they’ll have to make to accommodate the
>> non-trivial
>>>> use cases we have. Or maybe it isn’t and they’re just incompetent?
>>>> 
>>>> On Sat, Jan 9, 2021 at 05:58 Jarek Potiuk <ja...@potiuk.com> wrote:
>>>> 
>>>>>> 
>>>>>>> 
>>>>>>> The multiple threads about how shitty those are in practice for
>>> your
>>>>>>> needs seem to indicate otherwise. Security and easy learning
>> curves
>>>>>>> don't seem to get along too well, do they?
>>>>>> 
>>>>> 
>>>>> The usability, integration level (especially GitHub Actions) and
>>>>> maintenance effort needed - this is far, far superior. If only we
>>>>> could solve one simple problem - securely running the self-hosted
>>>>> runners for GA - all our problems would be solved INSTANTLY.
>>>>> 
>>>>> Security issues happen everywhere, at least if they happen in such
>>>> services
>>>>> you can
>>>>> mitigate (we just did it in Airflow- we mitigated all the security
>>> issues
>>>>> we found),
>>>>> open bounty requests (I did - I opened two bounty requests) and then
>>>>> escalate.
>>>>> If I do not hear about my 2 security bounties from GitHub shortly,
>>>>> I am going to start a hell of a social media campaign about it
>>>>> using all the means I can. I tried to responsibly disclose it but I
>> am
>>>>> going to write a nice
>>>>> blog post about "How to exploit Github Actions" and I am going to
>> tell
>>>> them
>>>>> that before
>>>>> I publish it and give them a chance to fix it.
>>>>> 
>>>>> So you have many ways to influence the security of public services
>> like
>>>>> that. I think it's
>>>>> much better than when you have to manage security yourselves.
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>> That would all be possible in Jenkins, some of it would be fairly
>>>>>>> simple to integrate, others would indeed be non-trivial rewrites.
>>>>>>> 
>>>>>> 
>>>>> 
>>>>> Yep. The non-trivial ones I am afraid of. It took me a year to
>> perfect
>>>> and
>>>>> optimise
>>>>> a number of steps in our CI and the problem was - it worked really
>> well
>>>>> until it stopped
>>>>> because of uncontrolled increase of usage from other projects and no
>>>> secure
>>>>> way
>>>>> to add extra resources needed (even if we have all the funds - we now
>>>> have
>>>>> 8000 USD
>>>>> secured from our stakeholders - with outlook for more) to run those.
>>> But
>>>> if
>>>>> you add the
>>>>> engineering effort needed to migrate, the engineering time for that
>>> costs
>>>>> FAR more than
>>>>> just that - enabling compute resources to use all the engineering
>>> efforts
>>>>> you've already
>>>>> spent. This is no brainer which way is simpler, cheaper and can be
>> done
>>>>> faster.
>>>>> We just need to have a secure way of doing it.
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> You can have your own Jenkins controller for your PMC. This is
>>> vastly
>>>>>>> simpler for you to administer than a super time-shared
>> environment
>>>>>>> like GA. CI systems seem to be a dime a dozen nowadays, yet not a
>>>>>>> single one seems to implement job scheduling in a sufficiently
>>>>>>> customizable way that scales.
>>>>>> 
>>>>> 
>>>>> I do not want nor need to administer my CI. And I've done that many
>>> times
>>>>> in
>>>>> the past - Jenkins, Bamboo, GitLab - you name it.
>>>>> Heck - I built and maintained my first custom CI framework for my
>>> company
>>>>> some 20 years ago when the "CI" was just being coined.
>>>>> With CI as a service, I do not want to do it anymore. At all. CI for
>> me
>>>>> should just 'be there'.
>>>>> Great CI is one that you are not aware of its existence until your
>> test
>>>>> fail - and
>>>>> even then you just want to see the logs of your failed tests and
>> figure
>>>>> out the reason
>>>>> This is what you want from CI system. I do not want to learn how
>>>>> to manage Jenkins, install plugins, configure that etc. This is not
>> my
>>>> job,
>>>>> nor any
>>>>> one in our project. This requires far more than just setting it
>>>>> up - it is making sure that it is secure, that it runs 24/7, that it
>>> gets
>>>>> updated etc. etc.
>>>>> This is far more complex than 'just use CI'. We have enough trouble
>>> with
>>>>> setting up
>>>>> and maintaining runners (once we get them securely connected).
>>>>> 
>>>>> I know it looks differently from the infrastructure person point of
>>> view
>>>> -
>>>>> running
>>>>> Jenkins is pretty much core part of what you do. But for project
>>>>> developers, CI
>>>>> should just 'work'. This is what I get from GitHub Actions. It just
>>>>> 'works'. I have
>>>>> to spend 0 effort to maintain it. Sometimes when it does not work, it
>>>>> pains, but
>>>>> then it's their problem to fix - and they have to fix it eventually
>>>> because
>>>>> they get
>>>>> pressure from all their customers. In case I run my own jenkins
>> install
>>>> and
>>>>> administer it - all those problems fall on us. I do not want that.
>> This
>>>>> moves us
>>>>> away from doing what we should - develop our product.
>>>>> 
>>>>> 
>>>>>>> 
>>>>>>> Definitely valid points. Any CI migrations are non-trivial,
>>>> especially
>>>>>>> once you've set up nice workflows. Perhaps there are some
>>>> alternatives
>>>>>>> that can help bridge the gap if GA still can't meet your needs.
>>> I've
>>>>>>> seen prow [1] used in various projects in the Kubernetes
>>> communities,
>>>>>>> and I'm sure there must be plenty of others.
>>>>>> 
>>>>> 
>>>>> GA meets all my needs. Except one that I am asking ASF to help with -
>>>>> make GitHub focus on making a secure way of working with self-hosted
>>>>> runners. That's it . We even (In November) opened a PR to Github
>>> Actions
>>>>> Runner to enable it:  https://github.com/actions/runner/pull/783
>>>>> But we have not heard anything since.
>>>>> 
>>>>> This is what I ask INFRA to help with - put pressure on GitHub to
>> make
>>> it
>>>>> happen. I need nothing more - no money, no Jenkins, nothing like
>> that.
>>>>> I just want to be able to spend the money we managed to secure.
>>>>> 
>>>>> J.
>>>>> 
>>>>> 
>>>>> --
>>>>> +48 660 796 129
>>>>> 
>>>> 
>>> 
>>> 
>>> --
>>> 
>>> Jarek Potiuk
>>> Polidea <https://www.polidea.com/> | Principal Software Engineer
>>> 
>>> M: +48 660 796 129 <+48660796129>
>>> [image: Polidea] <https://www.polidea.com/>
>>> 
>> 
> 
> 
> -- 
> +48 660 796 129


Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Matt Sicker <bo...@gmail.com>.
Do we have anyone here who works at GitHub that could potentially
help? An internal contact would be the easiest (e.g., I was that
initial contact point to get us the CloudBees CI version of Jenkins
here to ease our own Jenkins usage problems). Otherwise, I'd imagine
that INFRA are likely to be important allies here in requesting GitHub
do anything.

On Sun, 10 Jan 2021 at 11:08, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> On Sun, Jan 10, 2021 at 5:28 PM Matt Sicker <bo...@gmail.com> wrote:
>
> > If we can get GA to handle our use case properly, that would be awesome.
> > Being in the security engineering domain, though, I’m generally pessimistic
> > about security, so please excuse my cynicism.
> >
>
> I totally understand. I am a bit more optimistic, especially since we
> could potentially throw some weight behind this - for example, a number
> of serious Apache projects taking a common, coordinated position of
> either "We can praise GitHub for their cooperation" or "They are not
> secure and not willing to improve". That is publicity they would either
> love (the former) or hate (the latter).
>
> I am happy to help in any way I can - represent INFRA in talks, describe the
> problems and propose solutions, word the "carrot" and "stick" options and
> even prepare what they could look like. I could take part in discussions
> with GitHub, and maybe even escalate this to Microsoft if they do not show
> that they are cooperating - but I have no mandate for doing so.
> I have no power to throw the weight of the ASF into the discussion. I would
> love to do that and lead it, if only this kind of power were delegated to
> me together with the means of contacting GitHub and representing the ASF
> (but I doubt anyone would give me that power; it is a bit risky, as with
> big power I would have no big responsibility).
>
> Tough call - I am not sure how else I can help INFRA/ASF to help me and
> others.
>
> J.
>
>
> >
> > On Sun, Jan 10, 2021 at 03:43 Jarek Potiuk <Ja...@polidea.com>
> > wrote:
> >
> > > I have a feeling (though I cannot know for sure)
> > > that you are underestimating the power of an organization like ASF in
> > > actually 'stating' their requirements and 'expectations' towards GitHub.
> > >
> > > I am now an engineer, but I used to be a CTO, CEO, Head of IT and Head
> > > of Technology, and I know that a lot can be achieved by proper
> > > communication: stating your expectations clearly, following up, and
> > > pushing when you are dealing with partners like that. Engineering
> > > excellence or security perfection is not the only thing that matters.
> > > Usability, maintenance and streamlined development matter too, and if
> > > you have "good enough security", they are more important for users.
> > >
> > > I know if you look at it from an "infrastructure security Jenkins" point
> > of
> > > view - the Jenkins
> > > you manage is superior when it comes to security.
> > > This is perfectly clear, and I have no intention to question that or
> > > disagree with you.
> > > And yes - in this aspect I fully agree with you.
> > >
> > > But there are other aspects which I see (and try to explain).
> > > I deeply care about security (as you could probably see from my earlier
> > > communication), but limiting the discussion to just "who is more secure"
> > > is a terrible, terrible oversimplification.
> > >
> > > I encourage you to exercise empathy and see it from the side I was
> > > explaining -
> > > maintenance, features, integration, streamlining development. Those are
> > > important
> > > things for developers. Less important for security engineers of course,
> > but
> > > if
> > > we can satisfy security, those are the things that matter.
> > >
> > > I think currently we have mitigations for all the security problems we
> > > found at the project
> > > level. Also (as I mentioned before) we will have good leverage - via
> > social
> > > media pressure
> > > to push GA into solving those that are 'systemic' problems we found. They
> > > are not
> > > necessary for our project to solve, but it would simplify your life as
> > you
> > > take care of so
> > > many projects. So the security bounties that I opened are not for me -
> > they
> > > are for the
> > > ASF as a whole and for the security team of ASF. I exercised a lot of
> > > empathy to your
> > > team that rather than only solving my problem, I also spend time and
> > effort
> > > to push
> > > GA into solving it for all ASF projects and in the way that ASF infra
> > > security will be satisfied.
> > > I did not have to do that. Yet I try to think about your needs there.
> > >
> > > And to be honest I expect something in return. Empathy and understanding
> > > other needs
> > > I have - performance, usability, streamlining development, minimum
> > > engineering effort
> > > to solve our problems is the least I can ask for. Help in dealing
> > > with GitHub and
> > > exercising ASF powers would be great.
> > >
> > > Maybe with GitHub, the problem is that organizations like ASF do not
> > > exercise
> > > their leverage and do not clearly state what is essential for them while
> > > working with
> > > partners like them?
> > >
> > > Did the ASF explicitly contact GA and firmly state that solving the
> > > problem of self-hosted runners is an absolute top priority for solving
> > > our performance issues?
> > >
> > > I do not know.
> > >
> > > Did anyone from the ASF contact GA and firmly state that the two
> > > bounties I created are essential for the security team to be able to
> > > provide security for the organization?
> > >
> > > I do not know.
> > >
> > > Did the ASF push GA in any way in this direction  stating that
> > > ASF is considering alternatives? (The "stick" in this discussion)
> > >
> > > I do not know.
> > >
> > > Did the ASF propose GA that we can endorse their service, write blogs,
> > and
> > > ask
> > > the 100s of projects that will use GA to endorse their service publicly
> > > once they
> > > start addressing our firmly stated needs and expectations? (The "carrot"
> > in
> > > this
> > > discussion)
> > >
> > > I do not know.
> > >
> > > This is what I would do if I were at INFRA. I am not. I am not even an
> > ASF
> > > member to
> > > have more insight and visibility into it.
> > >
> > > The only thing I can do is to ask for help and see if the ASF Infra is
> > > willing to help in
> > > the situation by exercising the powers that I do not have.
> > >
> > > For me, this is really a test of whether the ASF has the power to
> > > negotiate with such partners. If not - maybe it's time to consider that
> > > everything (including GitHub repos) should be self-hosted by INFRA,
> > > because if you are dealing with partners like that you should have some
> > > negotiating power; otherwise, you put yourself in a losing position.
> > >
> > > But again - I do not know much. This is what I would do if I had the
> > > powers.
> > > On my side, I think I've shown that I go above and beyond what you might
> > > expect from a PMC member of one of the ASF projects, and asking for help
> > > from the organization that I - so far - proudly belong to is the only
> > > thing I have left. I have run out of ammo.
> > >
> > > So again - please help!
> > >
> > > J.
> > >
> > >
> > > On Sat, Jan 9, 2021 at 11:00 PM Matt Sicker <bo...@gmail.com> wrote:
> > >
> > > > I work on the Jenkins security team. We don’t have embarrassing
> > security
> > > > failures like this anymore, but part of that is due to the added
> > > complexity
> > > > of a secure configuration. By the time GA meets your security
> > standards,
> > > > it’ll likely either require non-trivial changes to your CI scripts, or
> > > > it’ll break various use cases that you otherwise considered to be
> > > usability
> > > > enhancements. It’s really getting annoying how every complaint you have
> > > > about every non-Jenkins system isn’t a problem in Jenkins. We have more
> > > > expertise available to customize things in such a way that works for
> > > > non-proprietary SaaS that most services are optimized for (which is why
> > > > their security models tend to fall short once a large organization like
> > > > Apache tries using something).
> > > >
> > > > Many of the features you’re asking from GA are likely non-trivial
> > > > architecture changes they’ll have to make to accommodate the
> > non-trivial
> > > > use cases we have. Or maybe it isn’t and they’re just incompetent?
> > > >
> > > > On Sat, Jan 9, 2021 at 05:58 Jarek Potiuk <ja...@potiuk.com> wrote:
> > > >
> > > > > >
> > > > > > >
> > > > > > > The multiple threads about how shitty those are in practice for
> > > your
> > > > > > > needs seem to indicate otherwise. Security and easy learning
> > curves
> > > > > > > don't seem to get along too well, do they?
> > > > > >
> > > > >
> > > > > > The usability, integration level (especially GitHub Actions) and
> > > > > > maintenance effort needed - this is far, far superior. If only we
> > > > > > could solve one simple problem - securely running the self-hosted
> > > > > > runners for GA - all our problems would be solved INSTANTLY.
> > > > >
> > > > > Security issues happen everywhere, at least if they happen in such
> > > > services
> > > > > you can
> > > > > mitigate (we just did it in Airflow- we mitigated all the security
> > > issues
> > > > > we found),
> > > > > open bounty requests (I did - I opened two bounty requests) and then
> > > > > escalate.
> > > > > If I do not hear about my 2 security bounties from GitHub shortly,
> > > > > I am going to start a hell of a social media campaign about it
> > > > >  using all the means I can. I tried to responsibly disclose it but I
> > am
> > > > > going to write a nice
> > > > > blog post about "How to exploit Github Actions" and I am going to
> > tell
> > > > them
> > > > > that before
> > > > > I publish it and give them a chance to fix it.
> > > > >
> > > > > So you have many ways to influence the security of public services
> > like
> > > > > that. I think it's
> > > > > much better than when you have to manage security yourselves.
> > > > >
> > > > >
> > > > > > >
> > > > > > > That would all be possible in Jenkins, some of it would be fairly
> > > > > > > simple to integrate, others would indeed be non-trivial rewrites.
> > > > > > >
> > > > > >
> > > > >
> > > > > Yep. The non-trivial ones I am afraid of. It took me a year to
> > perfect
> > > > and
> > > > > optimise
> > > > > a number of steps in our CI and the problem was - it worked really
> > well
> > > > > until it stopped
> > > > > because of uncontrolled increase of usage from other projects and no
> > > > secure
> > > > > way
> > > > > to add extra resources needed (even if we have all the funds - we now
> > > > have
> > > > > 8000 USD
> > > > > secured from our stakeholders - with outlook for more) to run those.
> > > But
> > > > if
> > > > > you add the
> > > > > engineering effort needed to migrate, the engineering time for that
> > > costs
> > > > > FAR more than
> > > > > just that - enabling compute resources to use all the engineering
> > > efforts
> > > > > you've already
> > > > > spent. This is no brainer which way is simpler, cheaper and can be
> > done
> > > > > faster.
> > > > > We just need to have a secure way of doing it.
> > > > >
> > > > >
> > > > > > >
> > > > > > >
> > > > > > > You can have your own Jenkins controller for your PMC. This is
> > > vastly
> > > > > > > simpler for you to administer than a super time-shared
> > environment
> > > > > > > like GA. CI systems seem to be a dime a dozen nowadays, yet not a
> > > > > > > single one seems to implement job scheduling in a sufficiently
> > > > > > > customizable way that scales.
> > > > > >
> > > > >
> > > > > I do not want nor need to administer my CI. And I've done that many
> > > times
> > > > > in
> > > > > the past - Jenkins, Bamboo, GitLab - you name it.
> > > > > Heck - I built and maintained my first custom CI framework for my
> > > company
> > > > > some 20 years ago when the "CI" was just being coined.
> > > > > With CI as a service, I do not want to do it anymore. At all. CI for
> > me
> > > > > should just 'be there'.
> > > > > Great CI is one that you are not aware of its existence until your
> > test
> > > > > fail - and
> > > > > even then you just want to see the logs of your failed tests and
> > figure
> > > > > out the reason
> > > > > This is what you want from CI system. I do not want to learn how
> > > > > to manage Jenkins, install plugins, configure that etc. This is not
> > my
> > > > job,
> > > > > nor any
> > > > > one in our project. This requires far more than just setting it
> > > > > up - it is making sure that it is secure, that it runs 24/7, that it
> > > gets
> > > > > updated etc. etc.
> > > > > This is far more complex than 'just use CI'. We have enough trouble
> > > with
> > > > > setting up
> > > > > and maintaining runners (once we get them securely connected).
> > > > >
> > > > > I know it looks differently from the infrastructure person point of
> > > view
> > > > -
> > > > > running
> > > > > Jenkins is pretty much core part of what you do. But for project
> > > > > developers, CI
> > > > > should just 'work'. This is what I get from GitHub Actions. It just
> > > > > 'works'. I have
> > > > > to spend 0 effort to maintain it. Sometimes when it does not work, it
> > > > > pains, but
> > > > > then it's their problem to fix - and they have to fix it eventually
> > > > because
> > > > > they get
> > > > > pressure from all their customers. In case I run my own jenkins
> > install
> > > > and
> > > > > administer it - all those problems fall on us. I do not want that.
> > This
> > > > > moves us
> > > > > away from doing what we should - develop our product.
> > > > >
> > > > >
> > > > > > >
> > > > > > > Definitely valid points. Any CI migrations are non-trivial,
> > > > especially
> > > > > > > once you've set up nice workflows. Perhaps there are some
> > > > alternatives
> > > > > > > that can help bridge the gap if GA still can't meet your needs.
> > > I've
> > > > > > > seen prow [1] used in various projects in the Kubernetes
> > > communities,
> > > > > > > and I'm sure there must be plenty of others.
> > > > > >
> > > > >
> > > > > GA meets all my needs. Except one that I am asking ASF to help with -
> > > > > make GitHub focus on making a secure way of working with self-hosted
> > > > > runners. That's it . We even (In November) opened a PR to Github
> > > Actions
> > > > > Runner to enable it:  https://github.com/actions/runner/pull/783
> > > > > But we have not heard anything since.
> > > > >
> > > > > This is what I ask INFRA to help with - put pressure on GitHub to
> > make
> > > it
> > > > > happen. I need nothing more - no money, no Jenkins, nothing like
> > that.
> > > > > I just want to be able to spend the money we managed to secure.
> > > > >
> > > > > J.
> > > > >
> > > > >
> > > > > --
> > > > > +48 660 796 129
> > > > >
> > > >
> > >
> > >
> > > --
> > >
> > > Jarek Potiuk
> > > Polidea <https://www.polidea.com/> | Principal Software Engineer
> > >
> > > M: +48 660 796 129 <+48660796129>
> > > [image: Polidea] <https://www.polidea.com/>
> > >
> >
>
>
> --
> +48 660 796 129

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Jarek Potiuk <ja...@potiuk.com>.
On Sun, Jan 10, 2021 at 5:28 PM Matt Sicker <bo...@gmail.com> wrote:

> If we can get GA to handle our use case properly, that would be awesome.
> Being in the security engineering domain, though, I’m generally pessimistic
> about security, so please excuse my cynicism.
>

I totally understand. I am a bit more optimistic, especially since we
could potentially throw some weight around: a number of serious Apache
projects taking common, coordinated action with the message that we can
either say "We praise GitHub for their cooperation" or "They are not
secure and not willing to improve". That is publicity they would either
love (the former) or hate (the latter).

I am happy to help in any way I can - represent INFRA in talks, describe the
problems and propose solutions, word the "carrot" and "stick" options and
even prepare what they could look like. I could take part in discussions
with GitHub, maybe even escalate this to Microsoft if they do not show
that they are cooperating - but I have no legitimacy to do so.
I have no power to throw the weight of the ASF into the discussion. I would
love to do that, and to lead it, if only this kind of power were at least
delegated to me and I were given the means of contacting GitHub and
representing the ASF (but I doubt anyone would give me that power; it is a
bit risky, as with big power I would have no big responsibility).

Tough call - I am not sure how else I can help INFRA/ASF to help me and
others.

J.


-- 
+48 660 796 129

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Matt Sicker <bo...@gmail.com>.
If we can get GA to handle our use case properly, that would be awesome.
Being in the security engineering domain, though, I’m generally pessimistic
about security, so please excuse my cynicism.

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Jarek Potiuk <Ja...@polidea.com>.
I have a feeling (though I cannot know for sure)
that you are underestimating the power of an organization like ASF in
actually 'stating' their requirements and 'expectations' towards GitHub.

I am now an engineer, but I used to be CTO, CEO, Head of IT, Head of
Technology, and I know that a lot can be achieved by proper communication:
stating your expectations clearly, following up and pushing when you are
dealing with partners like that. Engineering excellence or security
perfection is not the only thing that matters. Usability, maintenance and
streamlining development matter too, and if you have "good enough
security", they are more important for users.

I know that if you look at it from an "infrastructure security Jenkins"
point of view, the Jenkins you manage is superior when it comes to security.
This is perfectly clear, and I have no intention of questioning that or
disagreeing with you. And yes - in this aspect I fully agree with you.

But there are other aspects which I see (and try to explain).
While I deeply care about security (as you could probably see from my
earlier communication), just limiting the discussion to "who is more
secure" is a terrible, terrible oversimplification.

I encourage you to exercise empathy and see it from the side I was
explaining - maintenance, features, integration, streamlining development.
Those are important things for developers. They are less important for
security engineers, of course, but if we can satisfy security, those are
the things that matter.

I think we currently have mitigations for all the security problems we
found at the project level. Also (as I mentioned before) we will have good
leverage - via social media pressure - to push GA into solving the
'systemic' problems we found. Those are not necessary for our project to
solve, but solving them would simplify your life, as you take care of so
many projects. So the security bounties that I opened are not for me - they
are for the ASF as a whole and for the security team of the ASF. I exercised
a lot of empathy towards your team: rather than only solving my problem, I
also spend time and effort to push GA into solving it for all ASF projects,
in a way that will satisfy ASF infra security.
I did not have to do that. Yet I try to think about your needs there.

And to be honest I expect something in return. Empathy and understanding
of the other needs I have - performance, usability, streamlining
development, minimum engineering effort to solve our problems - is the
least I can ask for. Help in dealing with GitHub and exercising ASF powers
would be great.

Maybe with GitHub, the problem is that organizations like the ASF do not
exercise their leverage and do not clearly state what is essential for them
while working with partners like them?

Did the ASF explicitly contact GA and firmly state that solving the
problem of self-hosted runners is an absolute top priority for solving our
performance issues?

I do not know.

Did anyone from ASF contact GA and firmly state that the two bounties I
created are essential for the security team to be able to provide security
for the organization?

I do not know.

Did the ASF push GA in any way in this direction by stating that
ASF is considering alternatives? (The "stick" in this discussion)

I do not know.

Did the ASF propose to GA that we can endorse their service, write blogs,
and ask the 100s of projects that will use GA to endorse their service
publicly once they start addressing our firmly stated needs and
expectations? (The "carrot" in this discussion)

I do not know.

This is what I would do if I were at INFRA. I am not. I am not even an ASF
member to
have more insight and visibility into it.

The only thing I can do is to ask for help and see if the ASF Infra is
willing to help in
the situation by exercising the powers that I do not have.

For me, this is really a test of whether the ASF has the power to negotiate
with such partners. If not - maybe it's time to think that everything
(including GitHub repos) should be self-hosted by INFRA, because if you are
dealing with partners like that you should have some negotiating power;
otherwise, you put yourself in a losing position.

But again - I do not know much. This is what I would do if I had the
powers.
On my side, I think I've shown that I go above and beyond what you might
expect from a PMC member of one of the ASF projects, and asking for help
from the organization I - so far - proudly belong to is the only thing I
have left. I have run out of ammo.

So again - please help!

J.


On Sat, Jan 9, 2021 at 11:00 PM Matt Sicker <bo...@gmail.com> wrote:

> I work on the Jenkins security team. We don’t have embarrassing security
> failures like this anymore, but part of that is due to the added complexity
> of a secure configuration. By the time GA meets your security standards,
> it’ll likely either require non-trivial changes to your CI scripts, or
> it’ll break various use cases that you otherwise considered to be usability
> enhancements. It’s really getting annoying how every complaint you have
> about every non-Jenkins system isn’t a problem in Jenkins. We have more
> expertise available to customize things in such a way that works for
> non-proprietary SaaS that most services are optimized for (which is why
> their security models tend to fall short once a large organization like
> Apache tries using something).
>
> Many of the features you’re asking from GA are likely non-trivial
> architecture changes they’ll have to make to accommodate the non-trivial
> use cases we have. Or maybe it isn’t and they’re just incompetent?


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Matt Sicker <bo...@gmail.com>.
I work on the Jenkins security team. We don’t have embarrassing security
failures like this anymore, but part of that is due to the added complexity
of a secure configuration. By the time GA meets your security standards,
it’ll likely either require non-trivial changes to your CI scripts, or
it’ll break various use cases that you otherwise considered to be usability
enhancements. It’s really getting annoying how every complaint you have
about every non-Jenkins system isn’t a problem in Jenkins. We have more
expertise available to customize things in such a way that works for
non-proprietary SaaS that most services are optimized for (which is why
their security models tend to fall short once a large organization like
Apache tries using something).

Many of the features you’re asking from GA are likely non-trivial
architecture changes they’ll have to make to accommodate the non-trivial
use cases we have. Or maybe it isn’t and they’re just incompetent?

On Sat, Jan 9, 2021 at 05:58 Jarek Potiuk <ja...@potiuk.com> wrote:

> >
> > >
> > > The multiple threads about how shitty those are in practice for your
> > > needs seem to indicate otherwise. Security and easy learning curves
> > > don't seem to get along too well, do they?
> >
>
> The usabilty, integration level (especially GitHub Actions), maintenance
> effort needed
> - thi is far, far superior. If only we could solve one simple problem -
> securely running
> the self-hosted runners for GA - all our problems are solved  INSTANTLY.
>
> Security issues happen everywhere, at least if they happen in such services
> you can
> mitigate (we just did it in Airflow- we mitigated all the security issues
> we found),
> open bounty requests (I did - I opened two bounty requests) and then
> escalate.
> If I do not hear about my 2 security bounties from GitHub shortly,
> I am going to start a hell of a social media campaign about it
>  using all the means I can. I tried to responsibly disclose it but I am
> going to write a nice
> blog post about "How to exploit Github Actions" and I am going to tell them
> that before
> I publish it and give them a chance to fix it.
>
> So you have many ways to influence the security of public services like
> that. I think it's
> much better than when you have to manage security yourselves.
>
>
> > >
> > > That would all be possible in Jenkins, some of it would be fairly
> > > simple to integrate, others would indeed be non-trivial rewrites.
> > >
> >
>
> Yep. The non-trivial ones I am afraid of. It took me a year to perfect and
> optimise
> a number of steps in our CI and the problem was - it worked really well
> until it stopped
> because of uncontrolled increase of usage from other projects and no secure
> way
> to add extra resources needed (even if we have all the funds - we now have
> 8000 USD
> secured from our stakeholders - with outlook for more) to run those. But if
> you add the
> engineering effort needed to migrate, the engineering time for that costs
> FAR more than
> just that - enabling compute resources to use all the engineering efforts
> you've already
> spent. This is no brainer which way is simpler, cheaper and can be done
> faster.
> We just need to have a secure way of doing it.
>
>
> > >
> > >
> > > You can have your own Jenkins controller for your PMC. This is vastly
> > > simpler for you to administer than a super time-shared environment
> > > like GA. CI systems seem to be a dime a dozen nowadays, yet not a
> > > single one seems to implement job scheduling in a sufficiently
> > > customizable way that scales.
> >
>
> I do not want nor need to administer my CI. And I've done that many times
> in
> the past - Jenkins, Bamboo, GitLab - you name it.
> Heck - I built and maintained my first custom CI framework for my company
> some 20 years ago when the "CI" was just being coined.
> With CI as a service, I do not want to do it anymore. At all. CI for me
> should just 'be there'.
> Great CI is one that you are not aware of its existence until your test
> fail - and
> even then you just want to see the logs of your failed tests and figure
> out the reason
> This is what you want from CI system. I do not want to learn how
> to manage Jenkins, install plugins, configure that etc. This is not my job,
> nor any
> one in our project. This requires far more than just setting it
> up - it is making sure that it is secure, that it runs 24/7, that it gets
> updated etc. etc.
> This is far more complex than 'just use CI'. We have enough trouble with
> setting up
> and maintaining runners (once we get them securely connected).
>
> I know it looks differently from the infrastructure person point of view -
> running
> Jenkins is pretty much core part of what you do. But for project
> developers, CI
> should just 'work'. This is what I get from GitHub Actions. It just
> 'works'. I have
> to spend 0 effort to maintain it. Sometimes when it does not work, it
> pains, but
> then it's their problem to fix - and they have to fix it eventually because
> they get
> pressure from all their customers. In case I run my own jenkins install and
> administer it - all those problems fall on us. I do not want that. This
> moves us
> away from doing what we should - develop our product.
>
>
> > >
> > > Definitely valid points. Any CI migrations are non-trivial, especially
> > > once you've set up nice workflows. Perhaps there are some alternatives
> > > that can help bridge the gap if GA still can't meet your needs. I've
> > > seen prow [1] used in various projects in the Kubernetes communities,
> > > and I'm sure there must be plenty of others.
> >
>
> GA meets all my needs. Except one that I am asking ASF to help with -
> make GitHub focus on making a secure way of working with self-hosted
> runners. That's it . We even (In November) opened a PR to Github Actions
> Runner to enable it:  https://github.com/actions/runner/pull/783
> But we have not heard anything since.
>
> This is what I ask INFRA to help with - put pressure on GitHub to make it
> happen. I need nothing more - no money, no Jenkins, nothing like that.
> I just want to be able to spend the money we managed to secure.
>
> J.
>
>
> --
> +48 660 796 129
>

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Jarek Potiuk <ja...@potiuk.com>.
>
> >
> > The multiple threads about how shitty those are in practice for your
> > needs seem to indicate otherwise. Security and easy learning curves
> > don't seem to get along too well, do they?
>

The usability, integration level (especially GitHub Actions), and
maintenance effort needed - this is far, far superior. If only we could
solve one simple problem - securely running the self-hosted runners for
GA - all our problems would be solved INSTANTLY.

Security issues happen everywhere, but at least if they happen in such
services you can mitigate (we just did it in Airflow - we mitigated all the
security issues we found), open bounty requests (I did - I opened two
bounty requests) and then escalate.
If I do not hear about my 2 security bounties from GitHub shortly,
I am going to start a hell of a social media campaign about it
using all the means I can. I tried to responsibly disclose it, but I am
going to write a nice blog post about "How to exploit Github Actions" and
I am going to tell them that before I publish it and give them a chance to
fix it.

So you have many ways to influence the security of public services like
that. I think it's
much better than when you have to manage security yourselves.


> >
> > That would all be possible in Jenkins, some of it would be fairly
> > simple to integrate, others would indeed be non-trivial rewrites.
> >
>

Yep. The non-trivial ones are what I am afraid of. It took me a year to
perfect and optimise a number of steps in our CI, and the problem was - it
worked really well until it stopped because of an uncontrolled increase of
usage from other projects and no secure way to add the extra resources
needed (even if we have all the funds - we now have 8000 USD secured from
our stakeholders - with an outlook for more) to run those. But if you add
the engineering effort needed to migrate, the engineering time for that
costs FAR more than just that - enabling compute resources to use all the
engineering efforts you've already spent. It is a no-brainer which way is
simpler, cheaper and can be done faster.
We just need to have a secure way of doing it.


> >
> >
> > You can have your own Jenkins controller for your PMC. This is vastly
> > simpler for you to administer than a super time-shared environment
> > like GA. CI systems seem to be a dime a dozen nowadays, yet not a
> > single one seems to implement job scheduling in a sufficiently
> > customizable way that scales.
>

I do not want nor need to administer my CI. And I've done that many times
in the past - Jenkins, Bamboo, GitLab - you name it.
Heck - I built and maintained my first custom CI framework for my company
some 20 years ago when the term "CI" was just being coined.
With CI as a service, I do not want to do it anymore. At all. CI for me
should just 'be there'.
A great CI is one whose existence you are not aware of until your tests
fail - and even then you just want to see the logs of your failed tests and
figure out the reason.
This is what you want from a CI system. I do not want to learn how
to manage Jenkins, install plugins, configure that etc. This is not my job,
nor that of anyone in our project. This requires far more than just setting
it up - it is making sure that it is secure, that it runs 24/7, that it
gets updated etc. etc.
This is far more complex than 'just use CI'. We have enough trouble with
setting up and maintaining runners (once we get them securely connected).

I know it looks different from an infrastructure person's point of view -
running Jenkins is pretty much a core part of what you do. But for project
developers, CI should just 'work'. This is what I get from GitHub Actions.
It just 'works'. I have to spend 0 effort to maintain it. Sometimes when it
does not work, it hurts, but then it's their problem to fix - and they have
to fix it eventually because they get pressure from all their customers. If
I run my own Jenkins install and administer it - all those problems fall on
us. I do not want that. This moves us away from doing what we should -
developing our product.


> >
> > Definitely valid points. Any CI migrations are non-trivial, especially
> > once you've set up nice workflows. Perhaps there are some alternatives
> > that can help bridge the gap if GA still can't meet your needs. I've
> > seen prow [1] used in various projects in the Kubernetes communities,
> > and I'm sure there must be plenty of others.
>

GA meets all my needs. Except one that I am asking ASF to help with -
make GitHub focus on making a secure way of working with self-hosted
runners. That's it. We even (in November) opened a PR to the GitHub Actions
Runner to enable it: https://github.com/actions/runner/pull/783
But we have not heard anything since.

This is what I ask INFRA to help with - put pressure on GitHub to make it
happen. I need nothing more - no money, no Jenkins, nothing like that.
I just want to be able to spend the money we managed to secure.

J.


-- 
+48 660 796 129

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Matt Sicker <bo...@gmail.com>.
Oops, forgot the footnote:

[1]: https://github.com/kubernetes/test-infra/tree/master/prow

On Fri, 8 Jan 2021 at 16:15, Matt Sicker <bo...@gmail.com> wrote:
>
> On Fri, 8 Jan 2021 at 16:04, Jarek Potiuk <ja...@potiuk.com> wrote:
> >
> > Github Actions, GitLab,  TravisCI. even Cloud Build are soooo much easier
> > to work with and get your stuff done.
>
> The multiple threads about how shitty those are in practice for your
> needs seem to indicate otherwise. Security and easy learning curves
> don't seem to get along too well, do they?
>
> > Repos is great. Rewriting all the logic we implemented in our actions (
> > https://github.com/apache/airflow/blob/master/CI.rst) where we use not only
> > the "tasks" but also:
> > * the image registry
> > * updating Github issues
> > * adding labels
> > * PR status integration
> > * pushing changes to dedicated branches in our repo
> > * cancelling duplicated workflows (of 3 kinds)
> > * selective builds based on what files changed, etc.
> >
> > And it is also nicely integrated with 'PR from forks' workflow.
> >
> > It would take a major engineering effort to rewrite to Jenkins (if at all
> > possible).
>
> That would all be possible in Jenkins, some of it would be fairly
> simple to integrate, others would indeed be non-trivial rewrites.
>
> > And even if we do so - we have no certainty whatsoever that the number of
> > our builds will be ok with the capacity.
> > I was involved in Apache Beam CI infra project where Beam has like 16
> > dedicated workers just for them.
>
> You can have your own Jenkins controller for your PMC. This is vastly
> simpler for you to administer than a super time-shared environment
> like GA. CI systems seem to be a dime a dozen nowadays, yet not a
> single one seems to implement job scheduling in a sufficiently
> customizable way that scales.
>
> > Even if we discount the months of engineering effort needed, I have no idea
> > if shared jenkins is enough for us. we run ~50 jobs, some of them 25
> > minute  or more - for each full build (thanks to selective checks we have a
> > lot of builds with shorter/smaller number of  jobs). It would take a LOT of
> > time to switch.
>
> Definitely valid points. Any CI migrations are non-trivial, especially
> once you've set up nice workflows. Perhaps there are some alternatives
> that can help bridge the gap if GA still can't meet your needs. I've
> seen prow [1] used in various projects in the Kubernetes communities,
> and I'm sure there must be plenty of others.

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Matt Sicker <bo...@gmail.com>.
On Fri, 8 Jan 2021 at 16:04, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Github Actions, GitLab,  TravisCI. even Cloud Build are soooo much easier
> to work with and get your stuff done.

The multiple threads about how shitty those are in practice for your
needs seem to indicate otherwise. Security and easy learning curves
don't seem to get along too well, do they?

> Repos is great. Rewriting all the logic we implemented in our actions (
> https://github.com/apache/airflow/blob/master/CI.rst) where we use not only
> the "tasks" but also:
> * the image registry
> * updating Github issues
> * adding labels
> * PR status integration
> * pushing changes to dedicated branches in our repo
> * cancelling duplicated workflows (of 3 kinds)
> * selective builds based on what files changed, etc.
>
> And it is also nicely integrated with 'PR from forks' workflow.
>
> It would take a major engineering effort to rewrite to Jenkins (if at all
> possible).

That would all be possible in Jenkins, some of it would be fairly
simple to integrate, others would indeed be non-trivial rewrites.

> And even if we do so - we have no certainty whatsoever that the number of
> our builds will be ok with the capacity.
> I was involved in Apache Beam CI infra project where Beam has like 16
> dedicated workers just for them.

You can have your own Jenkins controller for your PMC. This is vastly
simpler for you to administer than a super time-shared environment
like GA. CI systems seem to be a dime a dozen nowadays, yet not a
single one seems to implement job scheduling in a sufficiently
customizable way that scales.

> Even if we discount the months of engineering effort needed, I have no idea
> if shared jenkins is enough for us. we run ~50 jobs, some of them 25
> minute  or more - for each full build (thanks to selective checks we have a
> lot of builds with shorter/smaller number of  jobs). It would take a LOT of
> time to switch.

Definitely valid points. Any CI migrations are non-trivial, especially
once you've set up nice workflows. Perhaps there are some alternatives
that can help bridge the gap if GA still can't meet your needs. I've
seen prow [1] used in various projects in the Kubernetes communities,
and I'm sure there must be plenty of others.

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Jarek Potiuk <ja...@potiuk.com>.
I used to love Gradle and Jenkins and started to hate them. Once you move
to the Python/JavaScript world, suddenly stuff that takes you days in
Java/Gradle takes hours in Python/JavaScript. And it is amplified in the
case of CI, where you do not need an enterprise-grade system but a bunch of
scripts that work together. I have years of experience in C/C++/Java/Gradle
(including building custom Gradle plugins for mobile app development which
we open-sourced:
https://mvnrepository.com/artifact/com.apphance.ameba/Ameba/0.99.4 more
than 5 years ago).

And my point is - Jenkins sucks, even if I used to love it (I was literally
cheering when Hudson became Jenkins, following Kohsuke Kawaguchi's lead,
after Oracle bought Sun). That was YEARS ago. And pipelines are OK, but
they are an afterthought.

GitHub Actions, GitLab, TravisCI, even Cloud Build are soooo much easier
to work with and get your stuff done. Especially the GA integration with
GitHub repos is great. We would have to rewrite all the logic we
implemented in our actions
(https://github.com/apache/airflow/blob/master/CI.rst), where we use not
only the "tasks" but also:
* the image registry
* updating Github issues
* adding labels
* PR status integration
* pushing changes to dedicated branches in our repo
* cancelling duplicated workflows (of 3 kinds)
* selective builds based on what files changed, etc.

And it is also nicely integrated with 'PR from forks' workflow.

It would take a major engineering effort to rewrite to Jenkins (if at all
possible).
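
To illustrate the "selective builds based on what files changed" item in
the list above, here is a minimal Python sketch. The path patterns and job
names are hypothetical and are not Airflow's actual rules; the sketch only
shows the idea of mapping changed files to the subset of CI jobs worth
running.

```python
from fnmatch import fnmatch

# Hypothetical mapping of path patterns to CI job groups (illustration only,
# not the real Airflow configuration).
RULES = {
    "docs/*": {"build-docs"},
    "airflow/www/*": {"ui-tests", "core-tests"},
    "airflow/*": {"core-tests"},
    "tests/*": {"core-tests"},
}

# Jobs that run for every change, e.g. cheap static checks.
ALWAYS = {"static-checks"}


def select_jobs(changed_files):
    """Return the set of job groups needed for the given changed files."""
    jobs = set(ALWAYS)
    for path in changed_files:
        matched = False
        for pattern, group in RULES.items():
            if fnmatch(path, pattern):
                jobs |= group
                matched = True
        if not matched:
            # Unknown file: be conservative and run every job group.
            for group in RULES.values():
                jobs |= group
    return jobs
```

With rules like these, a docs-only change would trigger only the static
checks and the docs build, while a change to a file no rule knows about
falls back to the full set of jobs.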

And even if we do so - we have no certainty whatsoever that the number of
our builds will be ok with the capacity.
I was involved in Apache Beam CI infra project where Beam has like 16
dedicated workers just for them.

Even if we discount the months of engineering effort needed, I have no idea
if shared Jenkins is enough for us. We run ~50 jobs, some of them 25
minutes or more, for each full build (thanks to selective checks we have a
lot of builds with shorter jobs or a smaller number of them). It would take
a LOT of time to switch.

J.


On Fri, Jan 8, 2021 at 10:30 PM Antoine Pitrou <an...@python.org> wrote:

>
> Le 08/01/2021 à 22:29, P. Ottlinger a écrit :
> > Hi Antoine,
> >
> > Am 08.01.21 um 22:17 schrieb Antoine Pitrou:
> >>> What are the gaps in the ASF CI systems that are pushing people onto
> >> less viable platforms such as GA?
> >>
> >> While being a PMC and core developer for Apache Arrow, I'm going to
> >> give a personal opinion here:
> >>
> >> - Jenkins I think many people have had bad experiences in the past with
> >>   (I remember trying to use Jenkins and the obnoxious Web-based
> >>   configuration threw me off)
> >
> > I do not want to start a toolWar-thread, but Jenkins provides means to
> > have your config versioned easily:
> >
> > just change to declarative Jenkinsfiles that are checked in with your
> > projects.
> >
> > A very simple example of this can be found here:
> > https://github.com/apache/creadur-rat/blob/master/Jenkinsfile
> >
> > AFAIK it supports matrix builds as well (have a look at
> > https://github.com/apache/plc4x for a deeper integration).
> >
> > Just my 2ct.
>
> Thanks, I didn't know that existed.
>
> Regards
>
> Antoine.
>


-- 
+48 660 796 129

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Antoine Pitrou <an...@python.org>.
Le 08/01/2021 à 22:29, P. Ottlinger a écrit :
> Hi Antoine,
> 
> Am 08.01.21 um 22:17 schrieb Antoine Pitrou:
>>> What are the gaps in the ASF CI systems that are pushing people onto
>> less viable platforms such as GA?
>>
>> While being a PMC and core developer for Apache Arrow, I'm going to
>> give a personal opinion here:
>>
>> - Jenkins I think many people have had bad experiences in the past with
>>   (I remember trying to use Jenkins and the obnoxious Web-based
>>   configuration threw me off)
> 
> I do not want to start a toolWar-thread, but Jenkins provides means to
> have your config versioned easily:
> 
> just change to declarative Jenkinsfiles that are checked in with your
> projects.
> 
> A very simple example of this can be found here:
> https://github.com/apache/creadur-rat/blob/master/Jenkinsfile
> 
> AFAIK it supports matrix builds as well (have a look at
> https://github.com/apache/plc4x for a deeper integration).
> 
> Just my 2ct.

Thanks, I didn't know that existed.

Regards

Antoine.

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by "P. Ottlinger" <po...@apache.org>.
Hi Antoine,


Am 08.01.21 um 22:17 schrieb Antoine Pitrou:
>> What are the gaps in the ASF CI systems that are pushing people onto
> less viable platforms such as GA?
>
> While being a PMC and core developer for Apache Arrow, I'm going to
> give a personal opinion here:
>
> - Jenkins I think many people have had bad experiences in the past with
>   (I remember trying to use Jenkins and the obnoxious Web-based
>   configuration threw me off)

I do not want to start a toolWar-thread, but Jenkins provides means to
have your config versioned easily:

just change to declarative Jenkinsfiles that are checked in with your
projects.

A very simple example of this can be found here:
https://github.com/apache/creadur-rat/blob/master/Jenkinsfile

AFAIK it supports matrix builds as well (have a look at
https://github.com/apache/plc4x for a deeper integration).

Just my 2ct.

At work we had the same discussion as people wanted to use the Gitlab
equivalent of GA .... just because it looked better.

Personally I never used buildbot as I considered its config rather
obscure, but you seem to have the same feeling towards Jenkins.

Cheers & happy weekend,
Phil

Re: ASF Jenkins usability [Was: Re: GA again unreasonably slow (again)]

Posted by Antoine Pitrou <an...@python.org>.
Hi,

On Fri, 8 Jan 2021 12:49:03 -0800
Chris Lambertus <cm...@public.gmane.org> wrote:
>
> Have you considered the internal and fully supported ASF Jenkins
> and/or Buildbot infrastructure? Infra has little control over the free
> open source offerings, but we have significantly more resources we can
> bring to bear on our own CI systems.
>
> What are the gaps in the ASF CI systems that are pushing people onto
> less viable platforms such as GA?

While I am a PMC member and core developer for Apache Arrow, I'm going to
give a personal opinion here:

- Jenkins: I think many people have had bad experiences with it in the past
  (I remember trying to use Jenkins and the obnoxious Web-based
  configuration threw me off)

- at Apache Arrow, we once used our own Buildbot instance, but there
  are problems with it. The main problem is that you can't edit the
  configuration on the fly to add / change builds: especially, you
  can't do so on a PR and have that validated before merging. That's
  because any build configuration change requires restarting the
  buildmaster.

What platforms like GitHub Actions, Travis-CI, etc. give you (apart
from free execution resources) is that the CI configuration is
versioned alongside your source code. This is a very valuable model,
especially when you're in a fast-changing stage.

Regards

Antoine.

Re: GA again unreasonably slow (again)

Posted by Jarek Potiuk <ja...@potiuk.com>.
> We should be able to make an efficient query via GraphQL API right? I found
> the REST API for actions to be a little underwhelming.


That was the first thing I checked when we started looking at the stats.
Unfortunately, the last time I checked (and I even opened an issue about
it with GitHub support), there was no GraphQL API for GitHub Actions.

I got an answer from GH support: "Yeah we know GH API does not have
GraphQL support yet, sorry". I think that has not changed since.


> We have tried to make our builds faster with more caching, but it's not
> easy: since it's an embedded systems project, we need to target a lot of
> configurations, and most changes impact all builds.
>

Indeed, I know how much of my time was spent on optimising Airflow's GH
usage. I think we eventually decreased the usage 10x or more. But it never
helped, and it will not for as long as anyone can, even accidentally, block
all the slots in almost no time at all. We have no organisation-wide way to
prevent this, and that is the problem.

Right now I could:
a) mine cryptocurrency using PRs to any Apache project
b) block the queue for everyone

I do not even have to be an Apache committer to do that. It is enough to
open one well-crafted PR that spins off 180 jobs that run for 6 hours.
It's super-flawed.
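
As a rough sketch of the arithmetic behind that scenario (the function name is hypothetical, and it assumes every one of the 180 jobs runs for the full 6 hours):

```python
def runner_hours(jobs: int, hours_per_job: float) -> float:
    """Total runner time consumed if every job runs for its full duration.

    Hypothetical helper for illustration; the figures used below are the
    ones mentioned in the message (180 jobs, 6 hours).
    """
    return jobs * hours_per_job


print(runner_hours(180, 6))  # 1080
```

Under that assumption a single PR would tie up on the order of a thousand runner-hours of shared capacity.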


>
> We too would like to take advantage of our own runners, but more for
> the ability to do Hardware In the Loop testing; we have avoided it
> for the reasons already mentioned.
>

For now, self-hosted runners seem to be the only reasonable option, but the
security issues with the current runner do not allow us to use them.

>
> --Brennan
>


-- 
+48 660 796 129

Re: GA again unreasonably slow (again)

Posted by Brennan Ashton <ba...@brennanashton.com>.
On Fri, Jan 8, 2021, 12:08 PM Jarek Potiuk <Ja...@polidea.com> wrote:

>
> There is one problem with the charts: they are flawed. They show
> 'workflows', not 'jobs', and one workflow might mean many jobs :(. For
> example, the big number of workflows you can see in Airflow yesterday comes
> from "Label when reviewed" workflows, each of which has 1 job that
> takes 10 seconds or so. One workflow can be 20-30 times more important
> than another.
>
> We cannot easily drill down to jobs, because we are using the GitHub API to
> get the information, but there are limits (max number of requests/hr) and
> we are already close to hitting them with the current setup.
>
> Going to the jobs level would mean 20x more API requests. This is the 2nd
> thing where the INFRA <> GitHub relationship could help: I believe there
> was an option for GitHub to provide better and more reliable stats to
> analyse.
>
> J.
>

We should be able to make an efficient query via GraphQL API right? I found
the REST API for actions to be a little underwhelming.

I was actually a little surprised not to see us (NuttX) on the offender
list. I did open a ticket for us to look into how we can play better using
the auto-cancel action that Airflow is using; that would make our own lives
easier.

We have tried to make our builds faster with more caching, but it's not
easy: since it's an embedded systems project, we need to target a lot of
configurations, and most changes impact all builds.

We too would like to take advantage of our own runners, but more for
the ability to do Hardware In the Loop testing; we have avoided it
for the reasons already mentioned.

--Brennan

Re: GA again unreasonably slow (again)

Posted by Vladimir Sitnikov <si...@gmail.com>.
Jarek>But let's see, maybe it will work !

That is exactly my feeling. I guess they have already optimized the build,
however, let's see.

At least, 24 Spark jobs (22 of them are "master") are queued:
https://github.com/apache/spark/actions?query=is%3Aqueued
If they commit too fast, they might do better with a "scheduled"
master build rather than an "on commit" one (or they could use the
cancel-workflow action).
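
The idea behind cancelling superseded runs can be sketched as follows. This is only an illustration of the selection logic, not how any particular cancel-workflow action is implemented; the `Run` type, its fields and the sample data are hypothetical, and no GitHub API is called.

```python
from typing import List, NamedTuple


class Run(NamedTuple):
    run_id: int
    branch: str
    # ISO 8601 timestamps in one format and timezone sort chronologically
    # when compared as plain strings.
    created_at: str


def superseded(runs: List[Run]) -> List[int]:
    """Return the ids of runs that have a newer run on the same branch."""
    newest = {}
    for run in runs:
        current = newest.get(run.branch)
        if current is None or run.created_at > current.created_at:
            newest[run.branch] = run
    keep = {run.run_id for run in newest.values()}
    return [run.run_id for run in runs if run.run_id not in keep]


queued = [
    Run(1, "master", "2021-01-08T10:00:00Z"),
    Run(2, "master", "2021-01-08T11:00:00Z"),
    Run(3, "feature", "2021-01-08T10:30:00Z"),
]
print(superseded(queued))  # [1]
```

Only the newest run per branch is kept, so a project that commits to master faster than its builds finish would hold one slot per branch instead of one per commit.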

Jarek>One workflow can be 20/30 times more important than another.

:((
Just to double-check: are you sure you show "in progress" rather than
"queued" workflows?


In any case, thanks for the charts! You've done a majestic web-scraping :)

Vladimir

Re: GA again unreasonably slow (again)

Posted by Jarek Potiuk <Ja...@polidea.com>.
Thanks for that!

I think this is not a "permanent" solution and the data is a bit flawed :( .

I do not think it is the fault of Pulsar/Spark per se. I think it is very
hard to ask them to apply any limits; even if we do it now, this might
go ballistic again tomorrow. And I think it is very unreasonable to ask
any project to decrease its load if it does not even have the tools to
verify that.

But let's see, maybe it will work !

There is one problem with the charts: they are flawed. They show
'workflows', not 'jobs', and one workflow might mean many jobs :(. For
example, the big number of workflows you can see in Airflow yesterday comes
from "Label when reviewed" workflows, each of which has 1 job that
takes 10 seconds or so. One workflow can be 20-30 times more important
than another.
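
To illustrate why counting workflows can mislead, here is a minimal sketch that weights each workflow by its jobs and their duration. The "Label when reviewed" figures (1 job, about 10 seconds) come from the paragraph above; the second workflow and its numbers are made up purely for contrast.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    name: str
    jobs: int
    minutes_per_job: float

    def job_minutes(self) -> float:
        """Runner time used by the workflow, assuming equal-length jobs."""
        return self.jobs * self.minutes_per_job


workflows = [
    Workflow("Label when reviewed", jobs=1, minutes_per_job=10 / 60),
    # Hypothetical heavy workflow, not taken from any real project.
    Workflow("Hypothetical CI build", jobs=30, minutes_per_job=20.0),
]

# Both entries count as "one workflow" on a chart, but their cost differs.
for w in sorted(workflows, key=Workflow.job_minutes, reverse=True):
    print(f"{w.name}: {w.job_minutes():.1f} job-minutes")
```

A chart built on job-minutes rather than workflow counts would need per-job data, which is exactly what is hard to obtain.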

We cannot easily drill down to jobs, because we are using the GitHub API to
get the information, but there are limits (max number of requests/hr) and
we are already close to hitting them with the current setup.

Going to the jobs level would mean 20x more API requests. This is the 2nd
thing where the INFRA <> GitHub relationship could help: I believe there
was an option for GitHub to provide better and more reliable stats to
analyse.
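
A rough sketch of why job-level data is more expensive to collect, assuming runs are listed in pages and the jobs of each run need one extra request per run (and fit in a single page). The function names are hypothetical, and the actual multiplier depends on the page size and on how many runs are inspected, so this does not reproduce the 20x estimate exactly.

```python
import math


def requests_for_runs(num_runs: int, per_page: int = 100) -> int:
    """Requests needed to page through a list of workflow runs."""
    return math.ceil(num_runs / per_page)


def requests_for_jobs(num_runs: int, per_page: int = 100) -> int:
    """Paging through the runs, plus one request per run to list its jobs."""
    return requests_for_runs(num_runs, per_page) + num_runs


runs = 1000  # illustrative figure, not taken from the real charts
print(requests_for_runs(runs), requests_for_jobs(runs))  # 10 1010
```

With an hourly request cap, the per-run term quickly dominates the budget.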

J.

On Fri, Jan 8, 2021 at 8:51 PM Vladimir Sitnikov <
sitnikov.vladimir@gmail.com> wrote:

> Jarek>workflows in progress/queued per project and they clearly show the
> Jarek> situation is getting worse by the day
>
> The chart suggests that Pulsar, Spark and Airflow are the top contributors
> to the queue.
> I filed issues to Pulsar ( https://github.com/apache/pulsar/issues/9154 )
> and Spark ( https://issues.apache.org/jira/browse/SPARK-34053 )
> Hope they can do something to reduce the build time and the number of
> queued jobs.
>
> Vladimir
>


-- 

Jarek Potiuk
Polidea <https://www.polidea.com/> | Principal Software Engineer

M: +48 660 796 129 <+48660796129>

Re: GA again unreasonably slow (again)

Posted by Vladimir Sitnikov <si...@gmail.com>.
Jarek>workflows in progress/queued per project and they clearly show the
Jarek> situation is getting worse by the day

The chart suggests that Pulsar, Spark and Airflow are the top contributors
to the queue.
I filed issues to Pulsar ( https://github.com/apache/pulsar/issues/9154 )
and Spark ( https://issues.apache.org/jira/browse/SPARK-34053 )
Hope they can do something to reduce the build time and the number of
queued jobs.

Vladimir