Posted to dev@pulsar.apache.org by Lari Hotari <lh...@apache.org> on 2022/08/26 12:00:20 UTC

Pulsar CI congested, master branch build broken

Hi,

GitHub Actions builds have been piling up in the build queue in the last few days.
I posted to builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created an INFRA ticket, https://issues.apache.org/jira/browse/INFRA-23633, about this issue.
There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 

It seems that our build queue is finally getting picked up, but it would be good to confirm whether we hit a quota and whether that is the cause of the pauses. 

Another issue is that the master branch broke after two conflicting PRs were merged. 
The fix is in https://github.com/apache/pulsar/pull/17300 . 

Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.

I'd like to point out that a good way to get build feedback before sending a PR is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota, and builds usually start instantly.
There are instructions in the contributors guide about this. 
https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
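
If you prefer to script this step, below is a minimal sketch (Node 18+ / TypeScript, using the GitHub REST API) of opening such a PR against your own fork; the user name, branch name, and token are placeholders, and the GitHub web UI or the gh CLI works just as well. Remember to push the branch to your fork first.

// Minimal sketch: open a PR inside your *own* fork of apache/pulsar so that the
// workflows run on your personal GitHub Actions quota, not the shared apache org.
// "your-github-user", the "my-feature" branch, and GITHUB_TOKEN are placeholders.
const user = "your-github-user";
const token = process.env.GITHUB_TOKEN!; // personal access token with "repo" scope

async function openPersonalCiPr(): Promise<void> {
  // head and base both live in your fork, so apache/pulsar CI is not triggered
  const res = await fetch(`https://api.github.com/repos/${user}/pulsar/pulls`, {
    method: "POST",
    headers: {
      Authorization: `Bearer ${token}`,
      Accept: "application/vnd.github+json",
    },
    body: JSON.stringify({
      title: "Personal CI run: my-feature",
      head: "my-feature",
      base: "master",
    }),
  });
  if (!res.ok) throw new Error(`GitHub API error: ${res.status} ${await res.text()}`);
  const pr = await res.json();
  console.log(`Opened ${pr.html_url}; builds will run in your fork.`);
}

openPersonalCiPr().catch(console.error);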

BR,

Lari








Re: Pulsar CI congested, master branch build broken

Posted by Max Xu <ma...@gmail.com>.
Hi, Lari

Thanks for bringing this to our attention!

I was wondering if we could consider using self-hosted runners, since there
are currently more than 2k projects sharing a single apache org.


Best,
Max Xu


On Fri, Aug 26, 2022 at 8:00 PM Lari Hotari <lh...@apache.org> wrote:

> Hi,
>
> GitHub Actions builds have been piling up in the build queue in the last
> few days.
> I posted on builds@apache.org
> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
> about this issue.
> There's also a thread on the-asf slack,
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
>
> It seems that our build queue is finally getting picked up, but it would
> be great to see if we hit quota and whether that is the cause of pauses.
>
> Another issue is that the master branch broke after merging 2 conflicting
> PRs.
> The fix is in https://github.com/apache/pulsar/pull/17300 .
>
> Merging PRs will be slow until we have these 2 problems solved and
> existing PRs rebased over the changes. Let's prioritize merging #17300
> before pushing more changes.
>
> I'd like to point out that a good way to get build feedback before sending
> a PR, is to run builds on your personal GitHub Actions CI. The benefit of
> this is that it doesn't consume the shared quota and builds usually start
> instantly.
> There are instructions in the contributors guide about this.
> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> You simply open PRs to your own fork of apache/pulsar to run builds on
> your personal GitHub Actions CI.
>
> BR,
>
> Lari
>
>
>
>
>
>
>
>

Re: Pulsar CI congested, master branch build broken

Posted by Max Xu <ma...@gmail.com>.
And +1 for "Provide information about GitHub Actions usage for the apache
organization", which would be greatly helpful by making this information
transparent.

Best,
Max Xu


On Fri, Aug 26, 2022 at 8:00 PM Lari Hotari <lh...@apache.org> wrote:

> Hi,
>
> GitHub Actions builds have been piling up in the build queue in the last
> few days.
> I posted on builds@apache.org
> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
> about this issue.
> There's also a thread on the-asf slack,
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
>
> It seems that our build queue is finally getting picked up, but it would
> be great to see if we hit quota and whether that is the cause of pauses.
>
> Another issue is that the master branch broke after merging 2 conflicting
> PRs.
> The fix is in https://github.com/apache/pulsar/pull/17300 .
>
> Merging PRs will be slow until we have these 2 problems solved and
> existing PRs rebased over the changes. Let's prioritize merging #17300
> before pushing more changes.
>
> I'd like to point out that a good way to get build feedback before sending
> a PR, is to run builds on your personal GitHub Actions CI. The benefit of
> this is that it doesn't consume the shared quota and builds usually start
> instantly.
> There are instructions in the contributors guide about this.
> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> You simply open PRs to your own fork of apache/pulsar to run builds on
> your personal GitHub Actions CI.
>
> BR,
>
> Lari
>
>
>
>
>
>
>
>

Re: Pulsar CI congested, master branch build broken

Posted by Nicolò Boschi <bo...@gmail.com>.
As you may have noticed, the CI is slow again.
There are more than 140 workflows pending:
https://github.com/apache/pulsar/actions?query=is%3Aqueued
There are only 2-3 workflows in progress:
https://github.com/apache/pulsar/actions?query=is%3Ain_progress
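
If you want to check those numbers from the command line instead of the UI, here is a rough sketch (Node 18+ / TypeScript, public GitHub REST API; unauthenticated requests work but are rate-limited) that prints the queued and in-progress run counts:

// Rough sketch: count queued vs. in-progress workflow runs for apache/pulsar
// using the public GitHub REST API.
async function countRuns(status: "queued" | "in_progress"): Promise<number> {
  const res = await fetch(
    `https://api.github.com/repos/apache/pulsar/actions/runs?status=${status}&per_page=1`,
    { headers: { Accept: "application/vnd.github+json" } },
  );
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  const data = await res.json();
  return data.total_count as number; // total_count covers all matching runs
}

async function main(): Promise<void> {
  console.log(`queued:      ${await countRuns("queued")}`);
  console.log(`in_progress: ${await countRuns("in_progress")}`);
}

main().catch(console.error);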

Lari and I believe that we're still being penalized by the algorithm that
allocates resources across ASF projects.
We're still working on further optimizations to reduce the number of requested
runners and the overall resource usage,
but if the algorithm keeps allowing Pulsar at most 2-3 runners at the same
time, we'll never get out of this situation.
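
To see how much usage GitHub attributes to a single workflow run (the figure the suspected fairness algorithm presumably looks at), a similar sketch can query the run timing endpoint; the run id below is the one from the usage link Lari quotes further down in this thread, and the billable fields may well be empty for a public repository:

// Rough sketch: fetch the usage GitHub records for one workflow run.
// 3003787409 is the example run referenced later in this thread; substitute any run id.
async function printRunUsage(runId: number): Promise<void> {
  const res = await fetch(
    `https://api.github.com/repos/apache/pulsar/actions/runs/${runId}/timing`,
    { headers: { Accept: "application/vnd.github+json" } },
  );
  if (!res.ok) throw new Error(`GitHub API error: ${res.status}`);
  const timing = await res.json();
  // run_duration_ms is wall-clock time; billable.* is the attributed runner time
  console.log(`run_duration_ms: ${timing.run_duration_ms}`);
  console.log(`billable:        ${JSON.stringify(timing.billable)}`);
}

printRunUsage(3003787409).catch(console.error);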

We're waiting for updates on the GitHub support issue that Lari reported.

Nicolò Boschi


Il giorno ven 9 set 2022 alle ore 04:09 Michael Marshall <
mmarshall@apache.org> ha scritto:

> Fantastic, thank you Lari and Nicolò!
>
> - Michael
>
> On Thu, Sep 8, 2022 at 9:03 PM Haiting Jiang <ji...@gmail.com>
> wrote:
> >
> > Great work. Thank you, Lari and Nicolò.
> >
> > BR,
> > Haiting
> >
> > On Fri, Sep 9, 2022 at 9:36 AM tison <wa...@gmail.com> wrote:
> > >
> > > Thank you, Lari and Nicolò!
> > > Best,
> > > tison.
> > >
> > >
> > > Nicolò Boschi <bo...@gmail.com> 于2022年9月9日周五 02:41写道:
> > >
> > > > Dear community,
> > > >
> > > > The plan has been executed.
> > > > The summary of our actions is:
> > > > 1. We cancelled all pending jobs (queue and in-progress)
> > > > 2. We removed the required checks to be able to merge improvements
> on the
> > > > CI workflow
> > > > 3. We merged a couple of improvements:
> > > >    1. workarounded the possible bug triggered by jobs retries. Now
> > > > broker flaky tests are in a dedicated workflow
> > > >    2. moved known flaky tests to the flaky suite
> > > >    3. optimized the runner consumption for docs-only and cpp-only
> pulls
> > > > 4. We reactivated the required checks.
> > > >
> > > >
> > > > Now it's possible to come back to normal life.
> > > > 1. You must rebase your branch to the latest master (there's a
> button for
> > > > you in the UI) or eventually you can close/reopen the pull to
> trigger the
> > > > checks
> > > > 2. You can merge a pull request if you want
> > > > 3. You will find a new job in the Checks section called "Pulsar CI /
> Pulsar
> > > > CI checks completed" that indicates the Pulsar CI successfully passed
> > > >
> > > > There's a slight chance that the CI will be stuck again in the next
> few
> > > > days but we will take it monitored.
> > > >
> > > > Thanks Lari for the nice work!
> > > >
> > > > Regards,
> > > > Nicolò Boschi
> > > >
> > > >
> > > > Il giorno gio 8 set 2022 alle ore 10:55 Lari Hotari <
> lhotari@apache.org>
> > > > ha
> > > > scritto:
> > > >
> > > > > Thank you Nicolo.
> > > > > There's lazy consensus, let's go forward with the action plan.
> > > > >
> > > > > -Lari
> > > > >
> > > > > On 2022/09/08 08:16:05 Nicolò Boschi wrote:
> > > > > > This is the pull for step 2.
> > > > https://github.com/apache/pulsar/pull/17539
> > > > > >
> > > > > > This is the script I'm going to use to cancel pending workflows.
> > > > > >
> > > > >
> > > >
> https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js
> > > > > >
> > > > > > I'm going to run the script in minutes.
> > > > > >
> > > > > > I advertised on Slack about what is happening:
> > > > > >
> > > > >
> > > >
> https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E
> > > > > >
> > > > > > >we’re going to execute the plan described in the ML. So any
> queued
> > > > > actions
> > > > > > will be cancelled. In order to validate your pull it is
> suggested to
> > > > run
> > > > > > the actions in your own Pulsar fork. Please don’t re-run failed
> jobs or
> > > > > > push any other commits to avoid triggering new actions
> > > > > >
> > > > > >
> > > > > > Nicolò Boschi
> > > > > >
> > > > > >
> > > > > > Il giorno gio 8 set 2022 alle ore 09:42 Nicolò Boschi <
> > > > > boschi1997@gmail.com>
> > > > > > ha scritto:
> > > > > >
> > > > > > > Thanks Lari for the detailed explanation. This is kind of an
> > > > emergency
> > > > > > > situation and I believe your plan is the way to go now.
> > > > > > >
> > > > > > > I already prepared a pull for moving the flaky suite out of the
> > > > Pulsar
> > > > > CI
> > > > > > > workflow: https://github.com/nicoloboschi/pulsar/pull/8
> > > > > > > I can take care of the execution of the plan.
> > > > > > >
> > > > > > > > 1. Cancel all existing builds in_progress or queued
> > > > > > >
> > > > > > > I have a script locally that uses GHA to check and cancel the
> pending
> > > > > > > runs. We can extend it to all the queued builds (will share it
> soon).
> > > > > > >
> > > > > > > > 2. Edit .asf.yaml and drop the "required checks" requirement
> for
> > > > > merging
> > > > > > > PRs.
> > > > > > > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > > >
> > > > > > > After the pull is out, we'll need to cancel all other
> workflows that
> > > > > > > contributors may inadvertently have triggered.
> > > > > > >
> > > > > > > > 4. Disable all workflows
> > > > > > > > 5. Process specific PRs manually to improve the situation.
> > > > > > > >    - Make GHA workflow improvements such as
> > > > > > > https://github.com/apache/pulsar/pull/17491 and
> > > > > > > https://github.com/apache/pulsar/pull/17490
> > > > > > > >    - Quarantine all very flaky tests so that everyone
> doesn't waste
> > > > > time
> > > > > > > with those. It should be possible to merge a PR even when a
> > > > quarantined
> > > > > > > test fails.
> > > > > > >
> > > > > > > in this step we will merge this
> > > > > > > https://github.com/nicoloboschi/pulsar/pull/8
> > > > > > >
> > > > > > > I want to add to the list this improvement to reduce runners
> usage in
> > > > > case
> > > > > > > of doc or cpp changes.
> > > > > > > https://github.com/nicoloboschi/pulsar/pull/7
> > > > > > >
> > > > > > >
> > > > > > > > 6. Rebase PRs (or close and re-open) that would be processed
> next
> > > > so
> > > > > > > that changes are picked up
> > > > > > >
> > > > > > > It's better to leave this task to the author of the pull in
> order to
> > > > > not
> > > > > > > create too much load at the same time
> > > > > > >
> > > > > > > > 7. Enable workflows
> > > > > > > > 8. Start processing PRs with checks to see if things are
> handled
> > > > in a
> > > > > > > better way.
> > > > > > > > 9. When things are stable, enable required checks again in
> > > > > .asf.yaml, in
> > > > > > > the meantime be careful about merging PRs
> > > > > > > > 10. Fix quarantined flaky tests
> > > > > > >
> > > > > > >
> > > > > > > Nicolò Boschi
> > > > > > >
> > > > > > >
> > > > > > > Il giorno gio 8 set 2022 alle ore 09:27 Lari Hotari <
> > > > > lhotari@apache.org>
> > > > > > > ha scritto:
> > > > > > >
> > > > > > >> If my assumption of the GitHub usage metrics bug in the GitHub
> > > > Actions
> > > > > > >> build job queue fairness algorithm is correct, what would
> help is
> > > > > running
> > > > > > >> the flaky unit test group outside of Pulsar CI workflow. In
> that
> > > > > case, the
> > > > > > >> impact of the usage metrics would be limited.
> > > > > > >>
> > > > > > >> The example of
> > > > > > >>
> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > shows
> > > > > > >> this flaw as explained in the previous email. The total
> reported
> > > > > execution
> > > > > > >> time in that report is 1d 1h 40m 21s of usage and the actual
> usage
> > > > is
> > > > > about
> > > > > > >> 1/3 of this.
> > > > > > >>
> > > > > > >> When we move the most commonly failing job out of Pulsar CI
> > > > workflow,
> > > > > the
> > > > > > >> impact of the possible usage metrics bug would be much less.
> I hope
> > > > > GitHub
> > > > > > >> support responds to my issue and queries about this bug. It
> might
> > > > > take up
> > > > > > >> to 7 days to get a reply and for technical questions more
> time. In
> > > > the
> > > > > > >> meantime we need a solution for getting over this CI slowness
> issue.
> > > > > > >>
> > > > > > >> -Lari
> > > > > > >>
> > > > > > >>
> > > > > > >>
> > > > > > >> On 2022/09/08 06:34:42 Lari Hotari wrote:
> > > > > > >> > My current assumption of the CI slowness problem is that
> the usage
> > > > > > >> metrics for Apache Pulsar builds on GitHub side is done
> incorrectly
> > > > > and
> > > > > > >> that is resulting in apache/pulsar builds getting throttled.
> This
> > > > > > >> assumption might be wrong, but it's the best guess at the
> moment.
> > > > > > >> >
> > > > > > >> > The facts that support this assumption is that when
> re-running
> > > > > failed
> > > > > > >> jobs in a workflow, the execution times for previously
> successful
> > > > > jobs get
> > > > > > >> counted as if they have all run:
> > > > > > >> > Here's an example:
> > > > > > >>
> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > > >> > The reported total usage is about 3x than the actual usage.
> > > > > > >> >
> > > > > > >> > The assumption that I have is that the "fairness algorithm"
> that
> > > > > GitHub
> > > > > > >> uses to provide all Apache projects about the same amount of
> GitHub
> > > > > Actions
> > > > > > >> resources would take this flawed usage as the basis of it's
> > > > decisions
> > > > > and
> > > > > > >> it decides to throttle apache/pulsar builds.
> > > > > > >> >
> > > > > > >> > The reason why we are getting hit by this now is that there
> is a
> > > > > high
> > > > > > >> number of flaky test failures that cause almost every build
> to fail
> > > > > and we
> > > > > > >> have been re-running a lot of builds.
> > > > > > >> >
> > > > > > >> > The other fact to support the theory of flawed usage
> metrics used
> > > > in
> > > > > > >> the fairness algorithm is that other Apache projects aren't
> > > > reporting
> > > > > > >> issues about GitHub Actions slowness. This is mentioned in
> Jarek
> > > > > Potiuk's
> > > > > > >> comments on INFRA-23633 [1]:
> > > > > > >> > > Unlike the case 2 years ago, the problem is not affecting
> all
> > > > > > >> projects. In Apache Airflow we do > not see any particular
> slow-down
> > > > > with
> > > > > > >> Public Runners at this moment (just checked - >
> > > > > > >> > > everything is "as usual").. So I'd say it is something
> specific
> > > > to
> > > > > > >> Pulsar not to "ASF" as a whole.
> > > > > > >> >
> > > > > > >> > There are also other comments from Jarek about the GitHub
> > > > "fairness
> > > > > > >> algorithm" (comment [2], other comment [3])
> > > > > > >> > > But I believe the current problem is different - it might
> be
> > > > > (looking
> > > > > > >> at your jobs) simply a bug
> > > > > > >> > > in GA that you hit or indeed your demands are simply too
> high.
> > > > > > >> >
> > > > > > >> > I have opened tickets (2 tickets: 2 days ago and yesterday)
> to
> > > > > > >> support.github.com and there hasn't been any response to the
> > > > ticket.
> > > > > It
> > > > > > >> might take up to 7 days to get a response. We cannot rely on
> GitHub
> > > > > Support
> > > > > > >> resolving this issue.
> > > > > > >> >
> > > > > > >> > I propose that we go ahead with the previously suggested
> action
> > > > plan
> > > > > > >> > > One possible way forward:
> > > > > > >> > > 1. Cancel all existing builds in_progress or queued
> > > > > > >> > > 2. Edit .asf.yaml and drop the "required checks"
> requirement for
> > > > > > >> merging PRs.
> > > > > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > > >> > > 4. Disable all workflows
> > > > > > >> > > 5. Process specific PRs manually to improve the situation.
> > > > > > >> > >    - Make GHA workflow improvements such as
> > > > > > >> https://github.com/apache/pulsar/pull/17491 and
> > > > > > >> https://github.com/apache/pulsar/pull/17490
> > > > > > >> > >    - Quarantine all very flaky tests so that everyone
> doesn't
> > > > > waste
> > > > > > >> time with those. It should be possible to merge a PR even
> when a
> > > > > > >> quarantined test fails.
> > > > > > >> > > 6. Rebase PRs (or close and re-open) that would be
> processed
> > > > next
> > > > > so
> > > > > > >> that changes are picked up
> > > > > > >> > > 7. Enable workflows
> > > > > > >> > > 8. Start processing PRs with checks to see if things are
> handled
> > > > > in a
> > > > > > >> better way.
> > > > > > >> > > 9. When things are stable, enable required checks again in
> > > > > .asf.yaml,
> > > > > > >> in the meantime be careful about merging PRs
> > > > > > >> > > 10. Fix quarantined flaky tests
> > > > > > >> >
> > > > > > >> > To clarify, steps 1-6 would be done optimally in 1 day and
> we
> > > > would
> > > > > > >> stop processing ordinary PRs during this time. We would only
> handle
> > > > > PRs
> > > > > > >> that fix the CI situation during this exceptional period.
> > > > > > >> >
> > > > > > >> > -Lari
> > > > > > >> >
> > > > > > >> > Links to Jarek's comments:
> > > > > > >> > [1]
> > > > > > >>
> > > > >
> > > >
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> > > > > > >> > [2]
> > > > > > >>
> > > > >
> > > >
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > > > >> > [3]
> > > > > > >>
> > > > >
> > > >
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > > > >> >
> > > > > > >> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> > > > > > >> > > One possible way forward:
> > > > > > >> > > 1. Cancel all existing builds in_progress or queued
> > > > > > >> > > 2. Edit .asf.yaml and drop the "required checks"
> requirement for
> > > > > > >> merging PRs.
> > > > > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > > >> > > 4. Disable all workflows
> > > > > > >> > > 5. Process specific PRs manually to improve the situation.
> > > > > > >> > >    - Make GHA workflow improvements such as
> > > > > > >> https://github.com/apache/pulsar/pull/17491 and
> > > > > > >> https://github.com/apache/pulsar/pull/17490
> > > > > > >> > >    - Quarantine all very flaky tests so that everyone
> doesn't
> > > > > waste
> > > > > > >> time with those. It should be possible to merge a PR even
> when a
> > > > > > >> quarantined test fails.
> > > > > > >> > > 6. Rebase PRs (or close and re-open) that would be
> processed
> > > > next
> > > > > so
> > > > > > >> that changes are picked up
> > > > > > >> > > 7. Enable workflows
> > > > > > >> > > 8. Start processing PRs with checks to see if things are
> handled
> > > > > in a
> > > > > > >> better way.
> > > > > > >> > > 9. When things are stable, enable required checks again in
> > > > > .asf.yaml,
> > > > > > >> in the meantime be careful about merging PRs
> > > > > > >> > > 10. Fix quarantined flaky tests
> > > > > > >> > >
> > > > > > >> > > -Lari
> > > > > > >> > >
> > > > > > >> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > > > > >> > > > The problem with CI is becoming worse. The build queue
> is 235
> > > > > jobs
> > > > > > >> now and the queue time is over 7 hours.
> > > > > > >> > > >
> > > > > > >> > > > We will need to start shedding load in the build queue
> and get
> > > > > some
> > > > > > >> fixes in.
> > > > > > >> > > > https://issues.apache.org/jira/browse/INFRA-23633
> continues
> > > > to
> > > > > > >> contain details about some activities. I have created 2 GitHub
> > > > Support
> > > > > > >> tickets, but usually it takes up to a week to get a response.
> > > > > > >> > > >
> > > > > > >> > > > I have some assumptions about the issue, but they are
> just
> > > > > > >> assumptions.
> > > > > > >> > > > One oddity is that when re-running failed jobs is used
> in a
> > > > > large
> > > > > > >> workflow, the execution times for previously successful jobs
> get
> > > > > counted as
> > > > > > >> if they have run.
> > > > > > >> > > > Here's an example:
> > > > > > >>
> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > > >> > > > The reported usage is about 3x than the actual usage.
> > > > > > >> > > > The assumption that I have is that the "fairness
> algorithm"
> > > > that
> > > > > > >> GitHub uses to provide all Apache projects about the same
> amount of
> > > > > GitHub
> > > > > > >> Actions resources would take this flawed usage as the basis
> of it's
> > > > > > >> decisions.
> > > > > > >> > > > The reason why we are getting hit by this now is that
> there
> > > > is a
> > > > > > >> high number of flaky test failures that cause almost every
> build to
> > > > > fail
> > > > > > >> and we are re-running a lot of builds.
> > > > > > >> > > >
> > > > > > >> > > > Another problem there is that the GitHub Actions search
> > > > doesn't
> > > > > > >> always show all workflow runs that are running. This has
> happened
> > > > > before
> > > > > > >> when the GitHub Actions workflow search index was corrupted.
> GitHub
> > > > > Support
> > > > > > >> resolved that by rebuilding the search index with some manual
> admin
> > > > > > >> operation behind the scenes.
> > > > > > >> > > >
> > > > > > >> > > > I'm proposing that we start shedding load from CI by
> > > > cancelling
> > > > > > >> build jobs and selecting which jobs to process so that we get
> the CI
> > > > > issue
> > > > > > >> resolved. We might also have to disable required checks so
> that we
> > > > > have
> > > > > > >> some way to get changes merged while CI doesn't work properly.
> > > > > > >> > > >
> > > > > > >> > > > I'm expecting lazy consensus on fixing CI unless someone
> > > > > proposes a
> > > > > > >> better plan. Let's keep everyone informed in this mailing list
> > > > thread.
> > > > > > >> > > >
> > > > > > >> > > > -Lari
> > > > > > >> > > >
> > > > > > >> > > >
> > > > > > >> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > > > >> > > > > We are going to need to take actions to fix our
> problems.
> > > > See
> > > > > > >>
> > > > >
> > > >
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > > > >> > > > >
> > > > > > >> > > > > Jarek has done a large amount of GitHub Action work
> with
> > > > > Apache
> > > > > > >> Airflow and his suggestions might be helpful. One of his
> suggestions
> > > > > was
> > > > > > >> Apache Yetus. I think he means using the Maven plugins -
> > > > > > >>
> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <
> > > > lhotari@apache.org
> > > > > >
> > > > > > >> wrote:
> > > > > > >> > > > > >
> > > > > > >> > > > > > The Apache Infra ticket is
> > > > > > >> https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > > > >> > > > > >
> > > > > > >> > > > > > -Lari
> > > > > > >> > > > > >
> > > > > > >> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > > > >> > > > > >> I asked for an update on the Apache org GitHub
> Actions
> > > > > usage
> > > > > > >> stats from Gavin McDonald on the-asf slack in this thread:
> > > > > > >>
> > > > >
> > > >
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > > > > > >> .
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> I hope we get this issue resolved since it delays
> PR
> > > > > > >> processing a lot.
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> -Lari
> > > > > > >> > > > > >>
> > > > > > >> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > > > >> > > > > >>> Pulsar CI continues to be congested, and the
> build queue
> > > > > [1]
> > > > > > >> is very long at the moment. There are 147 build jobs in the
> queue
> > > > and
> > > > > 16
> > > > > > >> jobs in progress at the moment.
> > > > > > >> > > > > >>>
> > > > > > >> > > > > >>> I would strongly advice everyone to use "personal
> CI" to
> > > > > > >> mitigate the issue of the long delay of CI feedback. You can
> simply
> > > > > open a
> > > > > > >> PR to your own personal fork of apache/pulsar to run the
> builds in
> > > > > your
> > > > > > >> "personal CI". There's more details in the previous emails in
> this
> > > > > thread.
> > > > > > >> > > > > >>>
> > > > > > >> > > > > >>> -Lari
> > > > > > >> > > > > >>>
> > > > > > >> > > > > >>> [1] - build queue:
> > > > > > >> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > > > >> > > > > >>>
> > > > > > >> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > > > >> > > > > >>>> Pulsar CI continues to be congested, and the
> build
> > > > queue
> > > > > is
> > > > > > >> long.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> I would strongly advice everyone to use
> "personal CI"
> > > > to
> > > > > > >> mitigate the issue of the long delay of CI feedback. You can
> simply
> > > > > open a
> > > > > > >> PR to your own personal fork of apache/pulsar to run the
> builds in
> > > > > your
> > > > > > >> "personal CI". There's more details in the previous email in
> this
> > > > > thread.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> Some updates:
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> There has been a discussion with Gavin McDonald
> from
> > > > ASF
> > > > > > >> infra on the-asf slack about getting usage reports from
> GitHub to
> > > > > support
> > > > > > >> the investigation. Slack thread is the same one mentioned in
> the
> > > > > previous
> > > > > > >> email,
> > > > https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279
> > > > > .
> > > > > > >> Gavin already requested the usage report in GitHub UI, but it
> > > > produced
> > > > > > >> invalid results.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> I made a change to mitigate a source of
> additional
> > > > GitHub
> > > > > > >> Actions overhead.
> > > > > > >> > > > > >>>> In the past, each cherry-picked commit to a
> maintenance
> > > > > > >> branch of Pulsar has triggered a lot of workflow runs.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> The solution for cancelling duplicate builds
> > > > > automatically
> > > > > > >> is to add this definition to the workflow definition:
> > > > > > >> > > > > >>>> concurrency:
> > > > > > >> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > > > > >> > > > > >>>>  cancel-in-progress: true
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> I added this to all maintenance branch GitHub
> Actions
> > > > > > >> workflows:
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> branch-2.10 change:
> > > > > > >> > > > > >>>>
> > > > > > >>
> > > > >
> > > >
> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > > > >> > > > > >>>> branch-2.9 change:
> > > > > > >> > > > > >>>>
> > > > > > >>
> > > > >
> > > >
> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > > > >> > > > > >>>> branch-2.8 change:
> > > > > > >> > > > > >>>>
> > > > > > >>
> > > > >
> > > >
> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > > > >> > > > > >>>> branch-2.7:
> > > > > > >> > > > > >>>>
> > > > > > >>
> > > > >
> > > >
> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> branch-2.11 already contains the necessary
> config for
> > > > > > >> cancelling duplicate builds.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> The benefit of the above change is that when
> multiple
> > > > > > >> commits are cherry-picked to a branch at once, only the build
> of the
> > > > > last
> > > > > > >> commit will get run eventually. The builds for the
> intermediate
> > > > > commits
> > > > > > >> will get cancelled. Obviously there's a tradeoff here that we
> don't
> > > > > get the
> > > > > > >> information if one of the earlier commits breaks the build.
> It's the
> > > > > cost
> > > > > > >> that we need to pay. Nevertheless our build is so flaky that
> it's
> > > > > hard to
> > > > > > >> determine whether a failed build result is only caused by bad
> flaky
> > > > > test or
> > > > > > >> whether it's an actual failure. Because of this we don't lose
> > > > > anything by
> > > > > > >> cancelling builds. It's more important to save build
> resources. In
> > > > the
> > > > > > >> maintenance branches for 2.10 and older, the average total
> build
> > > > time
> > > > > > >> consumed is around 20 hours which is a lot.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> At this time, the overhead of maintenance branch
> builds
> > > > > > >> doesn't seem to be the source of the problems. There must be
> some
> > > > > other
> > > > > > >> issue which is possibly related to exceeding a usage quota.
> > > > Hopefully
> > > > > we
> > > > > > >> get the CI slowness issue solved asap.
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> BR,
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> Lari
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > > > >> > > > > >>>>> Hi,
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> GitHub Actions builds have been piling up in
> the build
> > > > > > >> queue in the last few days.
> > > > > > >> > > > > >>>>> I posted on builds@apache.org
> > > > > > >>
> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s
> > > > and
> > > > > > >> created INFRA ticket
> > > > > https://issues.apache.org/jira/browse/INFRA-23633
> > > > > > >> about this issue.
> > > > > > >> > > > > >>>>> There's also a thread on the-asf slack,
> > > > > > >>
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> It seems that our build queue is finally getting
> > > > picked
> > > > > up,
> > > > > > >> but it would be great to see if we hit quota and whether that
> is the
> > > > > cause
> > > > > > >> of pauses.
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> Another issue is that the master branch broke
> after
> > > > > merging
> > > > > > >> 2 conflicting PRs.
> > > > > > >> > > > > >>>>> The fix is in
> > > > > https://github.com/apache/pulsar/pull/17300
> > > > > > >> .
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> Merging PRs will be slow until we have these 2
> > > > problems
> > > > > > >> solved and existing PRs rebased over the changes. Let's
> prioritize
> > > > > merging
> > > > > > >> #17300 before pushing more changes.
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> I'd like to point out that a good way to get
> build
> > > > > feedback
> > > > > > >> before sending a PR, is to run builds on your personal GitHub
> > > > Actions
> > > > > CI.
> > > > > > >> The benefit of this is that it doesn't consume the shared
> quota and
> > > > > builds
> > > > > > >> usually start instantly.
> > > > > > >> > > > > >>>>> There are instructions in the contributors
> guide about
> > > > > > >> this.
> > > > > > >> > > > > >>>>>
> > > > > > >>
> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > > > >> > > > > >>>>> You simply open PRs to your own fork of
> apache/pulsar
> > > > to
> > > > > > >> run builds on your personal GitHub Actions CI.
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> BR,
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>> Lari
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>>
> > > > > > >> > > > > >>>>
> > > > > > >> > > > > >>>
> > > > > > >> > > > > >>
> > > > > > >> > > > >
> > > > > > >> > > > >
> > > > > > >> > > >
> > > > > > >> > >
> > > > > > >> >
> > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > >
>

Re: Pulsar CI congested, master branch build broken

Posted by Michael Marshall <mm...@apache.org>.
Fantastic, thank you Lari and Nicolò!

- Michael

On Thu, Sep 8, 2022 at 9:03 PM Haiting Jiang <ji...@gmail.com> wrote:
>
> Great work. Thank you, Lari and Nicolò.
>
> BR,
> Haiting
>
> On Fri, Sep 9, 2022 at 9:36 AM tison <wa...@gmail.com> wrote:
> >
> > Thank you, Lari and Nicolò!
> > Best,
> > tison.
> >
> >
> > Nicolò Boschi <bo...@gmail.com> 于2022年9月9日周五 02:41写道:
> >
> > > Dear community,
> > >
> > > The plan has been executed.
> > > The summary of our actions is:
> > > 1. We cancelled all pending jobs (queue and in-progress)
> > > 2. We removed the required checks to be able to merge improvements on the
> > > CI workflow
> > > 3. We merged a couple of improvements:
> > >    1. workarounded the possible bug triggered by jobs retries. Now
> > > broker flaky tests are in a dedicated workflow
> > >    2. moved known flaky tests to the flaky suite
> > >    3. optimized the runner consumption for docs-only and cpp-only pulls
> > > 4. We reactivated the required checks.
> > >
> > >
> > > Now it's possible to come back to normal life.
> > > 1. You must rebase your branch to the latest master (there's a button for
> > > you in the UI) or eventually you can close/reopen the pull to trigger the
> > > checks
> > > 2. You can merge a pull request if you want
> > > 3. You will find a new job in the Checks section called "Pulsar CI / Pulsar
> > > CI checks completed" that indicates the Pulsar CI successfully passed
> > >
> > > There's a slight chance that the CI will be stuck again in the next few
> > > days but we will take it monitored.
> > >
> > > Thanks Lari for the nice work!
> > >
> > > Regards,
> > > Nicolò Boschi
> > >
> > >
> > > Il giorno gio 8 set 2022 alle ore 10:55 Lari Hotari <lh...@apache.org>
> > > ha
> > > scritto:
> > >
> > > > Thank you Nicolo.
> > > > There's lazy consensus, let's go forward with the action plan.
> > > >
> > > > -Lari
> > > >
> > > > On 2022/09/08 08:16:05 Nicolò Boschi wrote:
> > > > > This is the pull for step 2.
> > > https://github.com/apache/pulsar/pull/17539
> > > > >
> > > > > This is the script I'm going to use to cancel pending workflows.
> > > > >
> > > >
> > > https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js
> > > > >
> > > > > I'm going to run the script in minutes.
> > > > >
> > > > > I advertised on Slack about what is happening:
> > > > >
> > > >
> > > https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E
> > > > >
> > > > > >we’re going to execute the plan described in the ML. So any queued
> > > > actions
> > > > > will be cancelled. In order to validate your pull it is suggested to
> > > run
> > > > > the actions in your own Pulsar fork. Please don’t re-run failed jobs or
> > > > > push any other commits to avoid triggering new actions
> > > > >
> > > > >
> > > > > Nicolò Boschi
> > > > >
> > > > >
> > > > > Il giorno gio 8 set 2022 alle ore 09:42 Nicolò Boschi <
> > > > boschi1997@gmail.com>
> > > > > ha scritto:
> > > > >
> > > > > > Thanks Lari for the detailed explanation. This is kind of an
> > > emergency
> > > > > > situation and I believe your plan is the way to go now.
> > > > > >
> > > > > > I already prepared a pull for moving the flaky suite out of the
> > > Pulsar
> > > > CI
> > > > > > workflow: https://github.com/nicoloboschi/pulsar/pull/8
> > > > > > I can take care of the execution of the plan.
> > > > > >
> > > > > > > 1. Cancel all existing builds in_progress or queued
> > > > > >
> > > > > > I have a script locally that uses GHA to check and cancel the pending
> > > > > > runs. We can extend it to all the queued builds (will share it soon).
> > > > > >
> > > > > > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > > > merging
> > > > > > PRs.
> > > > > > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > >
> > > > > > After the pull is out, we'll need to cancel all other workflows that
> > > > > > contributors may inadvertently have triggered.
> > > > > >
> > > > > > > 4. Disable all workflows
> > > > > > > 5. Process specific PRs manually to improve the situation.
> > > > > > >    - Make GHA workflow improvements such as
> > > > > > https://github.com/apache/pulsar/pull/17491 and
> > > > > > https://github.com/apache/pulsar/pull/17490
> > > > > > >    - Quarantine all very flaky tests so that everyone doesn't waste
> > > > time
> > > > > > with those. It should be possible to merge a PR even when a
> > > quarantined
> > > > > > test fails.
> > > > > >
> > > > > > in this step we will merge this
> > > > > > https://github.com/nicoloboschi/pulsar/pull/8
> > > > > >
> > > > > > I want to add to the list this improvement to reduce runners usage in
> > > > case
> > > > > > of doc or cpp changes.
> > > > > > https://github.com/nicoloboschi/pulsar/pull/7
> > > > > >
> > > > > >
> > > > > > > 6. Rebase PRs (or close and re-open) that would be processed next
> > > so
> > > > > > that changes are picked up
> > > > > >
> > > > > > It's better to leave this task to the author of the pull in order to
> > > > not
> > > > > > create too much load at the same time
> > > > > >
> > > > > > > 7. Enable workflows
> > > > > > > 8. Start processing PRs with checks to see if things are handled
> > > in a
> > > > > > better way.
> > > > > > > 9. When things are stable, enable required checks again in
> > > > .asf.yaml, in
> > > > > > the meantime be careful about merging PRs
> > > > > > > 10. Fix quarantined flaky tests
> > > > > >
> > > > > >
> > > > > > Nicolò Boschi
> > > > > >
> > > > > >
> > > > > > Il giorno gio 8 set 2022 alle ore 09:27 Lari Hotari <
> > > > lhotari@apache.org>
> > > > > > ha scritto:
> > > > > >
> > > > > >> If my assumption of the GitHub usage metrics bug in the GitHub
> > > Actions
> > > > > >> build job queue fairness algorithm is correct, what would help is
> > > > running
> > > > > >> the flaky unit test group outside of Pulsar CI workflow. In that
> > > > case, the
> > > > > >> impact of the usage metrics would be limited.
> > > > > >>
> > > > > >> The example of
> > > > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > shows
> > > > > >> this flaw as explained in the previous email. The total reported
> > > > execution
> > > > > >> time in that report is 1d 1h 40m 21s of usage and the actual usage
> > > is
> > > > about
> > > > > >> 1/3 of this.
> > > > > >>
> > > > > >> When we move the most commonly failing job out of Pulsar CI
> > > workflow,
> > > > the
> > > > > >> impact of the possible usage metrics bug would be much less. I hope
> > > > GitHub
> > > > > >> support responds to my issue and queries about this bug. It might
> > > > take up
> > > > > >> to 7 days to get a reply and for technical questions more time. In
> > > the
> > > > > >> meantime we need a solution for getting over this CI slowness issue.
> > > > > >>
> > > > > >> -Lari
> > > > > >>
> > > > > >>
> > > > > >>
> > > > > >> On 2022/09/08 06:34:42 Lari Hotari wrote:
> > > > > >> > My current assumption of the CI slowness problem is that the usage
> > > > > >> metrics for Apache Pulsar builds on GitHub side is done incorrectly
> > > > and
> > > > > >> that is resulting in apache/pulsar builds getting throttled. This
> > > > > >> assumption might be wrong, but it's the best guess at the moment.
> > > > > >> >
> > > > > >> > The facts that support this assumption is that when re-running
> > > > failed
> > > > > >> jobs in a workflow, the execution times for previously successful
> > > > jobs get
> > > > > >> counted as if they have all run:
> > > > > >> > Here's an example:
> > > > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > >> > The reported total usage is about 3x than the actual usage.
> > > > > >> >
> > > > > >> > The assumption that I have is that the "fairness algorithm" that
> > > > GitHub
> > > > > >> uses to provide all Apache projects about the same amount of GitHub
> > > > Actions
> > > > > >> resources would take this flawed usage as the basis of it's
> > > decisions
> > > > and
> > > > > >> it decides to throttle apache/pulsar builds.
> > > > > >> >
> > > > > >> > The reason why we are getting hit by this now is that there is a
> > > > high
> > > > > >> number of flaky test failures that cause almost every build to fail
> > > > and we
> > > > > >> have been re-running a lot of builds.
> > > > > >> >
> > > > > >> > The other fact to support the theory of flawed usage metrics used
> > > in
> > > > > >> the fairness algorithm is that other Apache projects aren't
> > > reporting
> > > > > >> issues about GitHub Actions slowness. This is mentioned in Jarek
> > > > Potiuk's
> > > > > >> comments on INFRA-23633 [1]:
> > > > > >> > > Unlike the case 2 years ago, the problem is not affecting all
> > > > > >> projects. In Apache Airflow we do > not see any particular slow-down
> > > > with
> > > > > >> Public Runners at this moment (just checked - >
> > > > > >> > > everything is "as usual").. So I'd say it is something specific
> > > to
> > > > > >> Pulsar not to "ASF" as a whole.
> > > > > >> >
> > > > > >> > There are also other comments from Jarek about the GitHub
> > > "fairness
> > > > > >> algorithm" (comment [2], other comment [3])
> > > > > >> > > But I believe the current problem is different - it might be
> > > > (looking
> > > > > >> at your jobs) simply a bug
> > > > > >> > > in GA that you hit or indeed your demands are simply too high.
> > > > > >> >
> > > > > >> > I have opened tickets (2 tickets: 2 days ago and yesterday) to
> > > > > >> support.github.com and there hasn't been any response to the
> > > ticket.
> > > > It
> > > > > >> might take up to 7 days to get a response. We cannot rely on GitHub
> > > > Support
> > > > > >> resolving this issue.
> > > > > >> >
> > > > > >> > I propose that we go ahead with the previously suggested action
> > > plan
> > > > > >> > > One possible way forward:
> > > > > >> > > 1. Cancel all existing builds in_progress or queued
> > > > > >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > > > > >> merging PRs.
> > > > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > >> > > 4. Disable all workflows
> > > > > >> > > 5. Process specific PRs manually to improve the situation.
> > > > > >> > >    - Make GHA workflow improvements such as
> > > > > >> https://github.com/apache/pulsar/pull/17491 and
> > > > > >> https://github.com/apache/pulsar/pull/17490
> > > > > >> > >    - Quarantine all very flaky tests so that everyone doesn't
> > > > waste
> > > > > >> time with those. It should be possible to merge a PR even when a
> > > > > >> quarantined test fails.
> > > > > >> > > 6. Rebase PRs (or close and re-open) that would be processed
> > > next
> > > > so
> > > > > >> that changes are picked up
> > > > > >> > > 7. Enable workflows
> > > > > >> > > 8. Start processing PRs with checks to see if things are handled
> > > > in a
> > > > > >> better way.
> > > > > >> > > 9. When things are stable, enable required checks again in
> > > > .asf.yaml,
> > > > > >> in the meantime be careful about merging PRs
> > > > > >> > > 10. Fix quarantined flaky tests
> > > > > >> >
> > > > > >> > To clarify, steps 1-6 would be done optimally in 1 day and we
> > > would
> > > > > >> stop processing ordinary PRs during this time. We would only handle
> > > > PRs
> > > > > >> that fix the CI situation during this exceptional period.
> > > > > >> >
> > > > > >> > -Lari
> > > > > >> >
> > > > > >> > Links to Jarek's comments:
> > > > > >> > [1]
> > > > > >>
> > > >
> > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> > > > > >> > [2]
> > > > > >>
> > > >
> > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > > >> > [3]
> > > > > >>
> > > >
> > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > > >> >
> > > > > >> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> > > > > >> > > One possible way forward:
> > > > > >> > > 1. Cancel all existing builds in_progress or queued
> > > > > >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > > > > >> merging PRs.
> > > > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > > >> > > 4. Disable all workflows
> > > > > >> > > 5. Process specific PRs manually to improve the situation.
> > > > > >> > >    - Make GHA workflow improvements such as
> > > > > >> https://github.com/apache/pulsar/pull/17491 and
> > > > > >> https://github.com/apache/pulsar/pull/17490
> > > > > >> > >    - Quarantine all very flaky tests so that everyone doesn't
> > > > waste
> > > > > >> time with those. It should be possible to merge a PR even when a
> > > > > >> quarantined test fails.
> > > > > >> > > 6. Rebase PRs (or close and re-open) that would be processed
> > > next
> > > > so
> > > > > >> that changes are picked up
> > > > > >> > > 7. Enable workflows
> > > > > >> > > 8. Start processing PRs with checks to see if things are handled
> > > > in a
> > > > > >> better way.
> > > > > >> > > 9. When things are stable, enable required checks again in
> > > > .asf.yaml,
> > > > > >> in the meantime be careful about merging PRs
> > > > > >> > > 10. Fix quarantined flaky tests
> > > > > >> > >
> > > > > >> > > -Lari
> > > > > >> > >
> > > > > >> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > > > >> > > > The problem with CI is becoming worse. The build queue is 235
> > > > jobs
> > > > > >> now and the queue time is over 7 hours.
> > > > > >> > > >
> > > > > >> > > > We will need to start shedding load in the build queue and get
> > > > some
> > > > > >> fixes in.
> > > > > >> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues
> > > to
> > > > > >> contain details about some activities. I have created 2 GitHub
> > > Support
> > > > > >> tickets, but usually it takes up to a week to get a response.
> > > > > >> > > >
> > > > > >> > > > I have some assumptions about the issue, but they are just
> > > > > >> assumptions.
> > > > > >> > > > One oddity is that when re-running failed jobs is used in a
> > > > large
> > > > > >> workflow, the execution times for previously successful jobs get
> > > > counted as
> > > > > >> if they have run.
> > > > > >> > > > Here's an example:
> > > > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > > >> > > > The reported usage is about 3x than the actual usage.
> > > > > >> > > > The assumption that I have is that the "fairness algorithm"
> > > that
> > > > > >> GitHub uses to provide all Apache projects about the same amount of
> > > > GitHub
> > > > > >> Actions resources would take this flawed usage as the basis of it's
> > > > > >> decisions.
> > > > > >> > > > The reason why we are getting hit by this now is that there
> > > is a
> > > > > >> high number of flaky test failures that cause almost every build to
> > > > fail
> > > > > >> and we are re-running a lot of builds.
> > > > > >> > > >
> > > > > >> > > > Another problem there is that the GitHub Actions search
> > > doesn't
> > > > > >> always show all workflow runs that are running. This has happened
> > > > before
> > > > > >> when the GitHub Actions workflow search index was corrupted. GitHub
> > > > Support
> > > > > >> resolved that by rebuilding the search index with some manual admin
> > > > > >> operation behind the scenes.
> > > > > >> > > >
> > > > > >> > > > I'm proposing that we start shedding load from CI by
> > > cancelling
> > > > > >> build jobs and selecting which jobs to process so that we get the CI
> > > > issue
> > > > > >> resolved. We might also have to disable required checks so that we
> > > > have
> > > > > >> some way to get changes merged while CI doesn't work properly.
> > > > > >> > > >
> > > > > >> > > > I'm expecting lazy consensus on fixing CI unless someone
> > > > proposes a
> > > > > >> better plan. Let's keep everyone informed in this mailing list
> > > thread.
> > > > > >> > > >
> > > > > >> > > > -Lari
> > > > > >> > > >
> > > > > >> > > >
> > > > > >> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > > >> > > > > We are going to need to take actions to fix our problems.
> > > See
> > > > > >>
> > > >
> > > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > > >> > > > >
> > > > > >> > > > > Jarek has done a large amount of GitHub Action work with
> > > > Apache
> > > > > >> Airflow and his suggestions might be helpful. One of his suggestions
> > > > was
> > > > > >> Apache Yetus. I think he means using the Maven plugins -
> > > > > >> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <
> > > lhotari@apache.org
> > > > >
> > > > > >> wrote:
> > > > > >> > > > > >
> > > > > >> > > > > > The Apache Infra ticket is
> > > > > >> https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > > >> > > > > >
> > > > > >> > > > > > -Lari
> > > > > >> > > > > >
> > > > > >> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > > >> > > > > >> I asked for an update on the Apache org GitHub Actions
> > > > usage
> > > > > >> stats from Gavin McDonald on the-asf slack in this thread:
> > > > > >>
> > > >
> > > https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > > > > >> .
> > > > > >> > > > > >>
> > > > > >> > > > > >> I hope we get this issue resolved since it delays PR
> > > > > >> processing a lot.
> > > > > >> > > > > >>
> > > > > >> > > > > >> -Lari
> > > > > >> > > > > >>
> > > > > >> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > > >> > > > > >>> Pulsar CI continues to be congested, and the build queue
> > > > [1]
> > > > > >> is very long at the moment. There are 147 build jobs in the queue
> > > and
> > > > 16
> > > > > >> jobs in progress at the moment.
> > > > > >> > > > > >>>
> > > > > >> > > > > >>> I would strongly advice everyone to use "personal CI" to
> > > > > >> mitigate the issue of the long delay of CI feedback. You can simply
> > > > open a
> > > > > >> PR to your own personal fork of apache/pulsar to run the builds in
> > > > your
> > > > > >> "personal CI". There's more details in the previous emails in this
> > > > thread.
> > > > > >> > > > > >>>
> > > > > >> > > > > >>> -Lari
> > > > > >> > > > > >>>
> > > > > >> > > > > >>> [1] - build queue:
> > > > > >> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > > >> > > > > >>>
> > > > > >> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > > >> > > > > >>>> Pulsar CI continues to be congested, and the build
> > > queue
> > > > is
> > > > > >> long.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> I would strongly advice everyone to use "personal CI"
> > > to
> > > > > >> mitigate the issue of the long delay of CI feedback. You can simply
> > > > open a
> > > > > >> PR to your own personal fork of apache/pulsar to run the builds in
> > > > your
> > > > > >> "personal CI". There's more details in the previous email in this
> > > > thread.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> Some updates:
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> There has been a discussion with Gavin McDonald from
> > > ASF
> > > > > >> infra on the-asf slack about getting usage reports from GitHub to
> > > > support
> > > > > >> the investigation. Slack thread is the same one mentioned in the
> > > > previous
> > > > > >> email,
> > > https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279
> > > > .
> > > > > >> Gavin already requested the usage report in GitHub UI, but it
> > > produced
> > > > > >> invalid results.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> I made a change to mitigate a source of additional
> > > GitHub
> > > > > >> Actions overhead.
> > > > > >> > > > > >>>> In the past, each cherry-picked commit to a maintenance
> > > > > >> branch of Pulsar has triggered a lot of workflow runs.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> The solution for cancelling duplicate builds
> > > > automatically
> > > > > >> is to add this definition to the workflow definition:
> > > > > >> > > > > >>>> concurrency:
> > > > > >> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > > > >> > > > > >>>>  cancel-in-progress: true
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> I added this to all maintenance branch GitHub Actions
> > > > > >> workflows:
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> branch-2.10 change:
> > > > > >> > > > > >>>>
> > > > > >>
> > > >
> > > https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > > >> > > > > >>>> branch-2.9 change:
> > > > > >> > > > > >>>>
> > > > > >>
> > > >
> > > https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > > >> > > > > >>>> branch-2.8 change:
> > > > > >> > > > > >>>>
> > > > > >>
> > > >
> > > https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > > >> > > > > >>>> branch-2.7:
> > > > > >> > > > > >>>>
> > > > > >>
> > > >
> > > https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> branch-2.11 already contains the necessary config for
> > > > > >> cancelling duplicate builds.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> The benefit of the above change is that when multiple
> > > > > >> commits are cherry-picked to a branch at once, only the build of the
> > > > last
> > > > > >> commit will get run eventually. The builds for the intermediate
> > > > commits
> > > > > >> will get cancelled. Obviously there's a tradeoff here that we don't
> > > > get the
> > > > > >> information if one of the earlier commits breaks the build. It's the
> > > > cost
> > > > > >> that we need to pay. Nevertheless our build is so flaky that it's
> > > > hard to
> > > > > >> determine whether a failed build result is only caused by bad flaky
> > > > test or
> > > > > >> whether it's an actual failure. Because of this we don't lose
> > > > anything by
> > > > > >> cancelling builds. It's more important to save build resources. In
> > > the
> > > > > >> maintenance branches for 2.10 and older, the average total build
> > > time
> > > > > >> consumed is around 20 hours which is a lot.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> At this time, the overhead of maintenance branch builds
> > > > > >> doesn't seem to be the source of the problems. There must be some
> > > > other
> > > > > >> issue which is possibly related to exceeding a usage quota.
> > > Hopefully
> > > > we
> > > > > >> get the CI slowness issue solved asap.
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> BR,
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> Lari
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > > >> > > > > >>>>> Hi,
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> GitHub Actions builds have been piling up in the build
> > > > > >> queue in the last few days.
> > > > > >> > > > > >>>>> I posted on builds@apache.org
> > > > > >> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s
> > > and
> > > > > >> created INFRA ticket
> > > > https://issues.apache.org/jira/browse/INFRA-23633
> > > > > >> about this issue.
> > > > > >> > > > > >>>>> There's also a thread on the-asf slack,
> > > > > >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> It seems that our build queue is finally getting
> > > picked
> > > > up,
> > > > > >> but it would be great to see if we hit quota and whether that is the
> > > > cause
> > > > > >> of pauses.
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> Another issue is that the master branch broke after
> > > > merging
> > > > > >> 2 conflicting PRs.
> > > > > >> > > > > >>>>> The fix is in
> > > > https://github.com/apache/pulsar/pull/17300
> > > > > >> .
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> Merging PRs will be slow until we have these 2
> > > problems
> > > > > >> solved and existing PRs rebased over the changes. Let's prioritize
> > > > merging
> > > > > >> #17300 before pushing more changes.
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> I'd like to point out that a good way to get build
> > > > feedback
> > > > > >> before sending a PR, is to run builds on your personal GitHub
> > > Actions
> > > > CI.
> > > > > >> The benefit of this is that it doesn't consume the shared quota and
> > > > builds
> > > > > >> usually start instantly.
> > > > > >> > > > > >>>>> There are instructions in the contributors guide about
> > > > > >> this.
> > > > > >> > > > > >>>>>
> > > > > >> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > > >> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar
> > > to
> > > > > >> run builds on your personal GitHub Actions CI.
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> BR,
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>> Lari
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>>
> > > > > >> > > > > >>>>
> > > > > >> > > > > >>>
> > > > > >> > > > > >>
> > > > > >> > > > >
> > > > > >> > > > >
> > > > > >> > > >
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >

Re: Pulsar CI congested, master branch build broken

Posted by Haiting Jiang <ji...@gmail.com>.
Great work. Thank you, Lari and Nicolò.

BR,
Haiting

On Fri, Sep 9, 2022 at 9:36 AM tison <wa...@gmail.com> wrote:
>
> Thank you, Lari and Nicolò!
> Best,
> tison.
>
>
> Nicolò Boschi <bo...@gmail.com> 于2022年9月9日周五 02:41写道:
>
> > Dear community,
> >
> > The plan has been executed.
> > The summary of our actions is:
> > 1. We cancelled all pending jobs (queue and in-progress)
> > 2. We removed the required checks to be able to merge improvements on the
> > CI workflow
> > 3. We merged a couple of improvements:
> >    1. workarounded the possible bug triggered by jobs retries. Now
> > broker flaky tests are in a dedicated workflow
> >    2. moved known flaky tests to the flaky suite
> >    3. optimized the runner consumption for docs-only and cpp-only pulls
> > 4. We reactivated the required checks.
> >
> >
> > Now it's possible to come back to normal life.
> > 1. You must rebase your branch to the latest master (there's a button for
> > you in the UI) or eventually you can close/reopen the pull to trigger the
> > checks
> > 2. You can merge a pull request if you want
> > 3. You will find a new job in the Checks section called "Pulsar CI / Pulsar
> > CI checks completed" that indicates the Pulsar CI successfully passed
> >
> > There's a slight chance that the CI will be stuck again in the next few
> > days but we will take it monitored.
> >
> > Thanks Lari for the nice work!
> >
> > Regards,
> > Nicolò Boschi
> >
> >
> > Il giorno gio 8 set 2022 alle ore 10:55 Lari Hotari <lh...@apache.org>
> > ha
> > scritto:
> >
> > > Thank you Nicolo.
> > > There's lazy consensus, let's go forward with the action plan.
> > >
> > > -Lari
> > >
> > > On 2022/09/08 08:16:05 Nicolò Boschi wrote:
> > > > This is the pull for step 2.
> > https://github.com/apache/pulsar/pull/17539
> > > >
> > > > This is the script I'm going to use to cancel pending workflows.
> > > >
> > >
> > https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js
> > > >
> > > > I'm going to run the script in minutes.
> > > >
> > > > I advertised on Slack about what is happening:
> > > >
> > >
> > https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E
> > > >
> > > > >we’re going to execute the plan described in the ML. So any queued
> > > actions
> > > > will be cancelled. In order to validate your pull it is suggested to
> > run
> > > > the actions in your own Pulsar fork. Please don’t re-run failed jobs or
> > > > push any other commits to avoid triggering new actions
> > > >
> > > >
> > > > Nicolò Boschi
> > > >
> > > >
> > > > Il giorno gio 8 set 2022 alle ore 09:42 Nicolò Boschi <
> > > boschi1997@gmail.com>
> > > > ha scritto:
> > > >
> > > > > Thanks Lari for the detailed explanation. This is kind of an
> > emergency
> > > > > situation and I believe your plan is the way to go now.
> > > > >
> > > > > I already prepared a pull for moving the flaky suite out of the
> > Pulsar
> > > CI
> > > > > workflow: https://github.com/nicoloboschi/pulsar/pull/8
> > > > > I can take care of the execution of the plan.
> > > > >
> > > > > > 1. Cancel all existing builds in_progress or queued
> > > > >
> > > > > I have a script locally that uses GHA to check and cancel the pending
> > > > > runs. We can extend it to all the queued builds (will share it soon).
> > > > >
> > > > > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > > merging
> > > > > PRs.
> > > > > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > >
> > > > > After the pull is out, we'll need to cancel all other workflows that
> > > > > contributors may inadvertently have triggered.
> > > > >
> > > > > > 4. Disable all workflows
> > > > > > 5. Process specific PRs manually to improve the situation.
> > > > > >    - Make GHA workflow improvements such as
> > > > > https://github.com/apache/pulsar/pull/17491 and
> > > > > https://github.com/apache/pulsar/pull/17490
> > > > > >    - Quarantine all very flaky tests so that everyone doesn't waste
> > > time
> > > > > with those. It should be possible to merge a PR even when a
> > quarantined
> > > > > test fails.
> > > > >
> > > > > in this step we will merge this
> > > > > https://github.com/nicoloboschi/pulsar/pull/8
> > > > >
> > > > > I want to add to the list this improvement to reduce runners usage in
> > > case
> > > > > of doc or cpp changes.
> > > > > https://github.com/nicoloboschi/pulsar/pull/7
> > > > >
> > > > >
> > > > > > 6. Rebase PRs (or close and re-open) that would be processed next
> > so
> > > > > that changes are picked up
> > > > >
> > > > > It's better to leave this task to the author of the pull in order to
> > > not
> > > > > create too much load at the same time
> > > > >
> > > > > > 7. Enable workflows
> > > > > > 8. Start processing PRs with checks to see if things are handled
> > in a
> > > > > better way.
> > > > > > 9. When things are stable, enable required checks again in
> > > .asf.yaml, in
> > > > > the meantime be careful about merging PRs
> > > > > > 10. Fix quarantined flaky tests
> > > > >
> > > > >
> > > > > Nicolò Boschi
> > > > >
> > > > >
> > > > > Il giorno gio 8 set 2022 alle ore 09:27 Lari Hotari <
> > > lhotari@apache.org>
> > > > > ha scritto:
> > > > >
> > > > >> If my assumption of the GitHub usage metrics bug in the GitHub
> > Actions
> > > > >> build job queue fairness algorithm is correct, what would help is
> > > running
> > > > >> the flaky unit test group outside of Pulsar CI workflow. In that
> > > case, the
> > > > >> impact of the usage metrics would be limited.
> > > > >>
> > > > >> The example of
> > > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > shows
> > > > >> this flaw as explained in the previous email. The total reported
> > > execution
> > > > >> time in that report is 1d 1h 40m 21s of usage and the actual usage
> > is
> > > about
> > > > >> 1/3 of this.
> > > > >>
> > > > >> When we move the most commonly failing job out of Pulsar CI
> > workflow,
> > > the
> > > > >> impact of the possible usage metrics bug would be much less. I hope
> > > GitHub
> > > > >> support responds to my issue and queries about this bug. It might
> > > take up
> > > > >> to 7 days to get a reply and for technical questions more time. In
> > the
> > > > >> meantime we need a solution for getting over this CI slowness issue.
> > > > >>
> > > > >> -Lari
> > > > >>
> > > > >>
> > > > >>
> > > > >> On 2022/09/08 06:34:42 Lari Hotari wrote:
> > > > >> > My current assumption of the CI slowness problem is that the usage
> > > > >> metrics for Apache Pulsar builds on GitHub side is done incorrectly
> > > and
> > > > >> that is resulting in apache/pulsar builds getting throttled. This
> > > > >> assumption might be wrong, but it's the best guess at the moment.
> > > > >> >
> > > > >> > The facts that support this assumption is that when re-running
> > > failed
> > > > >> jobs in a workflow, the execution times for previously successful
> > > jobs get
> > > > >> counted as if they have all run:
> > > > >> > Here's an example:
> > > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > >> > The reported total usage is about 3x than the actual usage.
> > > > >> >
> > > > >> > The assumption that I have is that the "fairness algorithm" that
> > > GitHub
> > > > >> uses to provide all Apache projects about the same amount of GitHub
> > > Actions
> > > > >> resources would take this flawed usage as the basis of it's
> > decisions
> > > and
> > > > >> it decides to throttle apache/pulsar builds.
> > > > >> >
> > > > >> > The reason why we are getting hit by this now is that there is a
> > > high
> > > > >> number of flaky test failures that cause almost every build to fail
> > > and we
> > > > >> have been re-running a lot of builds.
> > > > >> >
> > > > >> > The other fact to support the theory of flawed usage metrics used
> > in
> > > > >> the fairness algorithm is that other Apache projects aren't
> > reporting
> > > > >> issues about GitHub Actions slowness. This is mentioned in Jarek
> > > Potiuk's
> > > > >> comments on INFRA-23633 [1]:
> > > > >> > > Unlike the case 2 years ago, the problem is not affecting all
> > > > >> projects. In Apache Airflow we do > not see any particular slow-down
> > > with
> > > > >> Public Runners at this moment (just checked - >
> > > > >> > > everything is "as usual").. So I'd say it is something specific
> > to
> > > > >> Pulsar not to "ASF" as a whole.
> > > > >> >
> > > > >> > There are also other comments from Jarek about the GitHub
> > "fairness
> > > > >> algorithm" (comment [2], other comment [3])
> > > > >> > > But I believe the current problem is different - it might be
> > > (looking
> > > > >> at your jobs) simply a bug
> > > > >> > > in GA that you hit or indeed your demands are simply too high.
> > > > >> >
> > > > >> > I have opened tickets (2 tickets: 2 days ago and yesterday) to
> > > > >> support.github.com and there hasn't been any response to the
> > ticket.
> > > It
> > > > >> might take up to 7 days to get a response. We cannot rely on GitHub
> > > Support
> > > > >> resolving this issue.
> > > > >> >
> > > > >> > I propose that we go ahead with the previously suggested action
> > plan
> > > > >> > > One possible way forward:
> > > > >> > > 1. Cancel all existing builds in_progress or queued
> > > > >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > > > >> merging PRs.
> > > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > >> > > 4. Disable all workflows
> > > > >> > > 5. Process specific PRs manually to improve the situation.
> > > > >> > >    - Make GHA workflow improvements such as
> > > > >> https://github.com/apache/pulsar/pull/17491 and
> > > > >> https://github.com/apache/pulsar/pull/17490
> > > > >> > >    - Quarantine all very flaky tests so that everyone doesn't
> > > waste
> > > > >> time with those. It should be possible to merge a PR even when a
> > > > >> quarantined test fails.
> > > > >> > > 6. Rebase PRs (or close and re-open) that would be processed
> > next
> > > so
> > > > >> that changes are picked up
> > > > >> > > 7. Enable workflows
> > > > >> > > 8. Start processing PRs with checks to see if things are handled
> > > in a
> > > > >> better way.
> > > > >> > > 9. When things are stable, enable required checks again in
> > > .asf.yaml,
> > > > >> in the meantime be careful about merging PRs
> > > > >> > > 10. Fix quarantined flaky tests
> > > > >> >
> > > > >> > To clarify, steps 1-6 would be done optimally in 1 day and we
> > would
> > > > >> stop processing ordinary PRs during this time. We would only handle
> > > PRs
> > > > >> that fix the CI situation during this exceptional period.
> > > > >> >
> > > > >> > -Lari
> > > > >> >
> > > > >> > Links to Jarek's comments:
> > > > >> > [1]
> > > > >>
> > >
> > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> > > > >> > [2]
> > > > >>
> > >
> > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > >> > [3]
> > > > >>
> > >
> > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > > >> >
> > > > >> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> > > > >> > > One possible way forward:
> > > > >> > > 1. Cancel all existing builds in_progress or queued
> > > > >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > > > >> merging PRs.
> > > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > > >> > > 4. Disable all workflows
> > > > >> > > 5. Process specific PRs manually to improve the situation.
> > > > >> > >    - Make GHA workflow improvements such as
> > > > >> https://github.com/apache/pulsar/pull/17491 and
> > > > >> https://github.com/apache/pulsar/pull/17490
> > > > >> > >    - Quarantine all very flaky tests so that everyone doesn't
> > > waste
> > > > >> time with those. It should be possible to merge a PR even when a
> > > > >> quarantined test fails.
> > > > >> > > 6. Rebase PRs (or close and re-open) that would be processed
> > next
> > > so
> > > > >> that changes are picked up
> > > > >> > > 7. Enable workflows
> > > > >> > > 8. Start processing PRs with checks to see if things are handled
> > > in a
> > > > >> better way.
> > > > >> > > 9. When things are stable, enable required checks again in
> > > .asf.yaml,
> > > > >> in the meantime be careful about merging PRs
> > > > >> > > 10. Fix quarantined flaky tests
> > > > >> > >
> > > > >> > > -Lari
> > > > >> > >
> > > > >> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > > >> > > > The problem with CI is becoming worse. The build queue is 235
> > > jobs
> > > > >> now and the queue time is over 7 hours.
> > > > >> > > >
> > > > >> > > > We will need to start shedding load in the build queue and get
> > > some
> > > > >> fixes in.
> > > > >> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues
> > to
> > > > >> contain details about some activities. I have created 2 GitHub
> > Support
> > > > >> tickets, but usually it takes up to a week to get a response.
> > > > >> > > >
> > > > >> > > > I have some assumptions about the issue, but they are just
> > > > >> assumptions.
> > > > >> > > > One oddity is that when re-running failed jobs is used in a
> > > large
> > > > >> workflow, the execution times for previously successful jobs get
> > > counted as
> > > > >> if they have run.
> > > > >> > > > Here's an example:
> > > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > >> > > > The reported usage is about 3x than the actual usage.
> > > > >> > > > The assumption that I have is that the "fairness algorithm"
> > that
> > > > >> GitHub uses to provide all Apache projects about the same amount of
> > > GitHub
> > > > >> Actions resources would take this flawed usage as the basis of it's
> > > > >> decisions.
> > > > >> > > > The reason why we are getting hit by this now is that there
> > is a
> > > > >> high number of flaky test failures that cause almost every build to
> > > fail
> > > > >> and we are re-running a lot of builds.
> > > > >> > > >
> > > > >> > > > Another problem there is that the GitHub Actions search
> > doesn't
> > > > >> always show all workflow runs that are running. This has happened
> > > before
> > > > >> when the GitHub Actions workflow search index was corrupted. GitHub
> > > Support
> > > > >> resolved that by rebuilding the search index with some manual admin
> > > > >> operation behind the scenes.
> > > > >> > > >
> > > > >> > > > I'm proposing that we start shedding load from CI by
> > cancelling
> > > > >> build jobs and selecting which jobs to process so that we get the CI
> > > issue
> > > > >> resolved. We might also have to disable required checks so that we
> > > have
> > > > >> some way to get changes merged while CI doesn't work properly.
> > > > >> > > >
> > > > >> > > > I'm expecting lazy consensus on fixing CI unless someone
> > > proposes a
> > > > >> better plan. Let's keep everyone informed in this mailing list
> > thread.
> > > > >> > > >
> > > > >> > > > -Lari
> > > > >> > > >
> > > > >> > > >
> > > > >> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > >> > > > > We are going to need to take actions to fix our problems.
> > See
> > > > >>
> > >
> > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > >> > > > >
> > > > >> > > > > Jarek has done a large amount of GitHub Action work with
> > > Apache
> > > > >> Airflow and his suggestions might be helpful. One of his suggestions
> > > was
> > > > >> Apache Yetus. I think he means using the Maven plugins -
> > > > >> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <
> > lhotari@apache.org
> > > >
> > > > >> wrote:
> > > > >> > > > > >
> > > > >> > > > > > The Apache Infra ticket is
> > > > >> https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > >> > > > > >
> > > > >> > > > > > -Lari
> > > > >> > > > > >
> > > > >> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > >> > > > > >> I asked for an update on the Apache org GitHub Actions
> > > usage
> > > > >> stats from Gavin McDonald on the-asf slack in this thread:
> > > > >>
> > >
> > https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > > > >> .
> > > > >> > > > > >>
> > > > >> > > > > >> I hope we get this issue resolved since it delays PR
> > > > >> processing a lot.
> > > > >> > > > > >>
> > > > >> > > > > >> -Lari
> > > > >> > > > > >>
> > > > >> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > >> > > > > >>> Pulsar CI continues to be congested, and the build queue
> > > [1]
> > > > >> is very long at the moment. There are 147 build jobs in the queue
> > and
> > > 16
> > > > >> jobs in progress at the moment.
> > > > >> > > > > >>>
> > > > >> > > > > >>> I would strongly advice everyone to use "personal CI" to
> > > > >> mitigate the issue of the long delay of CI feedback. You can simply
> > > open a
> > > > >> PR to your own personal fork of apache/pulsar to run the builds in
> > > your
> > > > >> "personal CI". There's more details in the previous emails in this
> > > thread.
> > > > >> > > > > >>>
> > > > >> > > > > >>> -Lari
> > > > >> > > > > >>>
> > > > >> > > > > >>> [1] - build queue:
> > > > >> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > >> > > > > >>>
> > > > >> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > >> > > > > >>>> Pulsar CI continues to be congested, and the build
> > queue
> > > is
> > > > >> long.
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> I would strongly advice everyone to use "personal CI"
> > to
> > > > >> mitigate the issue of the long delay of CI feedback. You can simply
> > > open a
> > > > >> PR to your own personal fork of apache/pulsar to run the builds in
> > > your
> > > > >> "personal CI". There's more details in the previous email in this
> > > thread.
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> Some updates:
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> There has been a discussion with Gavin McDonald from
> > ASF
> > > > >> infra on the-asf slack about getting usage reports from GitHub to
> > > support
> > > > >> the investigation. Slack thread is the same one mentioned in the
> > > previous
> > > > >> email,
> > https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279
> > > .
> > > > >> Gavin already requested the usage report in GitHub UI, but it
> > produced
> > > > >> invalid results.
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> I made a change to mitigate a source of additional
> > GitHub
> > > > >> Actions overhead.
> > > > >> > > > > >>>> In the past, each cherry-picked commit to a maintenance
> > > > >> branch of Pulsar has triggered a lot of workflow runs.
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> The solution for cancelling duplicate builds
> > > automatically
> > > > >> is to add this definition to the workflow definition:
> > > > >> > > > > >>>> concurrency:
> > > > >> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > > >> > > > > >>>>  cancel-in-progress: true
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> I added this to all maintenance branch GitHub Actions
> > > > >> workflows:
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> branch-2.10 change:
> > > > >> > > > > >>>>
> > > > >>
> > >
> > https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > >> > > > > >>>> branch-2.9 change:
> > > > >> > > > > >>>>
> > > > >>
> > >
> > https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > >> > > > > >>>> branch-2.8 change:
> > > > >> > > > > >>>>
> > > > >>
> > >
> > https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > >> > > > > >>>> branch-2.7:
> > > > >> > > > > >>>>
> > > > >>
> > >
> > https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> branch-2.11 already contains the necessary config for
> > > > >> cancelling duplicate builds.
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> The benefit of the above change is that when multiple
> > > > >> commits are cherry-picked to a branch at once, only the build of the
> > > last
> > > > >> commit will get run eventually. The builds for the intermediate
> > > commits
> > > > >> will get cancelled. Obviously there's a tradeoff here that we don't
> > > get the
> > > > >> information if one of the earlier commits breaks the build. It's the
> > > cost
> > > > >> that we need to pay. Nevertheless our build is so flaky that it's
> > > hard to
> > > > >> determine whether a failed build result is only caused by bad flaky
> > > test or
> > > > >> whether it's an actual failure. Because of this we don't lose
> > > anything by
> > > > >> cancelling builds. It's more important to save build resources. In
> > the
> > > > >> maintenance branches for 2.10 and older, the average total build
> > time
> > > > >> consumed is around 20 hours which is a lot.
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> At this time, the overhead of maintenance branch builds
> > > > >> doesn't seem to be the source of the problems. There must be some
> > > other
> > > > >> issue which is possibly related to exceeding a usage quota.
> > Hopefully
> > > we
> > > > >> get the CI slowness issue solved asap.
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> BR,
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> Lari
> > > > >> > > > > >>>>
> > > > >> > > > > >>>>
> > > > >> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > >> > > > > >>>>> Hi,
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>> GitHub Actions builds have been piling up in the build
> > > > >> queue in the last few days.
> > > > >> > > > > >>>>> I posted on builds@apache.org
> > > > >> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s
> > and
> > > > >> created INFRA ticket
> > > https://issues.apache.org/jira/browse/INFRA-23633
> > > > >> about this issue.
> > > > >> > > > > >>>>> There's also a thread on the-asf slack,
> > > > >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>> It seems that our build queue is finally getting
> > picked
> > > up,
> > > > >> but it would be great to see if we hit quota and whether that is the
> > > cause
> > > > >> of pauses.
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>> Another issue is that the master branch broke after
> > > merging
> > > > >> 2 conflicting PRs.
> > > > >> > > > > >>>>> The fix is in
> > > https://github.com/apache/pulsar/pull/17300
> > > > >> .
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>> Merging PRs will be slow until we have these 2
> > problems
> > > > >> solved and existing PRs rebased over the changes. Let's prioritize
> > > merging
> > > > >> #17300 before pushing more changes.
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>> I'd like to point out that a good way to get build
> > > feedback
> > > > >> before sending a PR, is to run builds on your personal GitHub
> > Actions
> > > CI.
> > > > >> The benefit of this is that it doesn't consume the shared quota and
> > > builds
> > > > >> usually start instantly.
> > > > >> > > > > >>>>> There are instructions in the contributors guide about
> > > > >> this.
> > > > >> > > > > >>>>>
> > > > >> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > >> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar
> > to
> > > > >> run builds on your personal GitHub Actions CI.
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>> BR,
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>> Lari
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>>
> > > > >> > > > > >>>>
> > > > >> > > > > >>>
> > > > >> > > > > >>
> > > > >> > > > >
> > > > >> > > > >
> > > > >> > > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >

Re: Pulsar CI congested, master branch build broken

Posted by tison <wa...@gmail.com>.
Thank you, Lari and Nicolò!
Best,
tison.


Nicolò Boschi <bo...@gmail.com> 于2022年9月9日周五 02:41写道:

> Dear community,
>
> The plan has been executed.
> The summary of our actions is:
> 1. We cancelled all pending jobs (queue and in-progress)
> 2. We removed the required checks to be able to merge improvements on the
> CI workflow
> 3. We merged a couple of improvements:
>    1. workarounded the possible bug triggered by jobs retries. Now
> broker flaky tests are in a dedicated workflow
>    2. moved known flaky tests to the flaky suite
>    3. optimized the runner consumption for docs-only and cpp-only pulls
> 4. We reactivated the required checks.
>
>
> Now it's possible to come back to normal life.
> 1. You must rebase your branch to the latest master (there's a button for
> you in the UI) or eventually you can close/reopen the pull to trigger the
> checks
> 2. You can merge a pull request if you want
> 3. You will find a new job in the Checks section called "Pulsar CI / Pulsar
> CI checks completed" that indicates the Pulsar CI successfully passed
>
> There's a slight chance that the CI will be stuck again in the next few
> days but we will take it monitored.
>
> Thanks Lari for the nice work!
>
> Regards,
> Nicolò Boschi
>
>
> Il giorno gio 8 set 2022 alle ore 10:55 Lari Hotari <lh...@apache.org>
> ha
> scritto:
>
> > Thank you Nicolo.
> > There's lazy consensus, let's go forward with the action plan.
> >
> > -Lari
> >
> > On 2022/09/08 08:16:05 Nicolò Boschi wrote:
> > > This is the pull for step 2.
> https://github.com/apache/pulsar/pull/17539
> > >
> > > This is the script I'm going to use to cancel pending workflows.
> > >
> >
> https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js
> > >
> > > I'm going to run the script in minutes.
> > >
> > > I advertised on Slack about what is happening:
> > >
> >
> https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E
> > >
> > > >we’re going to execute the plan described in the ML. So any queued
> > actions
> > > will be cancelled. In order to validate your pull it is suggested to
> run
> > > the actions in your own Pulsar fork. Please don’t re-run failed jobs or
> > > push any other commits to avoid triggering new actions
> > >
> > >
> > > Nicolò Boschi
> > >
> > >
> > > Il giorno gio 8 set 2022 alle ore 09:42 Nicolò Boschi <
> > boschi1997@gmail.com>
> > > ha scritto:
> > >
> > > > Thanks Lari for the detailed explanation. This is kind of an
> emergency
> > > > situation and I believe your plan is the way to go now.
> > > >
> > > > I already prepared a pull for moving the flaky suite out of the
> Pulsar
> > CI
> > > > workflow: https://github.com/nicoloboschi/pulsar/pull/8
> > > > I can take care of the execution of the plan.
> > > >
> > > > > 1. Cancel all existing builds in_progress or queued
> > > >
> > > > I have a script locally that uses GHA to check and cancel the pending
> > > > runs. We can extend it to all the queued builds (will share it soon).
> > > >
> > > > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > merging
> > > > PRs.
> > > > > 3. Wait for build to run for .asf.yaml change, merge it
> > > >
> > > > After the pull is out, we'll need to cancel all other workflows that
> > > > contributors may inadvertently have triggered.
> > > >
> > > > > 4. Disable all workflows
> > > > > 5. Process specific PRs manually to improve the situation.
> > > > >    - Make GHA workflow improvements such as
> > > > https://github.com/apache/pulsar/pull/17491 and
> > > > https://github.com/apache/pulsar/pull/17490
> > > > >    - Quarantine all very flaky tests so that everyone doesn't waste
> > time
> > > > with those. It should be possible to merge a PR even when a
> quarantined
> > > > test fails.
> > > >
> > > > in this step we will merge this
> > > > https://github.com/nicoloboschi/pulsar/pull/8
> > > >
> > > > I want to add to the list this improvement to reduce runners usage in
> > case
> > > > of doc or cpp changes.
> > > > https://github.com/nicoloboschi/pulsar/pull/7
> > > >
> > > >
> > > > > 6. Rebase PRs (or close and re-open) that would be processed next
> so
> > > > that changes are picked up
> > > >
> > > > It's better to leave this task to the author of the pull in order to
> > not
> > > > create too much load at the same time
> > > >
> > > > > 7. Enable workflows
> > > > > 8. Start processing PRs with checks to see if things are handled
> in a
> > > > better way.
> > > > > 9. When things are stable, enable required checks again in
> > .asf.yaml, in
> > > > the meantime be careful about merging PRs
> > > > > 10. Fix quarantined flaky tests
> > > >
> > > >
> > > > Nicolò Boschi
> > > >
> > > >
> > > > Il giorno gio 8 set 2022 alle ore 09:27 Lari Hotari <
> > lhotari@apache.org>
> > > > ha scritto:
> > > >
> > > >> If my assumption of the GitHub usage metrics bug in the GitHub
> Actions
> > > >> build job queue fairness algorithm is correct, what would help is
> > running
> > > >> the flaky unit test group outside of Pulsar CI workflow. In that
> > case, the
> > > >> impact of the usage metrics would be limited.
> > > >>
> > > >> The example of
> > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> shows
> > > >> this flaw as explained in the previous email. The total reported
> > execution
> > > >> time in that report is 1d 1h 40m 21s of usage and the actual usage
> is
> > about
> > > >> 1/3 of this.
> > > >>
> > > >> When we move the most commonly failing job out of Pulsar CI
> workflow,
> > the
> > > >> impact of the possible usage metrics bug would be much less. I hope
> > GitHub
> > > >> support responds to my issue and queries about this bug. It might
> > take up
> > > >> to 7 days to get a reply and for technical questions more time. In
> the
> > > >> meantime we need a solution for getting over this CI slowness issue.
> > > >>
> > > >> -Lari
> > > >>
> > > >>
> > > >>
> > > >> On 2022/09/08 06:34:42 Lari Hotari wrote:
> > > >> > My current assumption of the CI slowness problem is that the usage
> > > >> metrics for Apache Pulsar builds on GitHub side is done incorrectly
> > and
> > > >> that is resulting in apache/pulsar builds getting throttled. This
> > > >> assumption might be wrong, but it's the best guess at the moment.
> > > >> >
> > > >> > The facts that support this assumption is that when re-running
> > failed
> > > >> jobs in a workflow, the execution times for previously successful
> > jobs get
> > > >> counted as if they have all run:
> > > >> > Here's an example:
> > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > >> > The reported total usage is about 3x than the actual usage.
> > > >> >
> > > >> > The assumption that I have is that the "fairness algorithm" that
> > GitHub
> > > >> uses to provide all Apache projects about the same amount of GitHub
> > Actions
> > > >> resources would take this flawed usage as the basis of it's
> decisions
> > and
> > > >> it decides to throttle apache/pulsar builds.
> > > >> >
> > > >> > The reason why we are getting hit by this now is that there is a
> > high
> > > >> number of flaky test failures that cause almost every build to fail
> > and we
> > > >> have been re-running a lot of builds.
> > > >> >
> > > >> > The other fact to support the theory of flawed usage metrics used
> in
> > > >> the fairness algorithm is that other Apache projects aren't
> reporting
> > > >> issues about GitHub Actions slowness. This is mentioned in Jarek
> > Potiuk's
> > > >> comments on INFRA-23633 [1]:
> > > >> > > Unlike the case 2 years ago, the problem is not affecting all
> > > >> projects. In Apache Airflow we do > not see any particular slow-down
> > with
> > > >> Public Runners at this moment (just checked - >
> > > >> > > everything is "as usual").. So I'd say it is something specific
> to
> > > >> Pulsar not to "ASF" as a whole.
> > > >> >
> > > >> > There are also other comments from Jarek about the GitHub
> "fairness
> > > >> algorithm" (comment [2], other comment [3])
> > > >> > > But I believe the current problem is different - it might be
> > (looking
> > > >> at your jobs) simply a bug
> > > >> > > in GA that you hit or indeed your demands are simply too high.
> > > >> >
> > > >> > I have opened tickets (2 tickets: 2 days ago and yesterday) to
> > > >> support.github.com and there hasn't been any response to the
> ticket.
> > It
> > > >> might take up to 7 days to get a response. We cannot rely on GitHub
> > Support
> > > >> resolving this issue.
> > > >> >
> > > >> > I propose that we go ahead with the previously suggested action
> plan
> > > >> > > One possible way forward:
> > > >> > > 1. Cancel all existing builds in_progress or queued
> > > >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > > >> merging PRs.
> > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > >> > > 4. Disable all workflows
> > > >> > > 5. Process specific PRs manually to improve the situation.
> > > >> > >    - Make GHA workflow improvements such as
> > > >> https://github.com/apache/pulsar/pull/17491 and
> > > >> https://github.com/apache/pulsar/pull/17490
> > > >> > >    - Quarantine all very flaky tests so that everyone doesn't
> > waste
> > > >> time with those. It should be possible to merge a PR even when a
> > > >> quarantined test fails.
> > > >> > > 6. Rebase PRs (or close and re-open) that would be processed
> next
> > so
> > > >> that changes are picked up
> > > >> > > 7. Enable workflows
> > > >> > > 8. Start processing PRs with checks to see if things are handled
> > in a
> > > >> better way.
> > > >> > > 9. When things are stable, enable required checks again in
> > .asf.yaml,
> > > >> in the meantime be careful about merging PRs
> > > >> > > 10. Fix quarantined flaky tests
> > > >> >
> > > >> > To clarify, steps 1-6 would be done optimally in 1 day and we
> would
> > > >> stop processing ordinary PRs during this time. We would only handle
> > PRs
> > > >> that fix the CI situation during this exceptional period.
> > > >> >
> > > >> > -Lari
> > > >> >
> > > >> > Links to Jarek's comments:
> > > >> > [1]
> > > >>
> >
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> > > >> > [2]
> > > >>
> >
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > >> > [3]
> > > >>
> >
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > > >> >
> > > >> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> > > >> > > One possible way forward:
> > > >> > > 1. Cancel all existing builds in_progress or queued
> > > >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > > >> merging PRs.
> > > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > >> > > 4. Disable all workflows
> > > >> > > 5. Process specific PRs manually to improve the situation.
> > > >> > >    - Make GHA workflow improvements such as
> > > >> https://github.com/apache/pulsar/pull/17491 and
> > > >> https://github.com/apache/pulsar/pull/17490
> > > >> > >    - Quarantine all very flaky tests so that everyone doesn't
> > waste
> > > >> time with those. It should be possible to merge a PR even when a
> > > >> quarantined test fails.
> > > >> > > 6. Rebase PRs (or close and re-open) that would be processed
> next
> > so
> > > >> that changes are picked up
> > > >> > > 7. Enable workflows
> > > >> > > 8. Start processing PRs with checks to see if things are handled
> > in a
> > > >> better way.
> > > >> > > 9. When things are stable, enable required checks again in
> > .asf.yaml,
> > > >> in the meantime be careful about merging PRs
> > > >> > > 10. Fix quarantined flaky tests
> > > >> > >
> > > >> > > -Lari
> > > >> > >
> > > >> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > >> > > > The problem with CI is becoming worse. The build queue is 235
> > jobs
> > > >> now and the queue time is over 7 hours.
> > > >> > > >
> > > >> > > > We will need to start shedding load in the build queue and get
> > some
> > > >> fixes in.
> > > >> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues
> to
> > > >> contain details about some activities. I have created 2 GitHub
> Support
> > > >> tickets, but usually it takes up to a week to get a response.
> > > >> > > >
> > > >> > > > I have some assumptions about the issue, but they are just
> > > >> assumptions.
> > > >> > > > One oddity is that when re-running failed jobs is used in a
> > large
> > > >> workflow, the execution times for previously successful jobs get
> > counted as
> > > >> if they have run.
> > > >> > > > Here's an example:
> > > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > >> > > > The reported usage is about 3x than the actual usage.
> > > >> > > > The assumption that I have is that the "fairness algorithm"
> that
> > > >> GitHub uses to provide all Apache projects about the same amount of
> > GitHub
> > > >> Actions resources would take this flawed usage as the basis of it's
> > > >> decisions.
> > > >> > > > The reason why we are getting hit by this now is that there
> is a
> > > >> high number of flaky test failures that cause almost every build to
> > fail
> > > >> and we are re-running a lot of builds.
> > > >> > > >
> > > >> > > > Another problem there is that the GitHub Actions search
> doesn't
> > > >> always show all workflow runs that are running. This has happened
> > before
> > > >> when the GitHub Actions workflow search index was corrupted. GitHub
> > Support
> > > >> resolved that by rebuilding the search index with some manual admin
> > > >> operation behind the scenes.
> > > >> > > >
> > > >> > > > I'm proposing that we start shedding load from CI by
> cancelling
> > > >> build jobs and selecting which jobs to process so that we get the CI
> > issue
> > > >> resolved. We might also have to disable required checks so that we
> > have
> > > >> some way to get changes merged while CI doesn't work properly.
> > > >> > > >
> > > >> > > > I'm expecting lazy consensus on fixing CI unless someone
> > proposes a
> > > >> better plan. Let's keep everyone informed in this mailing list
> thread.
> > > >> > > >
> > > >> > > > -Lari
> > > >> > > >
> > > >> > > >
> > > >> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > >> > > > > We are going to need to take actions to fix our problems.
> See
> > > >>
> >
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > >> > > > >
> > > >> > > > > Jarek has done a large amount of GitHub Action work with
> > Apache
> > > >> Airflow and his suggestions might be helpful. One of his suggestions
> > was
> > > >> Apache Yetus. I think he means using the Maven plugins -
> > > >> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > >> > > > >
> > > >> > > > >
> > > >> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <
> lhotari@apache.org
> > >
> > > >> wrote:
> > > >> > > > > >
> > > >> > > > > > The Apache Infra ticket is
> > > >> https://issues.apache.org/jira/browse/INFRA-23633 .
> > > >> > > > > >
> > > >> > > > > > -Lari
> > > >> > > > > >
> > > >> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > >> > > > > >> I asked for an update on the Apache org GitHub Actions
> > usage
> > > >> stats from Gavin McDonald on the-asf slack in this thread:
> > > >>
> >
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > > >> .
> > > >> > > > > >>
> > > >> > > > > >> I hope we get this issue resolved since it delays PR
> > > >> processing a lot.
> > > >> > > > > >>
> > > >> > > > > >> -Lari
> > > >> > > > > >>
> > > >> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > >> > > > > >>> Pulsar CI continues to be congested, and the build queue
> > [1]
> > > >> is very long at the moment. There are 147 build jobs in the queue
> and
> > 16
> > > >> jobs in progress at the moment.
> > > >> > > > > >>>
> > > >> > > > > >>> I would strongly advice everyone to use "personal CI" to
> > > >> mitigate the issue of the long delay of CI feedback. You can simply
> > open a
> > > >> PR to your own personal fork of apache/pulsar to run the builds in
> > your
> > > >> "personal CI". There's more details in the previous emails in this
> > thread.
> > > >> > > > > >>>
> > > >> > > > > >>> -Lari
> > > >> > > > > >>>
> > > >> > > > > >>> [1] - build queue:
> > > >> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > >> > > > > >>>
> > > >> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > >> > > > > >>>> Pulsar CI continues to be congested, and the build
> queue
> > is
> > > >> long.
> > > >> > > > > >>>>
> > > >> > > > > >>>> I would strongly advice everyone to use "personal CI"
> to
> > > >> mitigate the issue of the long delay of CI feedback. You can simply
> > open a
> > > >> PR to your own personal fork of apache/pulsar to run the builds in
> > your
> > > >> "personal CI". There's more details in the previous email in this
> > thread.
> > > >> > > > > >>>>
> > > >> > > > > >>>> Some updates:
> > > >> > > > > >>>>
> > > >> > > > > >>>> There has been a discussion with Gavin McDonald from
> ASF
> > > >> infra on the-asf slack about getting usage reports from GitHub to
> > support
> > > >> the investigation. Slack thread is the same one mentioned in the
> > previous
> > > >> email,
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279
> > .
> > > >> Gavin already requested the usage report in GitHub UI, but it
> produced
> > > >> invalid results.
> > > >> > > > > >>>>
> > > >> > > > > >>>> I made a change to mitigate a source of additional
> GitHub
> > > >> Actions overhead.
> > > >> > > > > >>>> In the past, each cherry-picked commit to a maintenance
> > > >> branch of Pulsar has triggered a lot of workflow runs.
> > > >> > > > > >>>>
> > > >> > > > > >>>> The solution for cancelling duplicate builds
> > automatically
> > > >> is to add this definition to the workflow definition:
> > > >> > > > > >>>> concurrency:
> > > >> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > >> > > > > >>>>  cancel-in-progress: true
> > > >> > > > > >>>>
> > > >> > > > > >>>> I added this to all maintenance branch GitHub Actions
> > > >> workflows:
> > > >> > > > > >>>>
> > > >> > > > > >>>> branch-2.10 change:
> > > >> > > > > >>>>
> > > >>
> >
> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > >> > > > > >>>> branch-2.9 change:
> > > >> > > > > >>>>
> > > >>
> >
> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > >> > > > > >>>> branch-2.8 change:
> > > >> > > > > >>>>
> > > >>
> >
> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > >> > > > > >>>> branch-2.7:
> > > >> > > > > >>>>
> > > >>
> >
> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > >> > > > > >>>>
> > > >> > > > > >>>> branch-2.11 already contains the necessary config for
> > > >> cancelling duplicate builds.
> > > >> > > > > >>>>
> > > >> > > > > >>>> The benefit of the above change is that when multiple
> > > >> commits are cherry-picked to a branch at once, only the build of the
> > last
> > > >> commit will get run eventually. The builds for the intermediate
> > commits
> > > >> will get cancelled. Obviously there's a tradeoff here that we don't
> > get the
> > > >> information if one of the earlier commits breaks the build. It's the
> > cost
> > > >> that we need to pay. Nevertheless our build is so flaky that it's
> > hard to
> > > >> determine whether a failed build result is only caused by bad flaky
> > test or
> > > >> whether it's an actual failure. Because of this we don't lose
> > anything by
> > > >> cancelling builds. It's more important to save build resources. In
> the
> > > >> maintenance branches for 2.10 and older, the average total build
> time
> > > >> consumed is around 20 hours which is a lot.
> > > >> > > > > >>>>
> > > >> > > > > >>>> At this time, the overhead of maintenance branch builds
> > > >> doesn't seem to be the source of the problems. There must be some
> > other
> > > >> issue which is possibly related to exceeding a usage quota.
> Hopefully
> > we
> > > >> get the CI slowness issue solved asap.
> > > >> > > > > >>>>
> > > >> > > > > >>>> BR,
> > > >> > > > > >>>>
> > > >> > > > > >>>> Lari
> > > >> > > > > >>>>
> > > >> > > > > >>>>
> > > >> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > >> > > > > >>>>> Hi,
> > > >> > > > > >>>>>
> > > >> > > > > >>>>> GitHub Actions builds have been piling up in the build
> > > >> queue in the last few days.
> > > >> > > > > >>>>> I posted on builds@apache.org
> > > >> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s
> and
> > > >> created INFRA ticket
> > https://issues.apache.org/jira/browse/INFRA-23633
> > > >> about this issue.
> > > >> > > > > >>>>> There's also a thread on the-asf slack,
> > > >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > >> > > > > >>>>>
> > > >> > > > > >>>>> It seems that our build queue is finally getting
> picked
> > up,
> > > >> but it would be great to see if we hit quota and whether that is the
> > cause
> > > >> of pauses.
> > > >> > > > > >>>>>
> > > >> > > > > >>>>> Another issue is that the master branch broke after
> > merging
> > > >> 2 conflicting PRs.
> > > >> > > > > >>>>> The fix is in
> > https://github.com/apache/pulsar/pull/17300
> > > >> .
> > > >> > > > > >>>>>
> > > >> > > > > >>>>> Merging PRs will be slow until we have these 2
> problems
> > > >> solved and existing PRs rebased over the changes. Let's prioritize
> > merging
> > > >> #17300 before pushing more changes.
> > > >> > > > > >>>>>
> > > >> > > > > >>>>> I'd like to point out that a good way to get build
> > feedback
> > > >> before sending a PR, is to run builds on your personal GitHub
> Actions
> > CI.
> > > >> The benefit of this is that it doesn't consume the shared quota and
> > builds
> > > >> usually start instantly.
> > > >> > > > > >>>>> There are instructions in the contributors guide about
> > > >> this.
> > > >> > > > > >>>>>
> > > >> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > >> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar
> to
> > > >> run builds on your personal GitHub Actions CI.
> > > >> > > > > >>>>>
> > > >> > > > > >>>>> BR,
> > > >> > > > > >>>>>
> > > >> > > > > >>>>> Lari
> > > >> > > > > >>>>>
> > > >> > > > > >>>>>
> > > >> > > > > >>>>>
> > > >> > > > > >>>>>
> > > >> > > > > >>>>>
> > > >> > > > > >>>>>
> > > >> > > > > >>>>>
> > > >> > > > > >>>>>
> > > >> > > > > >>>>
> > > >> > > > > >>>
> > > >> > > > > >>
> > > >> > > > >
> > > >> > > > >
> > > >> > > >
> > > >> > >
> > > >> >
> > > >>
> > > >
> > >
> >
>

Re: Pulsar CI congested, master branch build broken

Posted by Nicolò Boschi <bo...@gmail.com>.
Dear community,

The plan has been executed.
The summary of our actions is:
1. We cancelled all pending jobs (queue and in-progress)
2. We removed the required checks to be able to merge improvements to the
CI workflow
3. We merged a couple of improvements:
   1. worked around the possible bug triggered by job retries. The broker
flaky tests now run in a dedicated workflow (see the workflow sketch below)
   2. moved known flaky tests to the flaky suite
   3. optimized the runner consumption for docs-only and cpp-only pulls (see
the path-filter sketch below)
4. We reactivated the required checks.

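To make the first point concrete, here is a minimal sketch of what a dedicated
flaky-test workflow with automatic cancellation of duplicate runs could look
like, reusing the concurrency/cancel-in-progress idea quoted further down in
this thread. The file path, workflow name and Maven command are illustrative
assumptions only, not the exact configuration that was merged into
apache/pulsar:

# .github/workflows/pulsar-ci-flaky.yaml (hypothetical path and names)
name: Pulsar CI Flaky

on:
  pull_request:
    branches:
      - master

concurrency:
  # one group per workflow and ref, so a newer push cancels the
  # older queued or in-progress run for the same PR/branch
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true

jobs:
  flaky-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Run the flaky test group (illustrative command only)
        run: mvn -B -ntp test -pl pulsar-broker -Dgroups=flaky

With something like this in place, pushing a new commit to the same pull
request cancels the older flaky-test run instead of letting both consume
runners.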

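And a rough sketch of the kind of path filtering that keeps docs-only pulls
from occupying the full set of runners. The path patterns below are
assumptions for illustration, not the exact filters that were merged:

# hypothetical trigger section of a heavy CI workflow:
# skip the expensive jobs when only documentation changed
on:
  pull_request:
    branches:
      - master
    paths-ignore:
      - 'site2/**'
      - '**/*.md'
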
Now it's possible to come back to normal life.
1. You must rebase your branch onto the latest master (there's a button for
you in the UI), or alternatively you can close/reopen the pull to trigger the
checks
2. You can merge a pull request if you want
3. You will find a new job in the Checks section called "Pulsar CI / Pulsar
CI checks completed" that indicates that Pulsar CI passed successfully (a
rough .asf.yaml sketch of the required-check setup follows below)

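For anyone curious how a required check is declared, this is roughly what the
.asf.yaml branch protection section looks like. Treat it as a sketch only; the
actual apache/pulsar settings may differ:

# .asf.yaml (sketch)
github:
  protected_branches:
    master:
      required_status_checks:
        # the context must match the check name shown in the GitHub Checks UI
        contexts:
          - Pulsar CI / Pulsar CI checks completed
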
There's a slight chance that the CI will get stuck again in the next few
days, but we will keep monitoring it.

Thanks Lari for the nice work!

Regards,
Nicolò Boschi


Il giorno gio 8 set 2022 alle ore 10:55 Lari Hotari <lh...@apache.org> ha
scritto:

> Thank you Nicolo.
> There's lazy consensus, let's go forward with the action plan.
>
> -Lari
>
> On 2022/09/08 08:16:05 Nicolò Boschi wrote:
> > This is the pull for step 2. https://github.com/apache/pulsar/pull/17539
> >
> > This is the script I'm going to use to cancel pending workflows.
> >
> https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js
> >
> > I'm going to run the script in minutes.
> >
> > I advertised on Slack about what is happening:
> >
> https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E
> >
> > >we’re going to execute the plan described in the ML. So any queued
> actions
> > will be cancelled. In order to validate your pull it is suggested to run
> > the actions in your own Pulsar fork. Please don’t re-run failed jobs or
> > push any other commits to avoid triggering new actions
> >
> >
> > Nicolò Boschi
> >
> >
> > Il giorno gio 8 set 2022 alle ore 09:42 Nicolò Boschi <
> boschi1997@gmail.com>
> > ha scritto:
> >
> > > Thanks Lari for the detailed explanation. This is kind of an emergency
> > > situation and I believe your plan is the way to go now.
> > >
> > > I already prepared a pull for moving the flaky suite out of the Pulsar
> CI
> > > workflow: https://github.com/nicoloboschi/pulsar/pull/8
> > > I can take care of the execution of the plan.
> > >
> > > > 1. Cancel all existing builds in_progress or queued
> > >
> > > I have a script locally that uses GHA to check and cancel the pending
> > > runs. We can extend it to all the queued builds (will share it soon).
> > >
> > > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> merging
> > > PRs.
> > > > 3. Wait for build to run for .asf.yaml change, merge it
> > >
> > > After the pull is out, we'll need to cancel all other workflows that
> > > contributors may inadvertently have triggered.
> > >
> > > > 4. Disable all workflows
> > > > 5. Process specific PRs manually to improve the situation.
> > > >    - Make GHA workflow improvements such as
> > > https://github.com/apache/pulsar/pull/17491 and
> > > https://github.com/apache/pulsar/pull/17490
> > > >    - Quarantine all very flaky tests so that everyone doesn't waste
> time
> > > with those. It should be possible to merge a PR even when a quarantined
> > > test fails.
> > >
> > > in this step we will merge this
> > > https://github.com/nicoloboschi/pulsar/pull/8
> > >
> > > I want to add to the list this improvement to reduce runners usage in
> case
> > > of doc or cpp changes.
> > > https://github.com/nicoloboschi/pulsar/pull/7
> > >
> > >
> > > > 6. Rebase PRs (or close and re-open) that would be processed next so
> > > that changes are picked up
> > >
> > > It's better to leave this task to the author of the pull in order to
> not
> > > create too much load at the same time
> > >
> > > > 7. Enable workflows
> > > > 8. Start processing PRs with checks to see if things are handled in a
> > > better way.
> > > > 9. When things are stable, enable required checks again in
> .asf.yaml, in
> > > the meantime be careful about merging PRs
> > > > 10. Fix quarantined flaky tests
> > >
> > >
> > > Nicolò Boschi
> > >
> > >
> > > Il giorno gio 8 set 2022 alle ore 09:27 Lari Hotari <
> lhotari@apache.org>
> > > ha scritto:
> > >
> > >> If my assumption of the GitHub usage metrics bug in the GitHub Actions
> > >> build job queue fairness algorithm is correct, what would help is
> running
> > >> the flaky unit test group outside of Pulsar CI workflow. In that
> case, the
> > >> impact of the usage metrics would be limited.
> > >>
> > >> The example of
> > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage shows
> > >> this flaw as explained in the previous email. The total reported
> execution
> > >> time in that report is 1d 1h 40m 21s of usage and the actual usage is
> about
> > >> 1/3 of this.
> > >>
> > >> When we move the most commonly failing job out of Pulsar CI workflow,
> the
> > >> impact of the possible usage metrics bug would be much less. I hope
> GitHub
> > >> support responds to my issue and queries about this bug. It might
> take up
> > >> to 7 days to get a reply and for technical questions more time. In the
> > >> meantime we need a solution for getting over this CI slowness issue.
> > >>
> > >> -Lari
> > >>
> > >>
> > >>
> > >> On 2022/09/08 06:34:42 Lari Hotari wrote:
> > >> > My current assumption of the CI slowness problem is that the usage
> > >> metrics for Apache Pulsar builds on GitHub side is done incorrectly
> and
> > >> that is resulting in apache/pulsar builds getting throttled. This
> > >> assumption might be wrong, but it's the best guess at the moment.
> > >> >
> > >> > The facts that support this assumption is that when re-running
> failed
> > >> jobs in a workflow, the execution times for previously successful
> jobs get
> > >> counted as if they have all run:
> > >> > Here's an example:
> > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > >> > The reported total usage is about 3x than the actual usage.
> > >> >
> > >> > The assumption that I have is that the "fairness algorithm" that
> GitHub
> > >> uses to provide all Apache projects about the same amount of GitHub
> Actions
> > >> resources would take this flawed usage as the basis of it's decisions
> and
> > >> it decides to throttle apache/pulsar builds.
> > >> >
> > >> > The reason why we are getting hit by this now is that there is a
> high
> > >> number of flaky test failures that cause almost every build to fail
> and we
> > >> have been re-running a lot of builds.
> > >> >
> > >> > The other fact to support the theory of flawed usage metrics used in
> > >> the fairness algorithm is that other Apache projects aren't reporting
> > >> issues about GitHub Actions slowness. This is mentioned in Jarek
> Potiuk's
> > >> comments on INFRA-23633 [1]:
> > >> > > Unlike the case 2 years ago, the problem is not affecting all
> > >> projects. In Apache Airflow we do > not see any particular slow-down
> with
> > >> Public Runners at this moment (just checked - >
> > >> > > everything is "as usual").. So I'd say it is something specific to
> > >> Pulsar not to "ASF" as a whole.
> > >> >
> > >> > There are also other comments from Jarek about the GitHub "fairness
> > >> algorithm" (comment [2], other comment [3])
> > >> > > But I believe the current problem is different - it might be
> (looking
> > >> at your jobs) simply a bug
> > >> > > in GA that you hit or indeed your demands are simply too high.
> > >> >
> > >> > I have opened tickets (2 tickets: 2 days ago and yesterday) to
> > >> support.github.com and there hasn't been any response to the ticket.
> It
> > >> might take up to 7 days to get a response. We cannot rely on GitHub
> Support
> > >> resolving this issue.
> > >> >
> > >> > I propose that we go ahead with the previously suggested action plan
> > >> > > One possible way forward:
> > >> > > 1. Cancel all existing builds in_progress or queued
> > >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > >> merging PRs.
> > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > >> > > 4. Disable all workflows
> > >> > > 5. Process specific PRs manually to improve the situation.
> > >> > >    - Make GHA workflow improvements such as
> > >> https://github.com/apache/pulsar/pull/17491 and
> > >> https://github.com/apache/pulsar/pull/17490
> > >> > >    - Quarantine all very flaky tests so that everyone doesn't
> waste
> > >> time with those. It should be possible to merge a PR even when a
> > >> quarantined test fails.
> > >> > > 6. Rebase PRs (or close and re-open) that would be processed next
> so
> > >> that changes are picked up
> > >> > > 7. Enable workflows
> > >> > > 8. Start processing PRs with checks to see if things are handled
> in a
> > >> better way.
> > >> > > 9. When things are stable, enable required checks again in
> .asf.yaml,
> > >> in the meantime be careful about merging PRs
> > >> > > 10. Fix quarantined flaky tests
> > >> >
> > >> > To clarify, steps 1-6 would be done optimally in 1 day and we would
> > >> stop processing ordinary PRs during this time. We would only handle
> PRs
> > >> that fix the CI situation during this exceptional period.
> > >> >
> > >> > -Lari
> > >> >
> > >> > Links to Jarek's comments:
> > >> > [1]
> > >>
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> > >> > [2]
> > >>
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > >> > [3]
> > >>
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > >> >
> > >> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> > >> > > One possible way forward:
> > >> > > 1. Cancel all existing builds in_progress or queued
> > >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> > >> merging PRs.
> > >> > > 3. Wait for build to run for .asf.yaml change, merge it
> > >> > > 4. Disable all workflows
> > >> > > 5. Process specific PRs manually to improve the situation.
> > >> > >    - Make GHA workflow improvements such as
> > >> https://github.com/apache/pulsar/pull/17491 and
> > >> https://github.com/apache/pulsar/pull/17490
> > >> > >    - Quarantine all very flaky tests so that everyone doesn't
> waste
> > >> time with those. It should be possible to merge a PR even when a
> > >> quarantined test fails.
> > >> > > 6. Rebase PRs (or close and re-open) that would be processed next
> so
> > >> that changes are picked up
> > >> > > 7. Enable workflows
> > >> > > 8. Start processing PRs with checks to see if things are handled
> in a
> > >> better way.
> > >> > > 9. When things are stable, enable required checks again in
> .asf.yaml,
> > >> in the meantime be careful about merging PRs
> > >> > > 10. Fix quarantined flaky tests
> > >> > >
> > >> > > -Lari
> > >> > >
> > >> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > >> > > > The problem with CI is becoming worse. The build queue is 235
> jobs
> > >> now and the queue time is over 7 hours.
> > >> > > >
> > >> > > > We will need to start shedding load in the build queue and get
> some
> > >> fixes in.
> > >> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues to
> > >> contain details about some activities. I have created 2 GitHub Support
> > >> tickets, but usually it takes up to a week to get a response.
> > >> > > >
> > >> > > > I have some assumptions about the issue, but they are just
> > >> assumptions.
> > >> > > > One oddity is that when re-running failed jobs is used in a
> large
> > >> workflow, the execution times for previously successful jobs get
> counted as
> > >> if they have run.
> > >> > > > Here's an example:
> > >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > >> > > > The reported usage is about 3x than the actual usage.
> > >> > > > The assumption that I have is that the "fairness algorithm" that
> > >> GitHub uses to provide all Apache projects about the same amount of
> GitHub
> > >> Actions resources would take this flawed usage as the basis of it's
> > >> decisions.
> > >> > > > The reason why we are getting hit by this now is that there is a
> > >> high number of flaky test failures that cause almost every build to
> fail
> > >> and we are re-running a lot of builds.
> > >> > > >
> > >> > > > Another problem there is that the GitHub Actions search doesn't
> > >> always show all workflow runs that are running. This has happened
> before
> > >> when the GitHub Actions workflow search index was corrupted. GitHub
> Support
> > >> resolved that by rebuilding the search index with some manual admin
> > >> operation behind the scenes.
> > >> > > >
> > >> > > > I'm proposing that we start shedding load from CI by cancelling
> > >> build jobs and selecting which jobs to process so that we get the CI
> issue
> > >> resolved. We might also have to disable required checks so that we
> have
> > >> some way to get changes merged while CI doesn't work properly.
> > >> > > >
> > >> > > > I'm expecting lazy consensus on fixing CI unless someone
> proposes a
> > >> better plan. Let's keep everyone informed in this mailing list thread.
> > >> > > >
> > >> > > > -Lari
> > >> > > >
> > >> > > >
> > >> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > >> > > > > We are going to need to take actions to fix our problems. See
> > >>
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > >> > > > >
> > >> > > > > Jarek has done a large amount of GitHub Action work with
> Apache
> > >> Airflow and his suggestions might be helpful. One of his suggestions
> was
> > >> Apache Yetus. I think he means using the Maven plugins -
> > >> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > >> > > > >
> > >> > > > >
> > >> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lhotari@apache.org
> >
> > >> wrote:
> > >> > > > > >
> > >> > > > > > The Apache Infra ticket is
> > >> https://issues.apache.org/jira/browse/INFRA-23633 .
> > >> > > > > >
> > >> > > > > > -Lari
> > >> > > > > >
> > >> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > >> > > > > >> I asked for an update on the Apache org GitHub Actions
> usage
> > >> stats from Gavin McDonald on the-asf slack in this thread:
> > >>
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > >> .
> > >> > > > > >>
> > >> > > > > >> I hope we get this issue resolved since it delays PR
> > >> processing a lot.
> > >> > > > > >>
> > >> > > > > >> -Lari
> > >> > > > > >>
> > >> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > >> > > > > >>> Pulsar CI continues to be congested, and the build queue
> [1]
> > >> is very long at the moment. There are 147 build jobs in the queue and
> 16
> > >> jobs in progress at the moment.
> > >> > > > > >>>
> > >> > > > > >>> I would strongly advice everyone to use "personal CI" to
> > >> mitigate the issue of the long delay of CI feedback. You can simply
> open a
> > >> PR to your own personal fork of apache/pulsar to run the builds in
> your
> > >> "personal CI". There's more details in the previous emails in this
> thread.
> > >> > > > > >>>
> > >> > > > > >>> -Lari
> > >> > > > > >>>
> > >> > > > > >>> [1] - build queue:
> > >> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > >> > > > > >>>
> > >> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > >> > > > > >>>> Pulsar CI continues to be congested, and the build queue
> is
> > >> long.
> > >> > > > > >>>>
> > >> > > > > >>>> I would strongly advice everyone to use "personal CI" to
> > >> mitigate the issue of the long delay of CI feedback. You can simply
> open a
> > >> PR to your own personal fork of apache/pulsar to run the builds in
> your
> > >> "personal CI". There's more details in the previous email in this
> thread.
> > >> > > > > >>>>
> > >> > > > > >>>> Some updates:
> > >> > > > > >>>>
> > >> > > > > >>>> There has been a discussion with Gavin McDonald from ASF
> > >> infra on the-asf slack about getting usage reports from GitHub to
> support
> > >> the investigation. Slack thread is the same one mentioned in the
> previous
> > >> email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279
> .
> > >> Gavin already requested the usage report in GitHub UI, but it produced
> > >> invalid results.
> > >> > > > > >>>>
> > >> > > > > >>>> I made a change to mitigate a source of additional GitHub
> > >> Actions overhead.
> > >> > > > > >>>> In the past, each cherry-picked commit to a maintenance
> > >> branch of Pulsar has triggered a lot of workflow runs.
> > >> > > > > >>>>
> > >> > > > > >>>> The solution for cancelling duplicate builds
> automatically
> > >> is to add this definition to the workflow definition:
> > >> > > > > >>>> concurrency:
> > >> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > >> > > > > >>>>  cancel-in-progress: true
> > >> > > > > >>>>
> > >> > > > > >>>> I added this to all maintenance branch GitHub Actions
> > >> workflows:
> > >> > > > > >>>>
> > >> > > > > >>>> branch-2.10 change:
> > >> > > > > >>>>
> > >>
> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > >> > > > > >>>> branch-2.9 change:
> > >> > > > > >>>>
> > >>
> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > >> > > > > >>>> branch-2.8 change:
> > >> > > > > >>>>
> > >>
> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > >> > > > > >>>> branch-2.7:
> > >> > > > > >>>>
> > >>
> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > >> > > > > >>>>
> > >> > > > > >>>> branch-2.11 already contains the necessary config for
> > >> cancelling duplicate builds.
> > >> > > > > >>>>
> > >> > > > > >>>> The benefit of the above change is that when multiple
> > >> commits are cherry-picked to a branch at once, only the build of the
> last
> > >> commit will get run eventually. The builds for the intermediate
> commits
> > >> will get cancelled. Obviously there's a tradeoff here that we don't
> get the
> > >> information if one of the earlier commits breaks the build. It's the
> cost
> > >> that we need to pay. Nevertheless our build is so flaky that it's
> hard to
> > >> determine whether a failed build result is only caused by bad flaky
> test or
> > >> whether it's an actual failure. Because of this we don't lose
> anything by
> > >> cancelling builds. It's more important to save build resources. In the
> > >> maintenance branches for 2.10 and older, the average total build time
> > >> consumed is around 20 hours which is a lot.
> > >> > > > > >>>>
> > >> > > > > >>>> At this time, the overhead of maintenance branch builds
> > >> doesn't seem to be the source of the problems. There must be some
> other
> > >> issue which is possibly related to exceeding a usage quota. Hopefully
> we
> > >> get the CI slowness issue solved asap.
> > >> > > > > >>>>
> > >> > > > > >>>> BR,
> > >> > > > > >>>>
> > >> > > > > >>>> Lari
> > >> > > > > >>>>
> > >> > > > > >>>>
> > >> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > >> > > > > >>>>> Hi,
> > >> > > > > >>>>>
> > >> > > > > >>>>> GitHub Actions builds have been piling up in the build
> > >> queue in the last few days.
> > >> > > > > >>>>> I posted on builds@apache.org
> > >> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> > >> created INFRA ticket
> https://issues.apache.org/jira/browse/INFRA-23633
> > >> about this issue.
> > >> > > > > >>>>> There's also a thread on the-asf slack,
> > >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > >> > > > > >>>>>
> > >> > > > > >>>>> It seems that our build queue is finally getting picked
> up,
> > >> but it would be great to see if we hit quota and whether that is the
> cause
> > >> of pauses.
> > >> > > > > >>>>>
> > >> > > > > >>>>> Another issue is that the master branch broke after
> merging
> > >> 2 conflicting PRs.
> > >> > > > > >>>>> The fix is in
> https://github.com/apache/pulsar/pull/17300
> > >> .
> > >> > > > > >>>>>
> > >> > > > > >>>>> Merging PRs will be slow until we have these 2 problems
> > >> solved and existing PRs rebased over the changes. Let's prioritize
> merging
> > >> #17300 before pushing more changes.
> > >> > > > > >>>>>
> > >> > > > > >>>>> I'd like to point out that a good way to get build
> feedback
> > >> before sending a PR, is to run builds on your personal GitHub Actions
> CI.
> > >> The benefit of this is that it doesn't consume the shared quota and
> builds
> > >> usually start instantly.
> > >> > > > > >>>>> There are instructions in the contributors guide about
> > >> this.
> > >> > > > > >>>>>
> > >> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > >> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar to
> > >> run builds on your personal GitHub Actions CI.
> > >> > > > > >>>>>
> > >> > > > > >>>>> BR,
> > >> > > > > >>>>>
> > >> > > > > >>>>> Lari
> > >> > > > > >>>>>
> > >> > > > > >>>>>
> > >> > > > > >>>>>
> > >> > > > > >>>>>
> > >> > > > > >>>>>
> > >> > > > > >>>>>
> > >> > > > > >>>>>
> > >> > > > > >>>>>
> > >> > > > > >>>>
> > >> > > > > >>>
> > >> > > > > >>
> > >> > > > >
> > >> > > > >
> > >> > > >
> > >> > >
> > >> >
> > >>
> > >
> >
>

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
Thank you, Nicolò.
There's lazy consensus; let's go forward with the action plan.

-Lari

On 2022/09/08 08:16:05 Nicolò Boschi wrote:
> This is the pull for step 2. https://github.com/apache/pulsar/pull/17539
> 
> This is the script I'm going to use to cancel pending workflows.
> https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js
> 
> I'm going to run the script in minutes.
> 
> I advertised on Slack about what is happening:
> https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E
> 
> >we’re going to execute the plan described in the ML. So any queued actions
> will be cancelled. In order to validate your pull it is suggested to run
> the actions in your own Pulsar fork. Please don’t re-run failed jobs or
> push any other commits to avoid triggering new actions
> 
> 
> Nicolò Boschi
> 
> 
> Il giorno gio 8 set 2022 alle ore 09:42 Nicolò Boschi <bo...@gmail.com>
> ha scritto:
> 
> > Thanks Lari for the detailed explanation. This is kind of an emergency
> > situation and I believe your plan is the way to go now.
> >
> > I already prepared a pull for moving the flaky suite out of the Pulsar CI
> > workflow: https://github.com/nicoloboschi/pulsar/pull/8
> > I can take care of the execution of the plan.
> >
> > > 1. Cancel all existing builds in_progress or queued
> >
> > I have a script locally that uses GHA to check and cancel the pending
> > runs. We can extend it to all the queued builds (will share it soon).
> >
> > > 2. Edit .asf.yaml and drop the "required checks" requirement for merging
> > PRs.
> > > 3. Wait for build to run for .asf.yaml change, merge it
> >
> > After the pull is out, we'll need to cancel all other workflows that
> > contributors may inadvertently have triggered.
> >
> > > 4. Disable all workflows
> > > 5. Process specific PRs manually to improve the situation.
> > >    - Make GHA workflow improvements such as
> > https://github.com/apache/pulsar/pull/17491 and
> > https://github.com/apache/pulsar/pull/17490
> > >    - Quarantine all very flaky tests so that everyone doesn't waste time
> > with those. It should be possible to merge a PR even when a quarantined
> > test fails.
> >
> > in this step we will merge this
> > https://github.com/nicoloboschi/pulsar/pull/8
> >
> > I want to add to the list this improvement to reduce runners usage in case
> > of doc or cpp changes.
> > https://github.com/nicoloboschi/pulsar/pull/7
> >
> >
> > > 6. Rebase PRs (or close and re-open) that would be processed next so
> > that changes are picked up
> >
> > It's better to leave this task to the author of the pull in order to not
> > create too much load at the same time
> >
> > > 7. Enable workflows
> > > 8. Start processing PRs with checks to see if things are handled in a
> > better way.
> > > 9. When things are stable, enable required checks again in .asf.yaml, in
> > the meantime be careful about merging PRs
> > > 10. Fix quarantined flaky tests
> >
> >
> > Nicolò Boschi
> >
> >
> > Il giorno gio 8 set 2022 alle ore 09:27 Lari Hotari <lh...@apache.org>
> > ha scritto:
> >
> >> If my assumption of the GitHub usage metrics bug in the GitHub Actions
> >> build job queue fairness algorithm is correct, what would help is running
> >> the flaky unit test group outside of Pulsar CI workflow. In that case, the
> >> impact of the usage metrics would be limited.
> >>
> >> The example of
> >> https://github.com/apache/pulsar/actions/runs/3003787409/usage shows
> >> this flaw as explained in the previous email. The total reported execution
> >> time in that report is 1d 1h 40m 21s of usage and the actual usage is about
> >> 1/3 of this.
> >>
> >> When we move the most commonly failing job out of Pulsar CI workflow, the
> >> impact of the possible usage metrics bug would be much less. I hope GitHub
> >> support responds to my issue and queries about this bug. It might take up
> >> to 7 days to get a reply and for technical questions more time. In the
> >> meantime we need a solution for getting over this CI slowness issue.
> >>
> >> -Lari
> >>
> >>
> >>
> >> On 2022/09/08 06:34:42 Lari Hotari wrote:
> >> > My current assumption of the CI slowness problem is that the usage
> >> metrics for Apache Pulsar builds on GitHub side is done incorrectly and
> >> that is resulting in apache/pulsar builds getting throttled. This
> >> assumption might be wrong, but it's the best guess at the moment.
> >> >
> >> > The facts that support this assumption is that when re-running failed
> >> jobs in a workflow, the execution times for previously successful jobs get
> >> counted as if they have all run:
> >> > Here's an example:
> >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> >> > The reported total usage is about 3x than the actual usage.
> >> >
> >> > The assumption that I have is that the "fairness algorithm" that GitHub
> >> uses to provide all Apache projects about the same amount of GitHub Actions
> >> resources would take this flawed usage as the basis of it's decisions and
> >> it decides to throttle apache/pulsar builds.
> >> >
> >> > The reason why we are getting hit by this now is that there is a high
> >> number of flaky test failures that cause almost every build to fail and we
> >> have been re-running a lot of builds.
> >> >
> >> > The other fact to support the theory of flawed usage metrics used in
> >> the fairness algorithm is that other Apache projects aren't reporting
> >> issues about GitHub Actions slowness. This is mentioned in Jarek Potiuk's
> >> comments on INFRA-23633 [1]:
> >> > > Unlike the case 2 years ago, the problem is not affecting all
> >> projects. In Apache Airflow we do > not see any particular slow-down with
> >> Public Runners at this moment (just checked - >
> >> > > everything is "as usual").. So I'd say it is something specific to
> >> Pulsar not to "ASF" as a whole.
> >> >
> >> > There are also other comments from Jarek about the GitHub "fairness
> >> algorithm" (comment [2], other comment [3])
> >> > > But I believe the current problem is different - it might be (looking
> >> at your jobs) simply a bug
> >> > > in GA that you hit or indeed your demands are simply too high.
> >> >
> >> > I have opened tickets (2 tickets: 2 days ago and yesterday) to
> >> support.github.com and there hasn't been any response to the ticket. It
> >> might take up to 7 days to get a response. We cannot rely on GitHub Support
> >> resolving this issue.
> >> >
> >> > I propose that we go ahead with the previously suggested action plan
> >> > > One possible way forward:
> >> > > 1. Cancel all existing builds in_progress or queued
> >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> >> merging PRs.
> >> > > 3. Wait for build to run for .asf.yaml change, merge it
> >> > > 4. Disable all workflows
> >> > > 5. Process specific PRs manually to improve the situation.
> >> > >    - Make GHA workflow improvements such as
> >> https://github.com/apache/pulsar/pull/17491 and
> >> https://github.com/apache/pulsar/pull/17490
> >> > >    - Quarantine all very flaky tests so that everyone doesn't waste
> >> time with those. It should be possible to merge a PR even when a
> >> quarantined test fails.
> >> > > 6. Rebase PRs (or close and re-open) that would be processed next so
> >> that changes are picked up
> >> > > 7. Enable workflows
> >> > > 8. Start processing PRs with checks to see if things are handled in a
> >> better way.
> >> > > 9. When things are stable, enable required checks again in .asf.yaml,
> >> in the meantime be careful about merging PRs
> >> > > 10. Fix quarantined flaky tests
> >> >
> >> > To clarify, steps 1-6 would be done optimally in 1 day and we would
> >> stop processing ordinary PRs during this time. We would only handle PRs
> >> that fix the CI situation during this exceptional period.
> >> >
> >> > -Lari
> >> >
> >> > Links to Jarek's comments:
> >> > [1]
> >> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> >> > [2]
> >> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> >> > [3]
> >> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> >> >
> >> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> >> > > One possible way forward:
> >> > > 1. Cancel all existing builds in_progress or queued
> >> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> >> merging PRs.
> >> > > 3. Wait for build to run for .asf.yaml change, merge it
> >> > > 4. Disable all workflows
> >> > > 5. Process specific PRs manually to improve the situation.
> >> > >    - Make GHA workflow improvements such as
> >> https://github.com/apache/pulsar/pull/17491 and
> >> https://github.com/apache/pulsar/pull/17490
> >> > >    - Quarantine all very flaky tests so that everyone doesn't waste
> >> time with those. It should be possible to merge a PR even when a
> >> quarantined test fails.
> >> > > 6. Rebase PRs (or close and re-open) that would be processed next so
> >> that changes are picked up
> >> > > 7. Enable workflows
> >> > > 8. Start processing PRs with checks to see if things are handled in a
> >> better way.
> >> > > 9. When things are stable, enable required checks again in .asf.yaml,
> >> in the meantime be careful about merging PRs
> >> > > 10. Fix quarantined flaky tests
> >> > >
> >> > > -Lari
> >> > >
> >> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> >> > > > The problem with CI is becoming worse. The build queue is 235 jobs
> >> now and the queue time is over 7 hours.
> >> > > >
> >> > > > We will need to start shedding load in the build queue and get some
> >> fixes in.
> >> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues to
> >> contain details about some activities. I have created 2 GitHub Support
> >> tickets, but usually it takes up to a week to get a response.
> >> > > >
> >> > > > I have some assumptions about the issue, but they are just
> >> assumptions.
> >> > > > One oddity is that when re-running failed jobs is used in a large
> >> workflow, the execution times for previously successful jobs get counted as
> >> if they have run.
> >> > > > Here's an example:
> >> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> >> > > > The reported usage is about 3x than the actual usage.
> >> > > > The assumption that I have is that the "fairness algorithm" that
> >> GitHub uses to provide all Apache projects about the same amount of GitHub
> >> Actions resources would take this flawed usage as the basis of it's
> >> decisions.
> >> > > > The reason why we are getting hit by this now is that there is a
> >> high number of flaky test failures that cause almost every build to fail
> >> and we are re-running a lot of builds.
> >> > > >
> >> > > > Another problem there is that the GitHub Actions search doesn't
> >> always show all workflow runs that are running. This has happened before
> >> when the GitHub Actions workflow search index was corrupted. GitHub Support
> >> resolved that by rebuilding the search index with some manual admin
> >> operation behind the scenes.
> >> > > >
> >> > > > I'm proposing that we start shedding load from CI by cancelling
> >> build jobs and selecting which jobs to process so that we get the CI issue
> >> resolved. We might also have to disable required checks so that we have
> >> some way to get changes merged while CI doesn't work properly.
> >> > > >
> >> > > > I'm expecting lazy consensus on fixing CI unless someone proposes a
> >> better plan. Let's keep everyone informed in this mailing list thread.
> >> > > >
> >> > > > -Lari
> >> > > >
> >> > > >
> >> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> >> > > > > We are going to need to take actions to fix our problems. See
> >> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> >> > > > >
> >> > > > > Jarek has done a large amount of GitHub Action work with Apache
> >> Airflow and his suggestions might be helpful. One of his suggestions was
> >> Apache Yetus. I think he means using the Maven plugins -
> >> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> >> > > > >
> >> > > > >
> >> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org>
> >> wrote:
> >> > > > > >
> >> > > > > > The Apache Infra ticket is
> >> https://issues.apache.org/jira/browse/INFRA-23633 .
> >> > > > > >
> >> > > > > > -Lari
> >> > > > > >
> >> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> >> > > > > >> I asked for an update on the Apache org GitHub Actions usage
> >> stats from Gavin McDonald on the-asf slack in this thread:
> >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> >> .
> >> > > > > >>
> >> > > > > >> I hope we get this issue resolved since it delays PR
> >> processing a lot.
> >> > > > > >>
> >> > > > > >> -Lari
> >> > > > > >>
> >> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> >> > > > > >>> Pulsar CI continues to be congested, and the build queue [1]
> >> is very long at the moment. There are 147 build jobs in the queue and 16
> >> jobs in progress at the moment.
> >> > > > > >>>
> >> > > > > >>> I would strongly advice everyone to use "personal CI" to
> >> mitigate the issue of the long delay of CI feedback. You can simply open a
> >> PR to your own personal fork of apache/pulsar to run the builds in your
> >> "personal CI". There's more details in the previous emails in this thread.
> >> > > > > >>>
> >> > > > > >>> -Lari
> >> > > > > >>>
> >> > > > > >>> [1] - build queue:
> >> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> >> > > > > >>>
> >> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> >> > > > > >>>> Pulsar CI continues to be congested, and the build queue is
> >> long.
> >> > > > > >>>>
> >> > > > > >>>> I would strongly advice everyone to use "personal CI" to
> >> mitigate the issue of the long delay of CI feedback. You can simply open a
> >> PR to your own personal fork of apache/pulsar to run the builds in your
> >> "personal CI". There's more details in the previous email in this thread.
> >> > > > > >>>>
> >> > > > > >>>> Some updates:
> >> > > > > >>>>
> >> > > > > >>>> There has been a discussion with Gavin McDonald from ASF
> >> infra on the-asf slack about getting usage reports from GitHub to support
> >> the investigation. Slack thread is the same one mentioned in the previous
> >> email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> >> Gavin already requested the usage report in GitHub UI, but it produced
> >> invalid results.
> >> > > > > >>>>
> >> > > > > >>>> I made a change to mitigate a source of additional GitHub
> >> Actions overhead.
> >> > > > > >>>> In the past, each cherry-picked commit to a maintenance
> >> branch of Pulsar has triggered a lot of workflow runs.
> >> > > > > >>>>
> >> > > > > >>>> The solution for cancelling duplicate builds automatically
> >> is to add this definition to the workflow definition:
> >> > > > > >>>> concurrency:
> >> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> >> > > > > >>>>  cancel-in-progress: true
> >> > > > > >>>>
> >> > > > > >>>> I added this to all maintenance branch GitHub Actions
> >> workflows:
> >> > > > > >>>>
> >> > > > > >>>> branch-2.10 change:
> >> > > > > >>>>
> >> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> >> > > > > >>>> branch-2.9 change:
> >> > > > > >>>>
> >> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> >> > > > > >>>> branch-2.8 change:
> >> > > > > >>>>
> >> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> >> > > > > >>>> branch-2.7:
> >> > > > > >>>>
> >> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> >> > > > > >>>>
> >> > > > > >>>> branch-2.11 already contains the necessary config for
> >> cancelling duplicate builds.
> >> > > > > >>>>
> >> > > > > >>>> The benefit of the above change is that when multiple
> >> commits are cherry-picked to a branch at once, only the build of the last
> >> commit will get run eventually. The builds for the intermediate commits
> >> will get cancelled. Obviously there's a tradeoff here that we don't get the
> >> information if one of the earlier commits breaks the build. It's the cost
> >> that we need to pay. Nevertheless our build is so flaky that it's hard to
> >> determine whether a failed build result is only caused by bad flaky test or
> >> whether it's an actual failure. Because of this we don't lose anything by
> >> cancelling builds. It's more important to save build resources. In the
> >> maintenance branches for 2.10 and older, the average total build time
> >> consumed is around 20 hours which is a lot.
> >> > > > > >>>>
> >> > > > > >>>> At this time, the overhead of maintenance branch builds
> >> doesn't seem to be the source of the problems. There must be some other
> >> issue which is possibly related to exceeding a usage quota. Hopefully we
> >> get the CI slowness issue solved asap.
> >> > > > > >>>>
> >> > > > > >>>> BR,
> >> > > > > >>>>
> >> > > > > >>>> Lari
> >> > > > > >>>>
> >> > > > > >>>>
> >> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> >> > > > > >>>>> Hi,
> >> > > > > >>>>>
> >> > > > > >>>>> GitHub Actions builds have been piling up in the build
> >> queue in the last few days.
> >> > > > > >>>>> I posted on builds@apache.org
> >> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> >> created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
> >> about this issue.
> >> > > > > >>>>> There's also a thread on the-asf slack,
> >> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> >> > > > > >>>>>
> >> > > > > >>>>> It seems that our build queue is finally getting picked up,
> >> but it would be great to see if we hit quota and whether that is the cause
> >> of pauses.
> >> > > > > >>>>>
> >> > > > > >>>>> Another issue is that the master branch broke after merging
> >> 2 conflicting PRs.
> >> > > > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300
> >> .
> >> > > > > >>>>>
> >> > > > > >>>>> Merging PRs will be slow until we have these 2 problems
> >> solved and existing PRs rebased over the changes. Let's prioritize merging
> >> #17300 before pushing more changes.
> >> > > > > >>>>>
> >> > > > > >>>>> I'd like to point out that a good way to get build feedback
> >> before sending a PR, is to run builds on your personal GitHub Actions CI.
> >> The benefit of this is that it doesn't consume the shared quota and builds
> >> usually start instantly.
> >> > > > > >>>>> There are instructions in the contributors guide about
> >> this.
> >> > > > > >>>>>
> >> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> >> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar to
> >> run builds on your personal GitHub Actions CI.
> >> > > > > >>>>>
> >> > > > > >>>>> BR,
> >> > > > > >>>>>
> >> > > > > >>>>> Lari
> >> > > > > >>>>>
> >> > > > > >>>>>
> >> > > > > >>>>>
> >> > > > > >>>>>
> >> > > > > >>>>>
> >> > > > > >>>>>
> >> > > > > >>>>>
> >> > > > > >>>>>
> >> > > > > >>>>
> >> > > > > >>>
> >> > > > > >>
> >> > > > >
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
> 

Re: Pulsar CI congested, master branch build broken

Posted by Nicolò Boschi <bo...@gmail.com>.
This is the pull for step 2. https://github.com/apache/pulsar/pull/17539
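
For anyone who hasn't touched it before: the required-checks setting lives in
.asf.yaml under the GitHub branch protection section. Dropping the requirement
roughly amounts to the sketch below; the commented check name is only a
placeholder, and the authoritative change is whatever is in #17539.

# .asf.yaml (illustrative sketch only, not the exact diff of #17539)
github:
  protected_branches:
    master:
      required_status_checks:
        strict: false
        # step 2: temporarily require no check contexts so that PRs can be
        # merged while CI is congested; the list is restored in step 9
        contexts: []
        # before the change this listed the required check(s), e.g.:
        # contexts:
        #   - "Pulsar CI checks completed"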

This is the script I'm going to use to cancel pending workflows.
https://github.com/nicoloboschi/pulsar-validation-tool/blob/master/pulsar-scripts/pulsar-gha/cancel-workflows.js

I'm going to run the script in a few minutes.
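
For reference, a rough equivalent of what the linked cancel-workflows.js script
does, sketched as a manually triggered maintenance workflow that calls the
GitHub REST API with plain curl (the file name and the 100-run page size are
illustrative; the linked script is what will actually be run):

# .github/workflows/cancel-queued-runs.yaml (sketch)
name: Cancel queued workflow runs
on:
  workflow_dispatch:
jobs:
  cancel-queued:
    runs-on: ubuntu-22.04
    permissions:
      actions: write
    steps:
      - name: Cancel the first page of queued runs
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        run: |
          # list up to 100 queued runs, then ask GitHub to cancel each one;
          # re-run (or paginate) until the queue is empty
          curl -s -H "Authorization: Bearer $GITHUB_TOKEN" \
            "https://api.github.com/repos/apache/pulsar/actions/runs?status=queued&per_page=100" \
            | jq -r '.workflow_runs[].id' \
            | while read -r run_id; do
                curl -s -X POST -H "Authorization: Bearer $GITHUB_TOKEN" \
                  "https://api.github.com/repos/apache/pulsar/actions/runs/${run_id}/cancel"
              done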

I posted an announcement on Slack about what is happening:
https://apache-pulsar.slack.com/archives/C5ZSVEN4E/p1662624668695339?thread_ts=1662463042.016709&cid=C5ZSVEN4E

> we’re going to execute the plan described in the ML. So any queued actions
> will be cancelled. In order to validate your pull it is suggested to run
> the actions in your own Pulsar fork. Please don’t re-run failed jobs or
> push any other commits to avoid triggering new actions.


Nicolò Boschi


On Thu, Sep 8, 2022 at 09:42, Nicolò Boschi <bo...@gmail.com> wrote:

> Thanks Lari for the detailed explanation. This is kind of an emergency
> situation and I believe your plan is the way to go now.
>
> I already prepared a pull for moving the flaky suite out of the Pulsar CI
> workflow: https://github.com/nicoloboschi/pulsar/pull/8
> I can take care of the execution of the plan.
>
> > 1. Cancel all existing builds in_progress or queued
>
> I have a script locally that uses GHA to check and cancel the pending
> runs. We can extend it to all the queued builds (will share it soon).
>
> > 2. Edit .asf.yaml and drop the "required checks" requirement for merging
> PRs.
> > 3. Wait for build to run for .asf.yaml change, merge it
>
> After the pull is out, we'll need to cancel all other workflows that
> contributors may inadvertently have triggered.
>
> > 4. Disable all workflows
> > 5. Process specific PRs manually to improve the situation.
> >    - Make GHA workflow improvements such as
> https://github.com/apache/pulsar/pull/17491 and
> https://github.com/apache/pulsar/pull/17490
> >    - Quarantine all very flaky tests so that everyone doesn't waste time
> with those. It should be possible to merge a PR even when a quarantined
> test fails.
>
> in this step we will merge this
> https://github.com/nicoloboschi/pulsar/pull/8
>
> I want to add to the list this improvement to reduce runners usage in case
> of doc or cpp changes.
> https://github.com/nicoloboschi/pulsar/pull/7
>
>
> > 6. Rebase PRs (or close and re-open) that would be processed next so
> that changes are picked up
>
> It's better to leave this task to the author of the pull in order to not
> create too much load at the same time
>
> > 7. Enable workflows
> > 8. Start processing PRs with checks to see if things are handled in a
> better way.
> > 9. When things are stable, enable required checks again in .asf.yaml, in
> the meantime be careful about merging PRs
> > 10. Fix quarantined flaky tests
>
>
> Nicolò Boschi
>
>
> Il giorno gio 8 set 2022 alle ore 09:27 Lari Hotari <lh...@apache.org>
> ha scritto:
>
>> If my assumption of the GitHub usage metrics bug in the GitHub Actions
>> build job queue fairness algorithm is correct, what would help is running
>> the flaky unit test group outside of Pulsar CI workflow. In that case, the
>> impact of the usage metrics would be limited.
>>
>> The example of
>> https://github.com/apache/pulsar/actions/runs/3003787409/usage shows
>> this flaw as explained in the previous email. The total reported execution
>> time in that report is 1d 1h 40m 21s of usage and the actual usage is about
>> 1/3 of this.
>>
>> When we move the most commonly failing job out of Pulsar CI workflow, the
>> impact of the possible usage metrics bug would be much less. I hope GitHub
>> support responds to my issue and queries about this bug. It might take up
>> to 7 days to get a reply and for technical questions more time. In the
>> meantime we need a solution for getting over this CI slowness issue.
>>
>> -Lari
>>
>>
>>
>> On 2022/09/08 06:34:42 Lari Hotari wrote:
>> > My current assumption of the CI slowness problem is that the usage
>> metrics for Apache Pulsar builds on GitHub side is done incorrectly and
>> that is resulting in apache/pulsar builds getting throttled. This
>> assumption might be wrong, but it's the best guess at the moment.
>> >
>> > The facts that support this assumption is that when re-running failed
>> jobs in a workflow, the execution times for previously successful jobs get
>> counted as if they have all run:
>> > Here's an example:
>> https://github.com/apache/pulsar/actions/runs/3003787409/usage
>> > The reported total usage is about 3x than the actual usage.
>> >
>> > The assumption that I have is that the "fairness algorithm" that GitHub
>> uses to provide all Apache projects about the same amount of GitHub Actions
>> resources would take this flawed usage as the basis of it's decisions and
>> it decides to throttle apache/pulsar builds.
>> >
>> > The reason why we are getting hit by this now is that there is a high
>> number of flaky test failures that cause almost every build to fail and we
>> have been re-running a lot of builds.
>> >
>> > The other fact to support the theory of flawed usage metrics used in
>> the fairness algorithm is that other Apache projects aren't reporting
>> issues about GitHub Actions slowness. This is mentioned in Jarek Potiuk's
>> comments on INFRA-23633 [1]:
>> > > Unlike the case 2 years ago, the problem is not affecting all
>> projects. In Apache Airflow we do > not see any particular slow-down with
>> Public Runners at this moment (just checked - >
>> > > everything is "as usual").. So I'd say it is something specific to
>> Pulsar not to "ASF" as a whole.
>> >
>> > There are also other comments from Jarek about the GitHub "fairness
>> algorithm" (comment [2], other comment [3])
>> > > But I believe the current problem is different - it might be (looking
>> at your jobs) simply a bug
>> > > in GA that you hit or indeed your demands are simply too high.
>> >
>> > I have opened tickets (2 tickets: 2 days ago and yesterday) to
>> support.github.com and there hasn't been any response to the ticket. It
>> might take up to 7 days to get a response. We cannot rely on GitHub Support
>> resolving this issue.
>> >
>> > I propose that we go ahead with the previously suggested action plan
>> > > One possible way forward:
>> > > 1. Cancel all existing builds in_progress or queued
>> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
>> merging PRs.
>> > > 3. Wait for build to run for .asf.yaml change, merge it
>> > > 4. Disable all workflows
>> > > 5. Process specific PRs manually to improve the situation.
>> > >    - Make GHA workflow improvements such as
>> https://github.com/apache/pulsar/pull/17491 and
>> https://github.com/apache/pulsar/pull/17490
>> > >    - Quarantine all very flaky tests so that everyone doesn't waste
>> time with those. It should be possible to merge a PR even when a
>> quarantined test fails.
>> > > 6. Rebase PRs (or close and re-open) that would be processed next so
>> that changes are picked up
>> > > 7. Enable workflows
>> > > 8. Start processing PRs with checks to see if things are handled in a
>> better way.
>> > > 9. When things are stable, enable required checks again in .asf.yaml,
>> in the meantime be careful about merging PRs
>> > > 10. Fix quarantined flaky tests
>> >
>> > To clarify, steps 1-6 would be done optimally in 1 day and we would
>> stop processing ordinary PRs during this time. We would only handle PRs
>> that fix the CI situation during this exceptional period.
>> >
>> > -Lari
>> >
>> > Links to Jarek's comments:
>> > [1]
>> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
>> > [2]
>> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
>> > [3]
>> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
>> >
>> > On 2022/09/07 17:01:43 Lari Hotari wrote:
>> > > One possible way forward:
>> > > 1. Cancel all existing builds in_progress or queued
>> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
>> merging PRs.
>> > > 3. Wait for build to run for .asf.yaml change, merge it
>> > > 4. Disable all workflows
>> > > 5. Process specific PRs manually to improve the situation.
>> > >    - Make GHA workflow improvements such as
>> https://github.com/apache/pulsar/pull/17491 and
>> https://github.com/apache/pulsar/pull/17490
>> > >    - Quarantine all very flaky tests so that everyone doesn't waste
>> time with those. It should be possible to merge a PR even when a
>> quarantined test fails.
>> > > 6. Rebase PRs (or close and re-open) that would be processed next so
>> that changes are picked up
>> > > 7. Enable workflows
>> > > 8. Start processing PRs with checks to see if things are handled in a
>> better way.
>> > > 9. When things are stable, enable required checks again in .asf.yaml,
>> in the meantime be careful about merging PRs
>> > > 10. Fix quarantined flaky tests
>> > >
>> > > -Lari
>> > >
>> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
>> > > > The problem with CI is becoming worse. The build queue is 235 jobs
>> now and the queue time is over 7 hours.
>> > > >
>> > > > We will need to start shedding load in the build queue and get some
>> fixes in.
>> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues to
>> contain details about some activities. I have created 2 GitHub Support
>> tickets, but usually it takes up to a week to get a response.
>> > > >
>> > > > I have some assumptions about the issue, but they are just
>> assumptions.
>> > > > One oddity is that when re-running failed jobs is used in a large
>> workflow, the execution times for previously successful jobs get counted as
>> if they have run.
>> > > > Here's an example:
>> https://github.com/apache/pulsar/actions/runs/3003787409/usage
>> > > > The reported usage is about 3x than the actual usage.
>> > > > The assumption that I have is that the "fairness algorithm" that
>> GitHub uses to provide all Apache projects about the same amount of GitHub
>> Actions resources would take this flawed usage as the basis of it's
>> decisions.
>> > > > The reason why we are getting hit by this now is that there is a
>> high number of flaky test failures that cause almost every build to fail
>> and we are re-running a lot of builds.
>> > > >
>> > > > Another problem there is that the GitHub Actions search doesn't
>> always show all workflow runs that are running. This has happened before
>> when the GitHub Actions workflow search index was corrupted. GitHub Support
>> resolved that by rebuilding the search index with some manual admin
>> operation behind the scenes.
>> > > >
>> > > > I'm proposing that we start shedding load from CI by cancelling
>> build jobs and selecting which jobs to process so that we get the CI issue
>> resolved. We might also have to disable required checks so that we have
>> some way to get changes merged while CI doesn't work properly.
>> > > >
>> > > > I'm expecting lazy consensus on fixing CI unless someone proposes a
>> better plan. Let's keep everyone informed in this mailing list thread.
>> > > >
>> > > > -Lari
>> > > >
>> > > >
>> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
>> > > > > We are going to need to take actions to fix our problems. See
>> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
>> > > > >
>> > > > > Jarek has done a large amount of GitHub Action work with Apache
>> Airflow and his suggestions might be helpful. One of his suggestions was
>> Apache Yetus. I think he means using the Maven plugins -
>> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
>> > > > >
>> > > > >
>> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org>
>> wrote:
>> > > > > >
>> > > > > > The Apache Infra ticket is
>> https://issues.apache.org/jira/browse/INFRA-23633 .
>> > > > > >
>> > > > > > -Lari
>> > > > > >
>> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
>> > > > > >> I asked for an update on the Apache org GitHub Actions usage
>> stats from Gavin McDonald on the-asf slack in this thread:
>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
>> .
>> > > > > >>
>> > > > > >> I hope we get this issue resolved since it delays PR
>> processing a lot.
>> > > > > >>
>> > > > > >> -Lari
>> > > > > >>
>> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
>> > > > > >>> Pulsar CI continues to be congested, and the build queue [1]
>> is very long at the moment. There are 147 build jobs in the queue and 16
>> jobs in progress at the moment.
>> > > > > >>>
>> > > > > >>> I would strongly advice everyone to use "personal CI" to
>> mitigate the issue of the long delay of CI feedback. You can simply open a
>> PR to your own personal fork of apache/pulsar to run the builds in your
>> "personal CI". There's more details in the previous emails in this thread.
>> > > > > >>>
>> > > > > >>> -Lari
>> > > > > >>>
>> > > > > >>> [1] - build queue:
>> https://github.com/apache/pulsar/actions?query=is%3Aqueued
>> > > > > >>>
>> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
>> > > > > >>>> Pulsar CI continues to be congested, and the build queue is
>> long.
>> > > > > >>>>
>> > > > > >>>> I would strongly advice everyone to use "personal CI" to
>> mitigate the issue of the long delay of CI feedback. You can simply open a
>> PR to your own personal fork of apache/pulsar to run the builds in your
>> "personal CI". There's more details in the previous email in this thread.
>> > > > > >>>>
>> > > > > >>>> Some updates:
>> > > > > >>>>
>> > > > > >>>> There has been a discussion with Gavin McDonald from ASF
>> infra on the-asf slack about getting usage reports from GitHub to support
>> the investigation. Slack thread is the same one mentioned in the previous
>> email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
>> Gavin already requested the usage report in GitHub UI, but it produced
>> invalid results.
>> > > > > >>>>
>> > > > > >>>> I made a change to mitigate a source of additional GitHub
>> Actions overhead.
>> > > > > >>>> In the past, each cherry-picked commit to a maintenance
>> branch of Pulsar has triggered a lot of workflow runs.
>> > > > > >>>>
>> > > > > >>>> The solution for cancelling duplicate builds automatically
>> is to add this definition to the workflow definition:
>> > > > > >>>> concurrency:
>> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
>> > > > > >>>>  cancel-in-progress: true
>> > > > > >>>>
>> > > > > >>>> I added this to all maintenance branch GitHub Actions
>> workflows:
>> > > > > >>>>
>> > > > > >>>> branch-2.10 change:
>> > > > > >>>>
>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
>> > > > > >>>> branch-2.9 change:
>> > > > > >>>>
>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
>> > > > > >>>> branch-2.8 change:
>> > > > > >>>>
>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
>> > > > > >>>> branch-2.7:
>> > > > > >>>>
>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
>> > > > > >>>>
>> > > > > >>>> branch-2.11 already contains the necessary config for
>> cancelling duplicate builds.
>> > > > > >>>>
>> > > > > >>>> The benefit of the above change is that when multiple
>> commits are cherry-picked to a branch at once, only the build of the last
>> commit will get run eventually. The builds for the intermediate commits
>> will get cancelled. Obviously there's a tradeoff here that we don't get the
>> information if one of the earlier commits breaks the build. It's the cost
>> that we need to pay. Nevertheless our build is so flaky that it's hard to
>> determine whether a failed build result is only caused by bad flaky test or
>> whether it's an actual failure. Because of this we don't lose anything by
>> cancelling builds. It's more important to save build resources. In the
>> maintenance branches for 2.10 and older, the average total build time
>> consumed is around 20 hours which is a lot.
>> > > > > >>>>
>> > > > > >>>> At this time, the overhead of maintenance branch builds
>> doesn't seem to be the source of the problems. There must be some other
>> issue which is possibly related to exceeding a usage quota. Hopefully we
>> get the CI slowness issue solved asap.
>> > > > > >>>>
>> > > > > >>>> BR,
>> > > > > >>>>
>> > > > > >>>> Lari
>> > > > > >>>>
>> > > > > >>>>
>> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
>> > > > > >>>>> Hi,
>> > > > > >>>>>
>> > > > > >>>>> GitHub Actions builds have been piling up in the build
>> queue in the last few days.
>> > > > > >>>>> I posted on builds@apache.org
>> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
>> created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
>> about this issue.
>> > > > > >>>>> There's also a thread on the-asf slack,
>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
>> > > > > >>>>>
>> > > > > >>>>> It seems that our build queue is finally getting picked up,
>> but it would be great to see if we hit quota and whether that is the cause
>> of pauses.
>> > > > > >>>>>
>> > > > > >>>>> Another issue is that the master branch broke after merging
>> 2 conflicting PRs.
>> > > > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300
>> .
>> > > > > >>>>>
>> > > > > >>>>> Merging PRs will be slow until we have these 2 problems
>> solved and existing PRs rebased over the changes. Let's prioritize merging
>> #17300 before pushing more changes.
>> > > > > >>>>>
>> > > > > >>>>> I'd like to point out that a good way to get build feedback
>> before sending a PR, is to run builds on your personal GitHub Actions CI.
>> The benefit of this is that it doesn't consume the shared quota and builds
>> usually start instantly.
>> > > > > >>>>> There are instructions in the contributors guide about
>> this.
>> > > > > >>>>>
>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
>> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar to
>> run builds on your personal GitHub Actions CI.
>> > > > > >>>>>
>> > > > > >>>>> BR,
>> > > > > >>>>>
>> > > > > >>>>> Lari
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>>
>> > > > > >>>>
>> > > > > >>>
>> > > > > >>
>> > > > >
>> > > > >
>> > > >
>> > >
>> >
>>
>

Re: Pulsar CI congested, master branch build broken

Posted by Nicolò Boschi <bo...@gmail.com>.
Thanks, Lari, for the detailed explanation. This is something of an emergency
situation, and I believe your plan is the way to go now.

I already prepared a pull for moving the flaky suite out of the Pulsar CI
workflow: https://github.com/nicoloboschi/pulsar/pull/8
I can take care of the execution of the plan.

> 1. Cancel all existing builds in_progress or queued

I have a script locally that uses the GHA API to check and cancel the pending
runs. We can extend it to all the queued builds (I will share it soon).

> 2. Edit .asf.yaml and drop the "required checks" requirement for merging
PRs.
> 3. Wait for build to run for .asf.yaml change, merge it

After the pull is out, we'll need to cancel all other workflows that
contributors may inadvertently have triggered.

> 4. Disable all workflows
> 5. Process specific PRs manually to improve the situation.
>    - Make GHA workflow improvements such as
https://github.com/apache/pulsar/pull/17491 and
https://github.com/apache/pulsar/pull/17490
>    - Quarantine all very flaky tests so that everyone doesn't waste time
with those. It should be possible to merge a PR even when a quarantined
test fails.

In this step we will merge this:
https://github.com/nicoloboschi/pulsar/pull/8
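
To make the quarantine idea concrete: the flaky tests end up in their own
TestNG group that runs in a separate, non-blocking job, roughly along these
lines (job, module and group names are only illustrative; the real layout is
in the pull above):

# fragment of the Pulsar CI workflow (sketch)
jobs:
  unit-tests-quarantined:
    name: Unit tests (quarantined flaky tests)
    runs-on: ubuntu-22.04
    # a failure here is visible in the run but does not fail the workflow,
    # so it cannot block merging a PR
    continue-on-error: true
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-java@v3
        with:
          distribution: 'temurin'
          java-version: '17'
      - name: Run only the quarantined TestNG group
        # in the real workflow the project is built first; omitted here
        run: mvn -B -ntp test -pl pulsar-broker -Dgroups=quarantine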

I want to add to the list this improvement, which reduces runner usage for
doc-only or C++-only changes:
https://github.com/nicoloboschi/pulsar/pull/7
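
The underlying idea is standard GitHub Actions path filtering: don't start the
heavy Java CI jobs at all when a pull request only touches documentation or
the C++ client. A minimal sketch (the path patterns are illustrative; the
exact mechanism is in the pull above):

# trigger section of a heavy CI workflow (sketch)
name: Pulsar CI
on:
  pull_request:
    branches:
      - master
    paths-ignore:
      - 'site2/**'               # website / documentation sources
      - '**/*.md'
      - 'pulsar-client-cpp/**'   # C++ client sources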


> 6. Rebase PRs (or close and re-open) that would be processed next so that
changes are picked up

It's better to leave this task to the author of the pull, in order not to
create too much load at the same time.

> 7. Enable workflows
> 8. Start processing PRs with checks to see if things are handled in a
better way.
> 9. When things are stable, enable required checks again in .asf.yaml, in
the meantime be careful about merging PRs
> 10. Fix quarantined flaky tests


Nicolò Boschi


On Thu, Sep 8, 2022 at 09:27, Lari Hotari <lh...@apache.org> wrote:

> If my assumption of the GitHub usage metrics bug in the GitHub Actions
> build job queue fairness algorithm is correct, what would help is running
> the flaky unit test group outside of Pulsar CI workflow. In that case, the
> impact of the usage metrics would be limited.
>
> The example of
> https://github.com/apache/pulsar/actions/runs/3003787409/usage shows this
> flaw as explained in the previous email. The total reported execution time
> in that report is 1d 1h 40m 21s of usage and the actual usage is about 1/3
> of this.
>
> When we move the most commonly failing job out of Pulsar CI workflow, the
> impact of the possible usage metrics bug would be much less. I hope GitHub
> support responds to my issue and queries about this bug. It might take up
> to 7 days to get a reply and for technical questions more time. In the
> meantime we need a solution for getting over this CI slowness issue.
>
> -Lari
>
>
>
> On 2022/09/08 06:34:42 Lari Hotari wrote:
> > My current assumption of the CI slowness problem is that the usage
> metrics for Apache Pulsar builds on GitHub side is done incorrectly and
> that is resulting in apache/pulsar builds getting throttled. This
> assumption might be wrong, but it's the best guess at the moment.
> >
> > The facts that support this assumption is that when re-running failed
> jobs in a workflow, the execution times for previously successful jobs get
> counted as if they have all run:
> > Here's an example:
> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > The reported total usage is about 3x than the actual usage.
> >
> > The assumption that I have is that the "fairness algorithm" that GitHub
> uses to provide all Apache projects about the same amount of GitHub Actions
> resources would take this flawed usage as the basis of it's decisions and
> it decides to throttle apache/pulsar builds.
> >
> > The reason why we are getting hit by this now is that there is a high
> number of flaky test failures that cause almost every build to fail and we
> have been re-running a lot of builds.
> >
> > The other fact to support the theory of flawed usage metrics used in the
> fairness algorithm is that other Apache projects aren't reporting issues
> about GitHub Actions slowness. This is mentioned in Jarek Potiuk's comments
> on INFRA-23633 [1]:
> > > Unlike the case 2 years ago, the problem is not affecting all
> projects. In Apache Airflow we do > not see any particular slow-down with
> Public Runners at this moment (just checked - >
> > > everything is "as usual").. So I'd say it is something specific to
> Pulsar not to "ASF" as a whole.
> >
> > There are also other comments from Jarek about the GitHub "fairness
> algorithm" (comment [2], other comment [3])
> > > But I believe the current problem is different - it might be (looking
> at your jobs) simply a bug
> > > in GA that you hit or indeed your demands are simply too high.
> >
> > I have opened tickets (2 tickets: 2 days ago and yesterday) to
> support.github.com and there hasn't been any response to the ticket. It
> might take up to 7 days to get a response. We cannot rely on GitHub Support
> resolving this issue.
> >
> > I propose that we go ahead with the previously suggested action plan
> > > One possible way forward:
> > > 1. Cancel all existing builds in_progress or queued
> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> merging PRs.
> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > 4. Disable all workflows
> > > 5. Process specific PRs manually to improve the situation.
> > >    - Make GHA workflow improvements such as
> https://github.com/apache/pulsar/pull/17491 and
> https://github.com/apache/pulsar/pull/17490
> > >    - Quarantine all very flaky tests so that everyone doesn't waste
> time with those. It should be possible to merge a PR even when a
> quarantined test fails.
> > > 6. Rebase PRs (or close and re-open) that would be processed next so
> that changes are picked up
> > > 7. Enable workflows
> > > 8. Start processing PRs with checks to see if things are handled in a
> better way.
> > > 9. When things are stable, enable required checks again in .asf.yaml,
> in the meantime be careful about merging PRs
> > > 10. Fix quarantined flaky tests
> >
> > To clarify, steps 1-6 would be done optimally in 1 day and we would stop
> processing ordinary PRs during this time. We would only handle PRs that fix
> the CI situation during this exceptional period.
> >
> > -Lari
> >
> > Links to Jarek's comments:
> > [1]
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> > [2]
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> > [3]
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> >
> > On 2022/09/07 17:01:43 Lari Hotari wrote:
> > > One possible way forward:
> > > 1. Cancel all existing builds in_progress or queued
> > > 2. Edit .asf.yaml and drop the "required checks" requirement for
> merging PRs.
> > > 3. Wait for build to run for .asf.yaml change, merge it
> > > 4. Disable all workflows
> > > 5. Process specific PRs manually to improve the situation.
> > >    - Make GHA workflow improvements such as
> https://github.com/apache/pulsar/pull/17491 and
> https://github.com/apache/pulsar/pull/17490
> > >    - Quarantine all very flaky tests so that everyone doesn't waste
> time with those. It should be possible to merge a PR even when a
> quarantined test fails.
> > > 6. Rebase PRs (or close and re-open) that would be processed next so
> that changes are picked up
> > > 7. Enable workflows
> > > 8. Start processing PRs with checks to see if things are handled in a
> better way.
> > > 9. When things are stable, enable required checks again in .asf.yaml,
> in the meantime be careful about merging PRs
> > > 10. Fix quarantined flaky tests
> > >
> > > -Lari
> > >
> > > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > > The problem with CI is becoming worse. The build queue is 235 jobs
> now and the queue time is over 7 hours.
> > > >
> > > > We will need to start shedding load in the build queue and get some
> fixes in.
> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues to
> contain details about some activities. I have created 2 GitHub Support
> tickets, but usually it takes up to a week to get a response.
> > > >
> > > > I have some assumptions about the issue, but they are just
> assumptions.
> > > > One oddity is that when re-running failed jobs is used in a large
> workflow, the execution times for previously successful jobs get counted as
> if they have run.
> > > > Here's an example:
> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > The reported usage is about 3x than the actual usage.
> > > > The assumption that I have is that the "fairness algorithm" that
> GitHub uses to provide all Apache projects about the same amount of GitHub
> Actions resources would take this flawed usage as the basis of it's
> decisions.
> > > > The reason why we are getting hit by this now is that there is a
> high number of flaky test failures that cause almost every build to fail
> and we are re-running a lot of builds.
> > > >
> > > > Another problem there is that the GitHub Actions search doesn't
> always show all workflow runs that are running. This has happened before
> when the GitHub Actions workflow search index was corrupted. GitHub Support
> resolved that by rebuilding the search index with some manual admin
> operation behind the scenes.
> > > >
> > > > I'm proposing that we start shedding load from CI by cancelling
> build jobs and selecting which jobs to process so that we get the CI issue
> resolved. We might also have to disable required checks so that we have
> some way to get changes merged while CI doesn't work properly.
> > > >
> > > > I'm expecting lazy consensus on fixing CI unless someone proposes a
> better plan. Let's keep everyone informed in this mailing list thread.
> > > >
> > > > -Lari
> > > >
> > > >
> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > > We are going to need to take actions to fix our problems. See
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > >
> > > > > Jarek has done a large amount of GitHub Action work with Apache
> Airflow and his suggestions might be helpful. One of his suggestions was
> Apache Yetus. I think he means using the Maven plugins -
> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > >
> > > > >
> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org>
> wrote:
> > > > > >
> > > > > > The Apache Infra ticket is
> https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > > >
> > > > > > -Lari
> > > > > >
> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > > >> I asked for an update on the Apache org GitHub Actions usage
> stats from Gavin McDonald on the-asf slack in this thread:
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> .
> > > > > >>
> > > > > >> I hope we get this issue resolved since it delays PR processing
> a lot.
> > > > > >>
> > > > > >> -Lari
> > > > > >>
> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > > >>> Pulsar CI continues to be congested, and the build queue [1]
> is very long at the moment. There are 147 build jobs in the queue and 16
> jobs in progress at the moment.
> > > > > >>>
> > > > > >>> I would strongly advice everyone to use "personal CI" to
> mitigate the issue of the long delay of CI feedback. You can simply open a
> PR to your own personal fork of apache/pulsar to run the builds in your
> "personal CI". There's more details in the previous emails in this thread.
> > > > > >>>
> > > > > >>> -Lari
> > > > > >>>
> > > > > >>> [1] - build queue:
> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > > >>>
> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > > >>>> Pulsar CI continues to be congested, and the build queue is
> long.
> > > > > >>>>
> > > > > >>>> I would strongly advice everyone to use "personal CI" to
> mitigate the issue of the long delay of CI feedback. You can simply open a
> PR to your own personal fork of apache/pulsar to run the builds in your
> "personal CI". There's more details in the previous email in this thread.
> > > > > >>>>
> > > > > >>>> Some updates:
> > > > > >>>>
> > > > > >>>> There has been a discussion with Gavin McDonald from ASF
> infra on the-asf slack about getting usage reports from GitHub to support
> the investigation. Slack thread is the same one mentioned in the previous
> email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> Gavin already requested the usage report in GitHub UI, but it produced
> invalid results.
> > > > > >>>>
> > > > > >>>> I made a change to mitigate a source of additional GitHub
> Actions overhead.
> > > > > >>>> In the past, each cherry-picked commit to a maintenance
> branch of Pulsar has triggered a lot of workflow runs.
> > > > > >>>>
> > > > > >>>> The solution for cancelling duplicate builds automatically is
> to add this definition to the workflow definition:
> > > > > >>>> concurrency:
> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > > > >>>>  cancel-in-progress: true
> > > > > >>>>
> > > > > >>>> I added this to all maintenance branch GitHub Actions
> workflows:
> > > > > >>>>
> > > > > >>>> branch-2.10 change:
> > > > > >>>>
> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > > >>>> branch-2.9 change:
> > > > > >>>>
> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > > >>>> branch-2.8 change:
> > > > > >>>>
> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > > >>>> branch-2.7:
> > > > > >>>>
> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > > >>>>
> > > > > >>>> branch-2.11 already contains the necessary config for
> cancelling duplicate builds.
> > > > > >>>>
> > > > > >>>> The benefit of the above change is that when multiple commits
> are cherry-picked to a branch at once, only the build of the last commit
> will get run eventually. The builds for the intermediate commits will get
> cancelled. Obviously there's a tradeoff here that we don't get the
> information if one of the earlier commits breaks the build. It's the cost
> that we need to pay. Nevertheless our build is so flaky that it's hard to
> determine whether a failed build result is only caused by bad flaky test or
> whether it's an actual failure. Because of this we don't lose anything by
> cancelling builds. It's more important to save build resources. In the
> maintenance branches for 2.10 and older, the average total build time
> consumed is around 20 hours which is a lot.
> > > > > >>>>
> > > > > >>>> At this time, the overhead of maintenance branch builds
> doesn't seem to be the source of the problems. There must be some other
> issue which is possibly related to exceeding a usage quota. Hopefully we
> get the CI slowness issue solved asap.
> > > > > >>>>
> > > > > >>>> BR,
> > > > > >>>>
> > > > > >>>> Lari
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > > >>>>> Hi,
> > > > > >>>>>
> > > > > >>>>> GitHub Actions builds have been piling up in the build queue
> in the last few days.
> > > > > >>>>> I posted on builds@apache.org
> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
> about this issue.
> > > > > >>>>> There's also a thread on the-asf slack,
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > > >>>>>
> > > > > >>>>> It seems that our build queue is finally getting picked up,
> but it would be great to see if we hit quota and whether that is the cause
> of pauses.
> > > > > >>>>>
> > > > > >>>>> Another issue is that the master branch broke after merging
> 2 conflicting PRs.
> > > > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 .
> > > > > >>>>>
> > > > > >>>>> Merging PRs will be slow until we have these 2 problems
> solved and existing PRs rebased over the changes. Let's prioritize merging
> #17300 before pushing more changes.
> > > > > >>>>>
> > > > > >>>>> I'd like to point out that a good way to get build feedback
> before sending a PR, is to run builds on your personal GitHub Actions CI.
> The benefit of this is that it doesn't consume the shared quota and builds
> usually start instantly.
> > > > > >>>>> There are instructions in the contributors guide about this.
> > > > > >>>>>
> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar to run
> builds on your personal GitHub Actions CI.
> > > > > >>>>>
> > > > > >>>>> BR,
> > > > > >>>>>
> > > > > >>>>> Lari
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
>

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
If my assumption about the GitHub usage metrics bug in the GitHub Actions build job queue fairness algorithm is correct, what would help is running the flaky unit test group outside of the Pulsar CI workflow. In that case, the impact of the flawed usage metrics would be limited.

The example at https://github.com/apache/pulsar/actions/runs/3003787409/usage shows this flaw, as explained in the previous email: the total reported execution time in that report is 1d 1h 40m 21s, while the actual usage is only about a third of that.

When we move the most commonly failing job out of the Pulsar CI workflow, the impact of the possible usage metrics bug would be much smaller. I hope GitHub Support responds to my ticket and queries about this bug; it might take up to 7 days to get a reply, and even longer for technical questions. In the meantime we need a solution for getting past this CI slowness issue.
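
To make the idea of moving the flaky group out more concrete, a separate
non-required workflow could look roughly like the outline below. This is only
a sketch under my assumptions (the workflow name, runner image, JDK version
and the Maven module/group flags are illustrative, not the actual change):

name: Pulsar CI Flaky
on:
  pull_request:
    branches:
      - master
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  flaky-unit-tests:
    runs-on: ubuntu-20.04
    timeout-minutes: 120
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-java@v2
        with:
          distribution: 'temurin'
          java-version: 17
      # the module and TestNG group names are assumptions for this sketch
      - run: mvn -B -ntp test -pl pulsar-broker -Dgroups=flaky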

-Lari



On 2022/09/08 06:34:42 Lari Hotari wrote:
> My current assumption of the CI slowness problem is that the usage metrics for Apache Pulsar builds on GitHub side is done incorrectly and that is resulting in apache/pulsar builds getting throttled. This assumption might be wrong, but it's the best guess at the moment.
> 
> The facts that support this assumption is that when re-running failed jobs in a workflow, the execution times for previously successful jobs get counted as if they have all run:
> Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
> The reported total usage is about 3x than the actual usage.
> 
> The assumption that I have is that the "fairness algorithm" that GitHub uses to provide all Apache projects about the same amount of GitHub Actions resources would take this flawed usage as the basis of it's decisions and it decides to throttle apache/pulsar builds.
> 
> The reason why we are getting hit by this now is that there is a high number of flaky test failures that cause almost every build to fail and we have been re-running a lot of builds.
> 
> The other fact to support the theory of flawed usage metrics used in the fairness algorithm is that other Apache projects aren't reporting issues about GitHub Actions slowness. This is mentioned in Jarek Potiuk's comments on INFRA-23633 [1]:
> > Unlike the case 2 years ago, the problem is not affecting all projects. In Apache Airflow we do > not see any particular slow-down with Public Runners at this moment (just checked - > 
> > everything is "as usual").. So I'd say it is something specific to Pulsar not to "ASF" as a whole.
> 
> There are also other comments from Jarek about the GitHub "fairness algorithm" (comment [2], other comment [3])
> > But I believe the current problem is different - it might be (looking at your jobs) simply a bug 
> > in GA that you hit or indeed your demands are simply too high. 
> 
> I have opened tickets (2 tickets: 2 days ago and yesterday) to support.github.com and there hasn't been any response to the ticket. It might take up to 7 days to get a response. We cannot rely on GitHub Support resolving this issue.
> 
> I propose that we go ahead with the previously suggested action plan
> > One possible way forward:
> > 1. Cancel all existing builds in_progress or queued
> > 2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
> > 3. Wait for build to run for .asf.yaml change, merge it
> > 4. Disable all workflows
> > 5. Process specific PRs manually to improve the situation.
> >    - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
> >    - Quarantine all very flaky tests so that everyone doesn't waste time with those. It should be possible to merge a PR even when a quarantined test fails.
> > 6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up
> > 7. Enable workflows
> > 8. Start processing PRs with checks to see if things are handled in a better way.
> > 9. When things are stable, enable required checks again in .asf.yaml, in the meantime be careful about merging PRs
> > 10. Fix quarantined flaky tests
> 
> To clarify, steps 1-6 would be done optimally in 1 day and we would stop processing ordinary PRs during this time. We would only handle PRs that fix the CI situation during this exceptional period.
> 
> -Lari
> 
> Links to Jarek's comments:
> [1] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
> [2] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> [3] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
> 
> On 2022/09/07 17:01:43 Lari Hotari wrote:
> > One possible way forward:
> > 1. Cancel all existing builds in_progress or queued
> > 2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
> > 3. Wait for build to run for .asf.yaml change, merge it
> > 4. Disable all workflows
> > 5. Process specific PRs manually to improve the situation.
> >    - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
> >    - Quarantine all very flaky tests so that everyone doesn't waste time with those. It should be possible to merge a PR even when a quarantined test fails.
> > 6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up
> > 7. Enable workflows
> > 8. Start processing PRs with checks to see if things are handled in a better way.
> > 9. When things are stable, enable required checks again in .asf.yaml, in the meantime be careful about merging PRs
> > 10. Fix quarantined flaky tests
> > 
> > -Lari
> > 
> > On 2022/09/07 16:47:09 Lari Hotari wrote:
> > > The problem with CI is becoming worse. The build queue is 235 jobs now and the queue time is over 7 hours.
> > > 
> > > We will need to start shedding load in the build queue and get some fixes in.
> > > https://issues.apache.org/jira/browse/INFRA-23633 continues to contain details about some activities. I have created 2 GitHub Support tickets, but usually it takes up to a week to get a response.
> > > 
> > > I have some assumptions about the issue, but they are just assumptions.
> > > One oddity is that when re-running failed jobs is used in a large workflow, the execution times for previously successful jobs get counted as if they have run. 
> > > Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > The reported usage is about 3x than the actual usage.
> > > The assumption that I have is that the "fairness algorithm" that GitHub uses to provide all Apache projects about the same amount of GitHub Actions resources would take this flawed usage as the basis of it's decisions.
> > > The reason why we are getting hit by this now is that there is a high number of flaky test failures that cause almost every build to fail and we are re-running a lot of builds.
> > > 
> > > Another problem there is that the GitHub Actions search doesn't always show all workflow runs that are running. This has happened before when the GitHub Actions workflow search index was corrupted. GitHub Support resolved that by rebuilding the search index with some manual admin operation behind the scenes.
> > > 
> > > I'm proposing that we start shedding load from CI by cancelling build jobs and selecting which jobs to process so that we get the CI issue resolved. We might also have to disable required checks so that we have some way to get changes merged while CI doesn't work properly.
> > > 
> > > I'm expecting lazy consensus on fixing CI unless someone proposes a better plan. Let's keep everyone informed in this mailing list thread.
> > > 
> > > -Lari
> > > 
> > > 
> > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > We are going to need to take actions to fix our problems. See https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > 
> > > > Jarek has done a large amount of GitHub Action work with Apache Airflow and his suggestions might be helpful. One of his suggestions was Apache Yetus. I think he means using the Maven plugins - https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > 
> > > > 
> > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org> wrote:
> > > > > 
> > > > > The Apache Infra ticket is https://issues.apache.org/jira/browse/INFRA-23633 . 
> > > > > 
> > > > > -Lari
> > > > > 
> > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > >> I asked for an update on the Apache org GitHub Actions usage stats from Gavin McDonald on the-asf slack in this thread: https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
> > > > >> 
> > > > >> I hope we get this issue resolved since it delays PR processing a lot.
> > > > >> 
> > > > >> -Lari
> > > > >> 
> > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > >>> Pulsar CI continues to be congested, and the build queue [1] is very long at the moment. There are 147 build jobs in the queue and 16 jobs in progress at the moment.
> > > > >>> 
> > > > >>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous emails in this thread.
> > > > >>> 
> > > > >>> -Lari
> > > > >>> 
> > > > >>> [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > >>> 
> > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > >>>> Pulsar CI continues to be congested, and the build queue is long.
> > > > >>>> 
> > > > >>>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous email in this thread.
> > > > >>>> 
> > > > >>>> Some updates:
> > > > >>>> 
> > > > >>>> There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in GitHub UI, but it produced invalid results.
> > > > >>>> 
> > > > >>>> I made a change to mitigate a source of additional GitHub Actions overhead. 
> > > > >>>> In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs. 
> > > > >>>> 
> > > > >>>> The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> > > > >>>> concurrency:
> > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > > >>>>  cancel-in-progress: true
> > > > >>>> 
> > > > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > > > >>>> 
> > > > >>>> branch-2.10 change:
> > > > >>>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > >>>> branch-2.9 change:
> > > > >>>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > >>>> branch-2.8 change:
> > > > >>>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > >>>> branch-2.7:
> > > > >>>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > >>>> 
> > > > >>>> branch-2.11 already contains the necessary config for cancelling duplicate builds.
> > > > >>>> 
> > > > >>>> The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will get run eventually. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff here that we don't get the information if one of the earlier commits breaks the build. It's the cost that we need to pay. Nevertheless our build is so flaky that it's hard to determine whether a failed build result is only caused by bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours which is a lot.
> > > > >>>> 
> > > > >>>> At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue which is possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
> > > > >>>> 
> > > > >>>> BR,
> > > > >>>> 
> > > > >>>> Lari
> > > > >>>> 
> > > > >>>> 
> > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > >>>>> Hi,
> > > > >>>>> 
> > > > >>>>> GitHub Actions builds have been piling up in the build queue in the last few days.
> > > > >>>>> I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> > > > >>>>> There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> > > > >>>>> 
> > > > >>>>> It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
> > > > >>>>> 
> > > > >>>>> Another issue is that the master branch broke after merging 2 conflicting PRs. 
> > > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 . 
> > > > >>>>> 
> > > > >>>>> Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> > > > >>>>> 
> > > > >>>>> I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> > > > >>>>> There are instructions in the contributors guide about this. 
> > > > >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > >>>>> You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> > > > >>>>> 
> > > > >>>>> BR,
> > > > >>>>> 
> > > > >>>>> Lari
> > > > >>>>> 
> > > > >>>>> 
> > > > >>>>> 
> > > > >>>>> 
> > > > >>>>> 
> > > > >>>>> 
> > > > >>>>> 
> > > > >>>>> 
> > > > >>>> 
> > > > >>> 
> > > > >> 
> > > > 
> > > > 
> > > 
> > 
> 

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
My current assumption about the CI slowness problem is that the usage metrics for Apache Pulsar builds are calculated incorrectly on GitHub's side, and that this results in apache/pulsar builds getting throttled. This assumption might be wrong, but it's the best guess at the moment.

The fact that supports this assumption is that when re-running failed jobs in a workflow, the execution times of previously successful jobs get counted as if they had all run again:
Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
The reported total usage is about 3x the actual usage.

My assumption is that the "fairness algorithm" that GitHub uses to give all Apache projects about the same amount of GitHub Actions resources takes this flawed usage as the basis of its decisions and decides to throttle apache/pulsar builds.

The reason we are getting hit by this now is that a high number of flaky test failures cause almost every build to fail, and we have been re-running a lot of builds.

The other fact supporting the theory of flawed usage metrics in the fairness algorithm is that other Apache projects aren't reporting issues with GitHub Actions slowness. This is mentioned in Jarek Potiuk's comments on INFRA-23633 [1]:
> Unlike the case 2 years ago, the problem is not affecting all projects. In Apache Airflow we do not see any particular slow-down with Public Runners at this moment (just checked - everything is "as usual").. So I'd say it is something specific to Pulsar not to "ASF" as a whole.

There are also other comments from Jarek about the GitHub "fairness algorithm" (comment [2], other comment [3])
> But I believe the current problem is different - it might be (looking at your jobs) simply a bug 
> in GA that you hit or indeed your demands are simply too high. 

I have opened two tickets with support.github.com (one 2 days ago and one yesterday), and there hasn't been any response to either. It might take up to 7 days to get a response. We cannot rely on GitHub Support to resolve this issue.

I propose that we go ahead with the previously suggested action plan:
> One possible way forward:
> 1. Cancel all existing builds in_progress or queued
> 2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
> 3. Wait for build to run for .asf.yaml change, merge it
> 4. Disable all workflows
> 5. Process specific PRs manually to improve the situation.
>    - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
>    - Quarantine all very flaky tests so that everyone doesn't waste time with those. It should be possible to merge a PR even when a quarantined test fails.
> 6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up
> 7. Enable workflows
> 8. Start processing PRs with checks to see if things are handled in a better way.
> 9. When things are stable, enable required checks again in .asf.yaml, in the meantime be careful about merging PRs
> 10. Fix quarantined flaky tests

To clarify: steps 1-6 would optimally be done within one day, and we would stop processing ordinary PRs during this time. During this exceptional period, we would only handle PRs that fix the CI situation.

-Lari

Links to Jarek's comments:
[1] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600749
[2] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893
[3] https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600893&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17600893

On 2022/09/07 17:01:43 Lari Hotari wrote:
> One possible way forward:
> 1. Cancel all existing builds in_progress or queued
> 2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
> 3. Wait for build to run for .asf.yaml change, merge it
> 4. Disable all workflows
> 5. Process specific PRs manually to improve the situation.
>    - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
>    - Quarantine all very flaky tests so that everyone doesn't waste time with those. It should be possible to merge a PR even when a quarantined test fails.
> 6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up
> 7. Enable workflows
> 8. Start processing PRs with checks to see if things are handled in a better way.
> 9. When things are stable, enable required checks again in .asf.yaml, in the meantime be careful about merging PRs
> 10. Fix quarantined flaky tests
> 
> -Lari
> 
> On 2022/09/07 16:47:09 Lari Hotari wrote:
> > The problem with CI is becoming worse. The build queue is 235 jobs now and the queue time is over 7 hours.
> > 
> > We will need to start shedding load in the build queue and get some fixes in.
> > https://issues.apache.org/jira/browse/INFRA-23633 continues to contain details about some activities. I have created 2 GitHub Support tickets, but usually it takes up to a week to get a response.
> > 
> > I have some assumptions about the issue, but they are just assumptions.
> > One oddity is that when re-running failed jobs is used in a large workflow, the execution times for previously successful jobs get counted as if they have run. 
> > Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > The reported usage is about 3x than the actual usage.
> > The assumption that I have is that the "fairness algorithm" that GitHub uses to provide all Apache projects about the same amount of GitHub Actions resources would take this flawed usage as the basis of it's decisions.
> > The reason why we are getting hit by this now is that there is a high number of flaky test failures that cause almost every build to fail and we are re-running a lot of builds.
> > 
> > Another problem there is that the GitHub Actions search doesn't always show all workflow runs that are running. This has happened before when the GitHub Actions workflow search index was corrupted. GitHub Support resolved that by rebuilding the search index with some manual admin operation behind the scenes.
> > 
> > I'm proposing that we start shedding load from CI by cancelling build jobs and selecting which jobs to process so that we get the CI issue resolved. We might also have to disable required checks so that we have some way to get changes merged while CI doesn't work properly.
> > 
> > I'm expecting lazy consensus on fixing CI unless someone proposes a better plan. Let's keep everyone informed in this mailing list thread.
> > 
> > -Lari
> > 
> > 
> > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > We are going to need to take actions to fix our problems. See https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > 
> > > Jarek has done a large amount of GitHub Action work with Apache Airflow and his suggestions might be helpful. One of his suggestions was Apache Yetus. I think he means using the Maven plugins - https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > 
> > > 
> > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org> wrote:
> > > > 
> > > > The Apache Infra ticket is https://issues.apache.org/jira/browse/INFRA-23633 . 
> > > > 
> > > > -Lari
> > > > 
> > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > >> I asked for an update on the Apache org GitHub Actions usage stats from Gavin McDonald on the-asf slack in this thread: https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
> > > >> 
> > > >> I hope we get this issue resolved since it delays PR processing a lot.
> > > >> 
> > > >> -Lari
> > > >> 
> > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > >>> Pulsar CI continues to be congested, and the build queue [1] is very long at the moment. There are 147 build jobs in the queue and 16 jobs in progress at the moment.
> > > >>> 
> > > >>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous emails in this thread.
> > > >>> 
> > > >>> -Lari
> > > >>> 
> > > >>> [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > >>> 
> > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > >>>> Pulsar CI continues to be congested, and the build queue is long.
> > > >>>> 
> > > >>>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous email in this thread.
> > > >>>> 
> > > >>>> Some updates:
> > > >>>> 
> > > >>>> There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in GitHub UI, but it produced invalid results.
> > > >>>> 
> > > >>>> I made a change to mitigate a source of additional GitHub Actions overhead. 
> > > >>>> In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs. 
> > > >>>> 
> > > >>>> The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> > > >>>> concurrency:
> > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > >>>>  cancel-in-progress: true
> > > >>>> 
> > > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > > >>>> 
> > > >>>> branch-2.10 change:
> > > >>>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > >>>> branch-2.9 change:
> > > >>>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > >>>> branch-2.8 change:
> > > >>>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > >>>> branch-2.7:
> > > >>>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > >>>> 
> > > >>>> branch-2.11 already contains the necessary config for cancelling duplicate builds.
> > > >>>> 
> > > >>>> The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will get run eventually. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff here that we don't get the information if one of the earlier commits breaks the build. It's the cost that we need to pay. Nevertheless our build is so flaky that it's hard to determine whether a failed build result is only caused by bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours which is a lot.
> > > >>>> 
> > > >>>> At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue which is possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
> > > >>>> 
> > > >>>> BR,
> > > >>>> 
> > > >>>> Lari
> > > >>>> 
> > > >>>> 
> > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > >>>>> Hi,
> > > >>>>> 
> > > >>>>> GitHub Actions builds have been piling up in the build queue in the last few days.
> > > >>>>> I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> > > >>>>> There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> > > >>>>> 
> > > >>>>> It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
> > > >>>>> 
> > > >>>>> Another issue is that the master branch broke after merging 2 conflicting PRs. 
> > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 . 
> > > >>>>> 
> > > >>>>> Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> > > >>>>> 
> > > >>>>> I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> > > >>>>> There are instructions in the contributors guide about this. 
> > > >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > >>>>> You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> > > >>>>> 
> > > >>>>> BR,
> > > >>>>> 
> > > >>>>> Lari
> > > >>>>> 
> > > >>>>> 
> > > >>>>> 
> > > >>>>> 
> > > >>>>> 
> > > >>>>> 
> > > >>>>> 
> > > >>>>> 
> > > >>>> 
> > > >>> 
> > > >> 
> > > 
> > > 
> > 
> 

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
One possible way forward:
1. Cancel all existing builds in_progress or queued
2. Edit .asf.yaml and drop the "required checks" requirement for merging PRs.
3. Wait for the build of the .asf.yaml change to run, then merge it
4. Disable all workflows
5. Process specific PRs manually to improve the situation.
   - Make GHA workflow improvements such as https://github.com/apache/pulsar/pull/17491 and https://github.com/apache/pulsar/pull/17490
   - Quarantine all very flaky tests so that nobody wastes time on them. It should be possible to merge a PR even when a quarantined test fails.
6. Rebase PRs (or close and re-open) that would be processed next so that changes are picked up
7. Enable workflows
8. Start processing PRs with checks again, to see whether things are handled in a better way.
9. When things are stable, enable required checks again in .asf.yaml; in the meantime, be careful about merging PRs
10. Fix quarantined flaky tests

-Lari

On 2022/09/07 16:47:09 Lari Hotari wrote:
> The problem with CI is becoming worse. The build queue is 235 jobs now and the queue time is over 7 hours.
> 
> We will need to start shedding load in the build queue and get some fixes in.
> https://issues.apache.org/jira/browse/INFRA-23633 continues to contain details about some activities. I have created 2 GitHub Support tickets, but usually it takes up to a week to get a response.
> 
> I have some assumptions about the issue, but they are just assumptions.
> One oddity is that when re-running failed jobs is used in a large workflow, the execution times for previously successful jobs get counted as if they have run. 
> Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
> The reported usage is about 3x than the actual usage.
> The assumption that I have is that the "fairness algorithm" that GitHub uses to provide all Apache projects about the same amount of GitHub Actions resources would take this flawed usage as the basis of it's decisions.
> The reason why we are getting hit by this now is that there is a high number of flaky test failures that cause almost every build to fail and we are re-running a lot of builds.
> 
> Another problem there is that the GitHub Actions search doesn't always show all workflow runs that are running. This has happened before when the GitHub Actions workflow search index was corrupted. GitHub Support resolved that by rebuilding the search index with some manual admin operation behind the scenes.
> 
> I'm proposing that we start shedding load from CI by cancelling build jobs and selecting which jobs to process so that we get the CI issue resolved. We might also have to disable required checks so that we have some way to get changes merged while CI doesn't work properly.
> 
> I'm expecting lazy consensus on fixing CI unless someone proposes a better plan. Let's keep everyone informed in this mailing list thread.
> 
> -Lari
> 
> 
> On 2022/09/06 14:41:07 Dave Fisher wrote:
> > We are going to need to take actions to fix our problems. See https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > 
> > Jarek has done a large amount of GitHub Action work with Apache Airflow and his suggestions might be helpful. One of his suggestions was Apache Yetus. I think he means using the Maven plugins - https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > 
> > 
> > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org> wrote:
> > > 
> > > The Apache Infra ticket is https://issues.apache.org/jira/browse/INFRA-23633 . 
> > > 
> > > -Lari
> > > 
> > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > >> I asked for an update on the Apache org GitHub Actions usage stats from Gavin McDonald on the-asf slack in this thread: https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
> > >> 
> > >> I hope we get this issue resolved since it delays PR processing a lot.
> > >> 
> > >> -Lari
> > >> 
> > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > >>> Pulsar CI continues to be congested, and the build queue [1] is very long at the moment. There are 147 build jobs in the queue and 16 jobs in progress at the moment.
> > >>> 
> > >>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous emails in this thread.
> > >>> 
> > >>> -Lari
> > >>> 
> > >>> [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > >>> 
> > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > >>>> Pulsar CI continues to be congested, and the build queue is long.
> > >>>> 
> > >>>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous email in this thread.
> > >>>> 
> > >>>> Some updates:
> > >>>> 
> > >>>> There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in GitHub UI, but it produced invalid results.
> > >>>> 
> > >>>> I made a change to mitigate a source of additional GitHub Actions overhead. 
> > >>>> In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs. 
> > >>>> 
> > >>>> The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> > >>>> concurrency:
> > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > >>>>  cancel-in-progress: true
> > >>>> 
> > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > >>>> 
> > >>>> branch-2.10 change:
> > >>>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > >>>> branch-2.9 change:
> > >>>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > >>>> branch-2.8 change:
> > >>>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > >>>> branch-2.7:
> > >>>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > >>>> 
> > >>>> branch-2.11 already contains the necessary config for cancelling duplicate builds.
> > >>>> 
> > >>>> The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will get run eventually. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff here that we don't get the information if one of the earlier commits breaks the build. It's the cost that we need to pay. Nevertheless our build is so flaky that it's hard to determine whether a failed build result is only caused by bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours which is a lot.
> > >>>> 
> > >>>> At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue which is possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
> > >>>> 
> > >>>> BR,
> > >>>> 
> > >>>> Lari
> > >>>> 
> > >>>> 
> > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > >>>>> Hi,
> > >>>>> 
> > >>>>> GitHub Actions builds have been piling up in the build queue in the last few days.
> > >>>>> I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> > >>>>> There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> > >>>>> 
> > >>>>> It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
> > >>>>> 
> > >>>>> Another issue is that the master branch broke after merging 2 conflicting PRs. 
> > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 . 
> > >>>>> 
> > >>>>> Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> > >>>>> 
> > >>>>> I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> > >>>>> There are instructions in the contributors guide about this. 
> > >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > >>>>> You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> > >>>>> 
> > >>>>> BR,
> > >>>>> 
> > >>>>> Lari
> > >>>>> 
> > >>>>> 
> > >>>>> 
> > >>>>> 
> > >>>>> 
> > >>>>> 
> > >>>>> 
> > >>>>> 
> > >>>> 
> > >>> 
> > >> 
> > 
> > 
> 

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
On 2022/09/07 17:27:45 tison wrote:
> Today Pulsar repo runs almost up to one worflow run at the same time. It's
> a new situation I didn't notice before.
> 
> > drop the "required checks"
> 
> This can be dangerous to the repo status. I think the essential problem we
> meet here is about prioritizing specific PR, instead of releasing the guard
> to all PRs.

Life is dangerous. :)
I suggested dropping required checks only until the specific PRs that address the CI issues have been merged. If we don't drop the "required checks" temporarily, there will be additional delays; it's pointless to keep required checks enabled while fixing the issue.
In emergency situations, it's also possible to push changes directly to the master branch, bypassing PRs. I'm not saying that we should do that; it's just that we also have that option available.

> 
> > Fix quarantined flaky tests
> 
> But yes, to overcome the workload brought by unnecessary reruns, it can be
> a solution that we treat all tests as "unstable" and un-require them while
> adding back in a timing manner.

Dropping "required checks" can solve that. Moving all tests to the flaky or quarantine groups doesn't make sense as a solution; it's better to disable required checks and disable specific workflows temporarily. I'm just clarifying that most of the action plan I proposed should be implemented within 1-2 days. We already have solutions for moving flaky tests to groups that don't block merging, and we should do that for the most flaky tests. There's no point in rerunning known flaky tests.
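
For reference, the required checks live in .asf.yaml under the branch
protection settings. A sketch of the relevant block is below (the context
name is an example, not our actual list); step 2 of the plan would
temporarily remove or empty the contexts list, and step 9 would restore it:

github:
  protected_branches:
    master:
      required_status_checks:
        # example context name only, for illustration
        contexts:
          - "Pulsar CI checks completed"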

-Lari


> 
> Best,
> tison.
> 
> 
> Lari Hotari <lh...@apache.org> 于2022年9月8日周四 01:15写道:
> 
> > On 2022/09/07 16:59:33 tison wrote:
> > > > selecting which jobs to process
> > >
> > > Do you have a patch to implement this? IIRC it requires interacting with
> > > outside service or at least we may add an ok-to-test label.
> >
> > Very good idea, I didn't think that far ahead. It seems that Apache Spark
> > has some solution
> > since in the the-asf slack channel discussion it was mentioned that Spark
> > requires
> > contributors to run validation in their own personal GHA quota.
> > I don't know how that is achieved.
> >
> > As you proposed, one possible solution would be to have a workflow that
> > only proceeds
> > when there's a "ok-to-test" label on the PR.
> >
> > For the immediate selection of jobs to process, I have ways to clear the
> > GHA build queue
> > for apache/pulsar using the GHA API.
> > I clarified the proposed action plan in a follow up message to the thread
> > [1].
> > We would primarily process PRs which help to get out of the situation
> > where we are.
> >
> > It would also be helpful if there would be a way to escalate
> > ASF INFRA support and GitHub Support. However, the ticket
> > https://issues.apache.org/jira/browse/INFRA-23633 discussion doesn't give
> > much hope
> > of this possibility.
> >
> >
> > -Lari
> >
> > [1] https://lists.apache.org/thread/rpq12tzm4hx8kozpkphd2jyqr8cj0yj5
> >
> > On 2022/09/07 16:59:33 tison wrote:
> > > > selecting which jobs to process
> > >
> > > Do you have a patch to implement this? IIRC it requires interacting with
> > > outside service or at least we may add an ok-to-test label.
> > >
> > > Besides, it increases committers/PMC members' workload - be aware of it,
> > or
> > > most of contributions will stall.
> > >
> > > Best,
> > > tison.
> > >
> > >
> > > Lari Hotari <lh...@apache.org> 于2022年9月8日周四 00:47写道:
> > >
> > > > The problem with CI is becoming worse. The build queue is 235 jobs now
> > and
> > > > the queue time is over 7 hours.
> > > >
> > > > We will need to start shedding load in the build queue and get some
> > fixes
> > > > in.
> > > > https://issues.apache.org/jira/browse/INFRA-23633 continues to contain
> > > > details about some activities. I have created 2 GitHub Support
> > tickets, but
> > > > usually it takes up to a week to get a response.
> > > >
> > > > I have some assumptions about the issue, but they are just assumptions.
> > > > One oddity is that when re-running failed jobs is used in a large
> > > > workflow, the execution times for previously successful jobs get
> > counted as
> > > > if they have run.
> > > > Here's an example:
> > > > https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > > The reported usage is about 3x than the actual usage.
> > > > The assumption that I have is that the "fairness algorithm" that GitHub
> > > > uses to provide all Apache projects about the same amount of GitHub
> > Actions
> > > > resources would take this flawed usage as the basis of it's decisions.
> > > > The reason why we are getting hit by this now is that there is a high
> > > > number of flaky test failures that cause almost every build to fail
> > and we
> > > > are re-running a lot of builds.
> > > >
> > > > Another problem there is that the GitHub Actions search doesn't always
> > > > show all workflow runs that are running. This has happened before when
> > the
> > > > GitHub Actions workflow search index was corrupted. GitHub Support
> > resolved
> > > > that by rebuilding the search index with some manual admin operation
> > behind
> > > > the scenes.
> > > >
> > > > I'm proposing that we start shedding load from CI by cancelling build
> > jobs
> > > > and selecting which jobs to process so that we get the CI issue
> > resolved.
> > > > We might also have to disable required checks so that we have some way
> > to
> > > > get changes merged while CI doesn't work properly.
> > > >
> > > > I'm expecting lazy consensus on fixing CI unless someone proposes a
> > better
> > > > plan. Let's keep everyone informed in this mailing list thread.
> > > >
> > > > -Lari
> > > >
> > > >
> > > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > > We are going to need to take actions to fix our problems. See
> > > >
> > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > > >
> > > > > Jarek has done a large amount of GitHub Action work with Apache
> > Airflow
> > > > and his suggestions might be helpful. One of his suggestions was Apache
> > > > Yetus. I think he means using the Maven plugins -
> > > > https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > > >
> > > > >
> > > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org>
> > wrote:
> > > > > >
> > > > > > The Apache Infra ticket is
> > > > https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > > >
> > > > > > -Lari
> > > > > >
> > > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > > >> I asked for an update on the Apache org GitHub Actions usage stats
> > > > from Gavin McDonald on the-asf slack in this thread:
> > > >
> > https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > > > .
> > > > > >>
> > > > > >> I hope we get this issue resolved since it delays PR processing a
> > lot.
> > > > > >>
> > > > > >> -Lari
> > > > > >>
> > > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > > >>> Pulsar CI continues to be congested, and the build queue [1] is
> > very
> > > > long at the moment. There are 147 build jobs in the queue and 16 jobs
> > in
> > > > progress at the moment.
> > > > > >>>
> > > > > >>> I would strongly advice everyone to use "personal CI" to mitigate
> > > > the issue of the long delay of CI feedback. You can simply open a PR to
> > > > your own personal fork of apache/pulsar to run the builds in your
> > "personal
> > > > CI". There's more details in the previous emails in this thread.
> > > > > >>>
> > > > > >>> -Lari
> > > > > >>>
> > > > > >>> [1] - build queue:
> > > > https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > > >>>
> > > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > > >>>> Pulsar CI continues to be congested, and the build queue is
> > long.
> > > > > >>>>
> > > > > >>>> I would strongly advice everyone to use "personal CI" to
> > mitigate
> > > > the issue of the long delay of CI feedback. You can simply open a PR to
> > > > your own personal fork of apache/pulsar to run the builds in your
> > "personal
> > > > CI". There's more details in the previous email in this thread.
> > > > > >>>>
> > > > > >>>> Some updates:
> > > > > >>>>
> > > > > >>>> There has been a discussion with Gavin McDonald from ASF infra
> > on
> > > > the-asf slack about getting usage reports from GitHub to support the
> > > > investigation. Slack thread is the same one mentioned in the previous
> > > > email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279
> > .
> > > > Gavin already requested the usage report in GitHub UI, but it produced
> > > > invalid results.
> > > > > >>>>
> > > > > >>>> I made a change to mitigate a source of additional GitHub
> > Actions
> > > > overhead.
> > > > > >>>> In the past, each cherry-picked commit to a maintenance branch
> > of
> > > > Pulsar has triggered a lot of workflow runs.
> > > > > >>>>
> > > > > >>>> The solution for cancelling duplicate builds automatically is to
> > > > add this definition to the workflow definition:
> > > > > >>>> concurrency:
> > > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > > > >>>>  cancel-in-progress: true
> > > > > >>>>
> > > > > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > > > > >>>>
> > > > > >>>> branch-2.10 change:
> > > > > >>>>
> > > >
> > https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > > >>>> branch-2.9 change:
> > > > > >>>>
> > > >
> > https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > > >>>> branch-2.8 change:
> > > > > >>>>
> > > >
> > https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > > >>>> branch-2.7:
> > > > > >>>>
> > > >
> > https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > > >>>>
> > > > > >>>> branch-2.11 already contains the necessary config for cancelling
> > > > duplicate builds.
> > > > > >>>>
> > > > > >>>> The benefit of the above change is that when multiple commits
> > are
> > > > cherry-picked to a branch at once, only the build of the last commit
> > will
> > > > get run eventually. The builds for the intermediate commits will get
> > > > cancelled. Obviously there's a tradeoff here that we don't get the
> > > > information if one of the earlier commits breaks the build. It's the
> > cost
> > > > that we need to pay. Nevertheless our build is so flaky that it's hard
> > to
> > > > determine whether a failed build result is only caused by bad flaky
> > test or
> > > > whether it's an actual failure. Because of this we don't lose anything
> > by
> > > > cancelling builds. It's more important to save build resources. In the
> > > > maintenance branches for 2.10 and older, the average total build time
> > > > consumed is around 20 hours which is a lot.
> > > > > >>>>
> > > > > >>>> At this time, the overhead of maintenance branch builds doesn't
> > > > seem to be the source of the problems. There must be some other issue
> > which
> > > > is possibly related to exceeding a usage quota. Hopefully we get the CI
> > > > slowness issue solved asap.
> > > > > >>>>
> > > > > >>>> BR,
> > > > > >>>>
> > > > > >>>> Lari
> > > > > >>>>
> > > > > >>>>
> > > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > > >>>>> Hi,
> > > > > >>>>>
> > > > > >>>>> GitHub Actions builds have been piling up in the build queue in
> > > > the last few days.
> > > > > >>>>> I posted on builds@apache.org
> > > > https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> > > > created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
> > > > about this issue.
> > > > > >>>>> There's also a thread on the-asf slack,
> > > > https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > > >>>>>
> > > > > >>>>> It seems that our build queue is finally getting picked up,
> > but it
> > > > would be great to see if we hit quota and whether that is the cause of
> > > > pauses.
> > > > > >>>>>
> > > > > >>>>> Another issue is that the master branch broke after merging 2
> > > > conflicting PRs.
> > > > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 .
> > > > > >>>>>
> > > > > >>>>> Merging PRs will be slow until we have these 2 problems solved
> > and
> > > > existing PRs rebased over the changes. Let's prioritize merging #17300
> > > > before pushing more changes.
> > > > > >>>>>
> > > > > >>>>> I'd like to point out that a good way to get build feedback
> > before
> > > > sending a PR, is to run builds on your personal GitHub Actions CI. The
> > > > benefit of this is that it doesn't consume the shared quota and builds
> > > > usually start instantly.
> > > > > >>>>> There are instructions in the contributors guide about this.
> > > > > >>>>>
> > https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > > >>>>> You simply open PRs to your own fork of apache/pulsar to run
> > > > builds on your personal GitHub Actions CI.
> > > > > >>>>>
> > > > > >>>>> BR,
> > > > > >>>>>
> > > > > >>>>> Lari
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>
> > > > > >>>
> > > > > >>
> > > > >
> > > > >
> > > >
> > >
> >
> 

Re: Pulsar CI congested, master branch build broken

Posted by tison <wa...@gmail.com>.
Here is another patch that can reduce unnecessary workload:
https://github.com/apache/pulsar/pull/17529

We don't create flaky-test issues/PRs frequently; it's on the order of
tens per month. The project owners should be able to handle those
manually in a few minutes per month (since the candidates are already
labeled), whereas currently the workflow runs for every issue/PR that is
opened or labeled.

Best,
tison.


tison <wa...@gmail.com> 于2022年9月8日周四 01:27写道:

> Today Pulsar repo runs almost up to one worflow run at the same time. It's
> a new situation I didn't notice before.
>
> > drop the "required checks"
>
> This can be dangerous to the repo status. I think the essential problem
> we meet here is about prioritizing specific PR, instead of releasing the
> guard to all PRs.
>
> > Fix quarantined flaky tests
>
> But yes, to overcome the workload brought by unnecessary reruns, it can be
> a solution that we treat all tests as "unstable" and un-require them while
> adding back in a timing manner.
>
> Best,
> tison.
>
>
> Lari Hotari <lh...@apache.org> 于2022年9月8日周四 01:15写道:
>
>> On 2022/09/07 16:59:33 tison wrote:
>> > > selecting which jobs to process
>> >
>> > Do you have a patch to implement this? IIRC it requires interacting with
>> > outside service or at least we may add an ok-to-test label.
>>
>> Very good idea, I didn't think that far ahead. It seems that Apache Spark
>> has some solution
>> since in the the-asf slack channel discussion it was mentioned that Spark
>> requires
>> contributors to run validation in their own personal GHA quota.
>> I don't know how that is achieved.
>>
>> As you proposed, one possible solution would be to have a workflow that
>> only proceeds
>> when there's a "ok-to-test" label on the PR.
>>
>> For the immediate selection of jobs to process, I have ways to clear the
>> GHA build queue
>> for apache/pulsar using the GHA API.
>> I clarified the proposed action plan in a follow up message to the thread
>> [1].
>> We would primarily process PRs which help to get out of the situation
>> where we are.
>>
>> It would also be helpful if there would be a way to escalate
>> ASF INFRA support and GitHub Support. However, the ticket
>> https://issues.apache.org/jira/browse/INFRA-23633 discussion doesn't
>> give much hope
>> of this possibility.
>>
>>
>> -Lari
>>
>> [1] https://lists.apache.org/thread/rpq12tzm4hx8kozpkphd2jyqr8cj0yj5
>>
>> On 2022/09/07 16:59:33 tison wrote:
>> > > selecting which jobs to process
>> >
>> > Do you have a patch to implement this? IIRC it requires interacting with
>> > outside service or at least we may add an ok-to-test label.
>> >
>> > Besides, it increases committers/PMC members' workload - be aware of
>> it, or
>> > most of contributions will stall.
>> >
>> > Best,
>> > tison.
>> >
>> >
>> > Lari Hotari <lh...@apache.org> 于2022年9月8日周四 00:47写道:
>> >
>> > > The problem with CI is becoming worse. The build queue is 235 jobs
>> now and
>> > > the queue time is over 7 hours.
>> > >
>> > > We will need to start shedding load in the build queue and get some
>> fixes
>> > > in.
>> > > https://issues.apache.org/jira/browse/INFRA-23633 continues to
>> contain
>> > > details about some activities. I have created 2 GitHub Support
>> tickets, but
>> > > usually it takes up to a week to get a response.
>> > >
>> > > I have some assumptions about the issue, but they are just
>> assumptions.
>> > > One oddity is that when re-running failed jobs is used in a large
>> > > workflow, the execution times for previously successful jobs get
>> counted as
>> > > if they have run.
>> > > Here's an example:
>> > > https://github.com/apache/pulsar/actions/runs/3003787409/usage
>> > > The reported usage is about 3x than the actual usage.
>> > > The assumption that I have is that the "fairness algorithm" that
>> GitHub
>> > > uses to provide all Apache projects about the same amount of GitHub
>> Actions
>> > > resources would take this flawed usage as the basis of it's decisions.
>> > > The reason why we are getting hit by this now is that there is a high
>> > > number of flaky test failures that cause almost every build to fail
>> and we
>> > > are re-running a lot of builds.
>> > >
>> > > Another problem there is that the GitHub Actions search doesn't always
>> > > show all workflow runs that are running. This has happened before
>> when the
>> > > GitHub Actions workflow search index was corrupted. GitHub Support
>> resolved
>> > > that by rebuilding the search index with some manual admin operation
>> behind
>> > > the scenes.
>> > >
>> > > I'm proposing that we start shedding load from CI by cancelling build
>> jobs
>> > > and selecting which jobs to process so that we get the CI issue
>> resolved.
>> > > We might also have to disable required checks so that we have some
>> way to
>> > > get changes merged while CI doesn't work properly.
>> > >
>> > > I'm expecting lazy consensus on fixing CI unless someone proposes a
>> better
>> > > plan. Let's keep everyone informed in this mailing list thread.
>> > >
>> > > -Lari
>> > >
>> > >
>> > > On 2022/09/06 14:41:07 Dave Fisher wrote:
>> > > > We are going to need to take actions to fix our problems. See
>> > >
>> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
>> > > >
>> > > > Jarek has done a large amount of GitHub Action work with Apache
>> Airflow
>> > > and his suggestions might be helpful. One of his suggestions was
>> Apache
>> > > Yetus. I think he means using the Maven plugins -
>> > > https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
>> > > >
>> > > >
>> > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org>
>> wrote:
>> > > > >
>> > > > > The Apache Infra ticket is
>> > > https://issues.apache.org/jira/browse/INFRA-23633 .
>> > > > >
>> > > > > -Lari
>> > > > >
>> > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
>> > > > >> I asked for an update on the Apache org GitHub Actions usage
>> stats
>> > > from Gavin McDonald on the-asf slack in this thread:
>> > >
>> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
>> > > .
>> > > > >>
>> > > > >> I hope we get this issue resolved since it delays PR processing
>> a lot.
>> > > > >>
>> > > > >> -Lari
>> > > > >>
>> > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
>> > > > >>> Pulsar CI continues to be congested, and the build queue [1] is
>> very
>> > > long at the moment. There are 147 build jobs in the queue and 16 jobs
>> in
>> > > progress at the moment.
>> > > > >>>
>> > > > >>> I would strongly advice everyone to use "personal CI" to
>> mitigate
>> > > the issue of the long delay of CI feedback. You can simply open a PR
>> to
>> > > your own personal fork of apache/pulsar to run the builds in your
>> "personal
>> > > CI". There's more details in the previous emails in this thread.
>> > > > >>>
>> > > > >>> -Lari
>> > > > >>>
>> > > > >>> [1] - build queue:
>> > > https://github.com/apache/pulsar/actions?query=is%3Aqueued
>> > > > >>>
>> > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
>> > > > >>>> Pulsar CI continues to be congested, and the build queue is
>> long.
>> > > > >>>>
>> > > > >>>> I would strongly advice everyone to use "personal CI" to
>> mitigate
>> > > the issue of the long delay of CI feedback. You can simply open a PR
>> to
>> > > your own personal fork of apache/pulsar to run the builds in your
>> "personal
>> > > CI". There's more details in the previous email in this thread.
>> > > > >>>>
>> > > > >>>> Some updates:
>> > > > >>>>
>> > > > >>>> There has been a discussion with Gavin McDonald from ASF infra
>> on
>> > > the-asf slack about getting usage reports from GitHub to support the
>> > > investigation. Slack thread is the same one mentioned in the previous
>> > > email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279
>> .
>> > > Gavin already requested the usage report in GitHub UI, but it produced
>> > > invalid results.
>> > > > >>>>
>> > > > >>>> I made a change to mitigate a source of additional GitHub
>> Actions
>> > > overhead.
>> > > > >>>> In the past, each cherry-picked commit to a maintenance branch
>> of
>> > > Pulsar has triggered a lot of workflow runs.
>> > > > >>>>
>> > > > >>>> The solution for cancelling duplicate builds automatically is
>> to
>> > > add this definition to the workflow definition:
>> > > > >>>> concurrency:
>> > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
>> > > > >>>>  cancel-in-progress: true
>> > > > >>>>
>> > > > >>>> I added this to all maintenance branch GitHub Actions
>> workflows:
>> > > > >>>>
>> > > > >>>> branch-2.10 change:
>> > > > >>>>
>> > >
>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
>> > > > >>>> branch-2.9 change:
>> > > > >>>>
>> > >
>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
>> > > > >>>> branch-2.8 change:
>> > > > >>>>
>> > >
>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
>> > > > >>>> branch-2.7:
>> > > > >>>>
>> > >
>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
>> > > > >>>>
>> > > > >>>> branch-2.11 already contains the necessary config for
>> cancelling
>> > > duplicate builds.
>> > > > >>>>
>> > > > >>>> The benefit of the above change is that when multiple commits
>> are
>> > > cherry-picked to a branch at once, only the build of the last commit
>> will
>> > > get run eventually. The builds for the intermediate commits will get
>> > > cancelled. Obviously there's a tradeoff here that we don't get the
>> > > information if one of the earlier commits breaks the build. It's the
>> cost
>> > > that we need to pay. Nevertheless our build is so flaky that it's
>> hard to
>> > > determine whether a failed build result is only caused by bad flaky
>> test or
>> > > whether it's an actual failure. Because of this we don't lose
>> anything by
>> > > cancelling builds. It's more important to save build resources. In the
>> > > maintenance branches for 2.10 and older, the average total build time
>> > > consumed is around 20 hours which is a lot.
>> > > > >>>>
>> > > > >>>> At this time, the overhead of maintenance branch builds doesn't
>> > > seem to be the source of the problems. There must be some other issue
>> which
>> > > is possibly related to exceeding a usage quota. Hopefully we get the
>> CI
>> > > slowness issue solved asap.
>> > > > >>>>
>> > > > >>>> BR,
>> > > > >>>>
>> > > > >>>> Lari
>> > > > >>>>
>> > > > >>>>
>> > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
>> > > > >>>>> Hi,
>> > > > >>>>>
>> > > > >>>>> GitHub Actions builds have been piling up in the build queue
>> in
>> > > the last few days.
>> > > > >>>>> I posted on builds@apache.org
>> > > https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
>> > > created INFRA ticket
>> https://issues.apache.org/jira/browse/INFRA-23633
>> > > about this issue.
>> > > > >>>>> There's also a thread on the-asf slack,
>> > > https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
>> > > > >>>>>
>> > > > >>>>> It seems that our build queue is finally getting picked up,
>> but it
>> > > would be great to see if we hit quota and whether that is the cause of
>> > > pauses.
>> > > > >>>>>
>> > > > >>>>> Another issue is that the master branch broke after merging 2
>> > > conflicting PRs.
>> > > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 .
>> > > > >>>>>
>> > > > >>>>> Merging PRs will be slow until we have these 2 problems
>> solved and
>> > > existing PRs rebased over the changes. Let's prioritize merging #17300
>> > > before pushing more changes.
>> > > > >>>>>
>> > > > >>>>> I'd like to point out that a good way to get build feedback
>> before
>> > > sending a PR, is to run builds on your personal GitHub Actions CI. The
>> > > benefit of this is that it doesn't consume the shared quota and builds
>> > > usually start instantly.
>> > > > >>>>> There are instructions in the contributors guide about this.
>> > > > >>>>>
>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
>> > > > >>>>> You simply open PRs to your own fork of apache/pulsar to run
>> > > builds on your personal GitHub Actions CI.
>> > > > >>>>>
>> > > > >>>>> BR,
>> > > > >>>>>
>> > > > >>>>> Lari
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>>
>> > > > >>>>
>> > > > >>>
>> > > > >>
>> > > >
>> > > >
>> > >
>> >
>>
>

Re: Pulsar CI congested, master branch build broken

Posted by tison <wa...@gmail.com>.
Today the Pulsar repo is running at most about one workflow run at a
time. That's a new situation I hadn't noticed before.

> drop the "required checks"

This can be dangerous for the health of the repo. I think the essential
problem we have here is prioritizing specific PRs, rather than removing
the guard for all PRs.

> Fix quarantined flaky tests

But yes, to reduce the workload caused by unnecessary reruns, one
solution could be to treat all tests as "unstable", un-require them, and
then gradually add them back as required checks.

Best,
tison.


Lari Hotari <lh...@apache.org> 于2022年9月8日周四 01:15写道:

> On 2022/09/07 16:59:33 tison wrote:
> > > selecting which jobs to process
> >
> > Do you have a patch to implement this? IIRC it requires interacting with
> > outside service or at least we may add an ok-to-test label.
>
> Very good idea, I didn't think that far ahead. It seems that Apache Spark
> has some solution
> since in the the-asf slack channel discussion it was mentioned that Spark
> requires
> contributors to run validation in their own personal GHA quota.
> I don't know how that is achieved.
>
> As you proposed, one possible solution would be to have a workflow that
> only proceeds
> when there's a "ok-to-test" label on the PR.
>
> For the immediate selection of jobs to process, I have ways to clear the
> GHA build queue
> for apache/pulsar using the GHA API.
> I clarified the proposed action plan in a follow up message to the thread
> [1].
> We would primarily process PRs which help to get out of the situation
> where we are.
>
> It would also be helpful if there would be a way to escalate
> ASF INFRA support and GitHub Support. However, the ticket
> https://issues.apache.org/jira/browse/INFRA-23633 discussion doesn't give
> much hope
> of this possibility.
>
>
> -Lari
>
> [1] https://lists.apache.org/thread/rpq12tzm4hx8kozpkphd2jyqr8cj0yj5
>
> On 2022/09/07 16:59:33 tison wrote:
> > > selecting which jobs to process
> >
> > Do you have a patch to implement this? IIRC it requires interacting with
> > outside service or at least we may add an ok-to-test label.
> >
> > Besides, it increases committers/PMC members' workload - be aware of it,
> or
> > most of contributions will stall.
> >
> > Best,
> > tison.
> >
> >
> > Lari Hotari <lh...@apache.org> 于2022年9月8日周四 00:47写道:
> >
> > > The problem with CI is becoming worse. The build queue is 235 jobs now
> and
> > > the queue time is over 7 hours.
> > >
> > > We will need to start shedding load in the build queue and get some
> fixes
> > > in.
> > > https://issues.apache.org/jira/browse/INFRA-23633 continues to contain
> > > details about some activities. I have created 2 GitHub Support
> tickets, but
> > > usually it takes up to a week to get a response.
> > >
> > > I have some assumptions about the issue, but they are just assumptions.
> > > One oddity is that when re-running failed jobs is used in a large
> > > workflow, the execution times for previously successful jobs get
> counted as
> > > if they have run.
> > > Here's an example:
> > > https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > > The reported usage is about 3x than the actual usage.
> > > The assumption that I have is that the "fairness algorithm" that GitHub
> > > uses to provide all Apache projects about the same amount of GitHub
> Actions
> > > resources would take this flawed usage as the basis of it's decisions.
> > > The reason why we are getting hit by this now is that there is a high
> > > number of flaky test failures that cause almost every build to fail
> and we
> > > are re-running a lot of builds.
> > >
> > > Another problem there is that the GitHub Actions search doesn't always
> > > show all workflow runs that are running. This has happened before when
> the
> > > GitHub Actions workflow search index was corrupted. GitHub Support
> resolved
> > > that by rebuilding the search index with some manual admin operation
> behind
> > > the scenes.
> > >
> > > I'm proposing that we start shedding load from CI by cancelling build
> jobs
> > > and selecting which jobs to process so that we get the CI issue
> resolved.
> > > We might also have to disable required checks so that we have some way
> to
> > > get changes merged while CI doesn't work properly.
> > >
> > > I'm expecting lazy consensus on fixing CI unless someone proposes a
> better
> > > plan. Let's keep everyone informed in this mailing list thread.
> > >
> > > -Lari
> > >
> > >
> > > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > > We are going to need to take actions to fix our problems. See
> > >
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > > >
> > > > Jarek has done a large amount of GitHub Action work with Apache
> Airflow
> > > and his suggestions might be helpful. One of his suggestions was Apache
> > > Yetus. I think he means using the Maven plugins -
> > > https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > > >
> > > >
> > > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org>
> wrote:
> > > > >
> > > > > The Apache Infra ticket is
> > > https://issues.apache.org/jira/browse/INFRA-23633 .
> > > > >
> > > > > -Lari
> > > > >
> > > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > > >> I asked for an update on the Apache org GitHub Actions usage stats
> > > from Gavin McDonald on the-asf slack in this thread:
> > >
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > > .
> > > > >>
> > > > >> I hope we get this issue resolved since it delays PR processing a
> lot.
> > > > >>
> > > > >> -Lari
> > > > >>
> > > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > > >>> Pulsar CI continues to be congested, and the build queue [1] is
> very
> > > long at the moment. There are 147 build jobs in the queue and 16 jobs
> in
> > > progress at the moment.
> > > > >>>
> > > > >>> I would strongly advice everyone to use "personal CI" to mitigate
> > > the issue of the long delay of CI feedback. You can simply open a PR to
> > > your own personal fork of apache/pulsar to run the builds in your
> "personal
> > > CI". There's more details in the previous emails in this thread.
> > > > >>>
> > > > >>> -Lari
> > > > >>>
> > > > >>> [1] - build queue:
> > > https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > > >>>
> > > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > > >>>> Pulsar CI continues to be congested, and the build queue is
> long.
> > > > >>>>
> > > > >>>> I would strongly advice everyone to use "personal CI" to
> mitigate
> > > the issue of the long delay of CI feedback. You can simply open a PR to
> > > your own personal fork of apache/pulsar to run the builds in your
> "personal
> > > CI". There's more details in the previous email in this thread.
> > > > >>>>
> > > > >>>> Some updates:
> > > > >>>>
> > > > >>>> There has been a discussion with Gavin McDonald from ASF infra
> on
> > > the-asf slack about getting usage reports from GitHub to support the
> > > investigation. Slack thread is the same one mentioned in the previous
> > > email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279
> .
> > > Gavin already requested the usage report in GitHub UI, but it produced
> > > invalid results.
> > > > >>>>
> > > > >>>> I made a change to mitigate a source of additional GitHub
> Actions
> > > overhead.
> > > > >>>> In the past, each cherry-picked commit to a maintenance branch
> of
> > > Pulsar has triggered a lot of workflow runs.
> > > > >>>>
> > > > >>>> The solution for cancelling duplicate builds automatically is to
> > > add this definition to the workflow definition:
> > > > >>>> concurrency:
> > > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > > >>>>  cancel-in-progress: true
> > > > >>>>
> > > > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > > > >>>>
> > > > >>>> branch-2.10 change:
> > > > >>>>
> > >
> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > > >>>> branch-2.9 change:
> > > > >>>>
> > >
> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > > >>>> branch-2.8 change:
> > > > >>>>
> > >
> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > > >>>> branch-2.7:
> > > > >>>>
> > >
> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > > >>>>
> > > > >>>> branch-2.11 already contains the necessary config for cancelling
> > > duplicate builds.
> > > > >>>>
> > > > >>>> The benefit of the above change is that when multiple commits
> are
> > > cherry-picked to a branch at once, only the build of the last commit
> will
> > > get run eventually. The builds for the intermediate commits will get
> > > cancelled. Obviously there's a tradeoff here that we don't get the
> > > information if one of the earlier commits breaks the build. It's the
> cost
> > > that we need to pay. Nevertheless our build is so flaky that it's hard
> to
> > > determine whether a failed build result is only caused by bad flaky
> test or
> > > whether it's an actual failure. Because of this we don't lose anything
> by
> > > cancelling builds. It's more important to save build resources. In the
> > > maintenance branches for 2.10 and older, the average total build time
> > > consumed is around 20 hours which is a lot.
> > > > >>>>
> > > > >>>> At this time, the overhead of maintenance branch builds doesn't
> > > seem to be the source of the problems. There must be some other issue
> which
> > > is possibly related to exceeding a usage quota. Hopefully we get the CI
> > > slowness issue solved asap.
> > > > >>>>
> > > > >>>> BR,
> > > > >>>>
> > > > >>>> Lari
> > > > >>>>
> > > > >>>>
> > > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > >>>>> Hi,
> > > > >>>>>
> > > > >>>>> GitHub Actions builds have been piling up in the build queue in
> > > the last few days.
> > > > >>>>> I posted on builds@apache.org
> > > https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> > > created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
> > > about this issue.
> > > > >>>>> There's also a thread on the-asf slack,
> > > https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > > >>>>>
> > > > >>>>> It seems that our build queue is finally getting picked up,
> but it
> > > would be great to see if we hit quota and whether that is the cause of
> > > pauses.
> > > > >>>>>
> > > > >>>>> Another issue is that the master branch broke after merging 2
> > > conflicting PRs.
> > > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 .
> > > > >>>>>
> > > > >>>>> Merging PRs will be slow until we have these 2 problems solved
> and
> > > existing PRs rebased over the changes. Let's prioritize merging #17300
> > > before pushing more changes.
> > > > >>>>>
> > > > >>>>> I'd like to point out that a good way to get build feedback
> before
> > > sending a PR, is to run builds on your personal GitHub Actions CI. The
> > > benefit of this is that it doesn't consume the shared quota and builds
> > > usually start instantly.
> > > > >>>>> There are instructions in the contributors guide about this.
> > > > >>>>>
> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > >>>>> You simply open PRs to your own fork of apache/pulsar to run
> > > builds on your personal GitHub Actions CI.
> > > > >>>>>
> > > > >>>>> BR,
> > > > >>>>>
> > > > >>>>> Lari
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>>
> > > > >>
> > > >
> > > >
> > >
> >
>

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
On 2022/09/07 16:59:33 tison wrote:
> > selecting which jobs to process
> 
> Do you have a patch to implement this? IIRC it requires interacting with
> outside service or at least we may add an ok-to-test label.

Very good idea, I hadn't thought that far ahead. It seems that Apache Spark has some solution
for this: in the the-asf Slack channel discussion it was mentioned that Spark requires
contributors to run validation in their own personal GHA quota.
I don't know how that is achieved.

As you proposed, one possible solution would be to have a workflow that only proceeds
when there's an "ok-to-test" label on the PR.
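
Something along these lines could work. This is only a rough, untested sketch; the workflow name and the build step are placeholders, not our actual CI definition:

name: Pulsar CI (gated on ok-to-test)
on:
  pull_request:
    types: [opened, synchronize, reopened, labeled]
jobs:
  build-and-test:
    # the whole job is skipped until a committer adds the ok-to-test label
    if: contains(github.event.pull_request.labels.*.name, 'ok-to-test')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Build and test
        run: mvn -B -ntp verify  # placeholder for the real build steps

The obvious downside is the one you mention: it adds work for committers, since someone has to label each PR before CI runs.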

For the immediate selection of which jobs to process, I have ways to clear the GHA build queue
for apache/pulsar using the GHA API.
I clarified the proposed action plan in a follow-up message to the thread [1].
We would primarily process PRs that help us get out of the current situation.
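
As an illustration of the kind of clean-up I mean (a rough sketch only, not the exact commands or tooling I would use), a manually triggered workflow could cancel everything that is currently queued, using the gh CLI that is preinstalled on the GitHub-hosted runners:

name: Cancel queued workflow runs
on:
  workflow_dispatch:
jobs:
  cancel-queued:
    runs-on: ubuntu-latest
    permissions:
      actions: write  # allows GITHUB_TOKEN to cancel workflow runs
    steps:
      - name: Cancel all queued runs
        env:
          GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}
          GH_REPO: ${{ github.repository }}
        run: |
          # list the queued runs and cancel each of them
          gh run list --status queued --limit 300 --json databaseId --jq '.[].databaseId' |
            while read -r run_id; do
              gh run cancel "$run_id"
            done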

It would also be helpful if there were a way to escalate this with
ASF INFRA support and GitHub Support. However, the discussion in the ticket
https://issues.apache.org/jira/browse/INFRA-23633 doesn't give much hope
of that.


-Lari

[1] https://lists.apache.org/thread/rpq12tzm4hx8kozpkphd2jyqr8cj0yj5

On 2022/09/07 16:59:33 tison wrote:
> > selecting which jobs to process
> 
> Do you have a patch to implement this? IIRC it requires interacting with
> outside service or at least we may add an ok-to-test label.
> 
> Besides, it increases committers/PMC members' workload - be aware of it, or
> most of contributions will stall.
> 
> Best,
> tison.
> 
> 
> Lari Hotari <lh...@apache.org> 于2022年9月8日周四 00:47写道:
> 
> > The problem with CI is becoming worse. The build queue is 235 jobs now and
> > the queue time is over 7 hours.
> >
> > We will need to start shedding load in the build queue and get some fixes
> > in.
> > https://issues.apache.org/jira/browse/INFRA-23633 continues to contain
> > details about some activities. I have created 2 GitHub Support tickets, but
> > usually it takes up to a week to get a response.
> >
> > I have some assumptions about the issue, but they are just assumptions.
> > One oddity is that when re-running failed jobs is used in a large
> > workflow, the execution times for previously successful jobs get counted as
> > if they have run.
> > Here's an example:
> > https://github.com/apache/pulsar/actions/runs/3003787409/usage
> > The reported usage is about 3x than the actual usage.
> > The assumption that I have is that the "fairness algorithm" that GitHub
> > uses to provide all Apache projects about the same amount of GitHub Actions
> > resources would take this flawed usage as the basis of it's decisions.
> > The reason why we are getting hit by this now is that there is a high
> > number of flaky test failures that cause almost every build to fail and we
> > are re-running a lot of builds.
> >
> > Another problem there is that the GitHub Actions search doesn't always
> > show all workflow runs that are running. This has happened before when the
> > GitHub Actions workflow search index was corrupted. GitHub Support resolved
> > that by rebuilding the search index with some manual admin operation behind
> > the scenes.
> >
> > I'm proposing that we start shedding load from CI by cancelling build jobs
> > and selecting which jobs to process so that we get the CI issue resolved.
> > We might also have to disable required checks so that we have some way to
> > get changes merged while CI doesn't work properly.
> >
> > I'm expecting lazy consensus on fixing CI unless someone proposes a better
> > plan. Let's keep everyone informed in this mailing list thread.
> >
> > -Lari
> >
> >
> > On 2022/09/06 14:41:07 Dave Fisher wrote:
> > > We are going to need to take actions to fix our problems. See
> > https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> > >
> > > Jarek has done a large amount of GitHub Action work with Apache Airflow
> > and his suggestions might be helpful. One of his suggestions was Apache
> > Yetus. I think he means using the Maven plugins -
> > https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> > >
> > >
> > > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org> wrote:
> > > >
> > > > The Apache Infra ticket is
> > https://issues.apache.org/jira/browse/INFRA-23633 .
> > > >
> > > > -Lari
> > > >
> > > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > > >> I asked for an update on the Apache org GitHub Actions usage stats
> > from Gavin McDonald on the-asf slack in this thread:
> > https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> > .
> > > >>
> > > >> I hope we get this issue resolved since it delays PR processing a lot.
> > > >>
> > > >> -Lari
> > > >>
> > > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > > >>> Pulsar CI continues to be congested, and the build queue [1] is very
> > long at the moment. There are 147 build jobs in the queue and 16 jobs in
> > progress at the moment.
> > > >>>
> > > >>> I would strongly advice everyone to use "personal CI" to mitigate
> > the issue of the long delay of CI feedback. You can simply open a PR to
> > your own personal fork of apache/pulsar to run the builds in your "personal
> > CI". There's more details in the previous emails in this thread.
> > > >>>
> > > >>> -Lari
> > > >>>
> > > >>> [1] - build queue:
> > https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > > >>>
> > > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > >>>> Pulsar CI continues to be congested, and the build queue is long.
> > > >>>>
> > > >>>> I would strongly advice everyone to use "personal CI" to mitigate
> > the issue of the long delay of CI feedback. You can simply open a PR to
> > your own personal fork of apache/pulsar to run the builds in your "personal
> > CI". There's more details in the previous email in this thread.
> > > >>>>
> > > >>>> Some updates:
> > > >>>>
> > > >>>> There has been a discussion with Gavin McDonald from ASF infra on
> > the-asf slack about getting usage reports from GitHub to support the
> > investigation. Slack thread is the same one mentioned in the previous
> > email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > Gavin already requested the usage report in GitHub UI, but it produced
> > invalid results.
> > > >>>>
> > > >>>> I made a change to mitigate a source of additional GitHub Actions
> > overhead.
> > > >>>> In the past, each cherry-picked commit to a maintenance branch of
> > Pulsar has triggered a lot of workflow runs.
> > > >>>>
> > > >>>> The solution for cancelling duplicate builds automatically is to
> > add this definition to the workflow definition:
> > > >>>> concurrency:
> > > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > > >>>>  cancel-in-progress: true
> > > >>>>
> > > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > > >>>>
> > > >>>> branch-2.10 change:
> > > >>>>
> > https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > >>>> branch-2.9 change:
> > > >>>>
> > https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > >>>> branch-2.8 change:
> > > >>>>
> > https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > >>>> branch-2.7:
> > > >>>>
> > https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > >>>>
> > > >>>> branch-2.11 already contains the necessary config for cancelling
> > duplicate builds.
> > > >>>>
> > > >>>> The benefit of the above change is that when multiple commits are
> > cherry-picked to a branch at once, only the build of the last commit will
> > get run eventually. The builds for the intermediate commits will get
> > cancelled. Obviously there's a tradeoff here that we don't get the
> > information if one of the earlier commits breaks the build. It's the cost
> > that we need to pay. Nevertheless our build is so flaky that it's hard to
> > determine whether a failed build result is only caused by bad flaky test or
> > whether it's an actual failure. Because of this we don't lose anything by
> > cancelling builds. It's more important to save build resources. In the
> > maintenance branches for 2.10 and older, the average total build time
> > consumed is around 20 hours which is a lot.
> > > >>>>
> > > >>>> At this time, the overhead of maintenance branch builds doesn't
> > seem to be the source of the problems. There must be some other issue which
> > is possibly related to exceeding a usage quota. Hopefully we get the CI
> > slowness issue solved asap.
> > > >>>>
> > > >>>> BR,
> > > >>>>
> > > >>>> Lari
> > > >>>>
> > > >>>>
> > > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > >>>>> Hi,
> > > >>>>>
> > > >>>>> GitHub Actions builds have been piling up in the build queue in
> > the last few days.
> > > >>>>> I posted on builds@apache.org
> > https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> > created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
> > about this issue.
> > > >>>>> There's also a thread on the-asf slack,
> > https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > > >>>>>
> > > >>>>> It seems that our build queue is finally getting picked up, but it
> > would be great to see if we hit quota and whether that is the cause of
> > pauses.
> > > >>>>>
> > > >>>>> Another issue is that the master branch broke after merging 2
> > conflicting PRs.
> > > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 .
> > > >>>>>
> > > >>>>> Merging PRs will be slow until we have these 2 problems solved and
> > existing PRs rebased over the changes. Let's prioritize merging #17300
> > before pushing more changes.
> > > >>>>>
> > > >>>>> I'd like to point out that a good way to get build feedback before
> > sending a PR, is to run builds on your personal GitHub Actions CI. The
> > benefit of this is that it doesn't consume the shared quota and builds
> > usually start instantly.
> > > >>>>> There are instructions in the contributors guide about this.
> > > >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > >>>>> You simply open PRs to your own fork of apache/pulsar to run
> > builds on your personal GitHub Actions CI.
> > > >>>>>
> > > >>>>> BR,
> > > >>>>>
> > > >>>>> Lari
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>>
> > > >>>>
> > > >>>
> > > >>
> > >
> > >
> >
> 

Re: Pulsar CI congested, master branch build broken

Posted by tison <wa...@gmail.com>.
> selecting which jobs to process

Do you have a patch to implement this? IIRC it requires interacting with
an outside service, or at least adding an ok-to-test label.

Besides, it increases the committers'/PMC members' workload; be aware of
that, or most contributions will stall.

Best,
tison.


Lari Hotari <lh...@apache.org> 于2022年9月8日周四 00:47写道:

> The problem with CI is becoming worse. The build queue is 235 jobs now and
> the queue time is over 7 hours.
>
> We will need to start shedding load in the build queue and get some fixes
> in.
> https://issues.apache.org/jira/browse/INFRA-23633 continues to contain
> details about some activities. I have created 2 GitHub Support tickets, but
> usually it takes up to a week to get a response.
>
> I have some assumptions about the issue, but they are just assumptions.
> One oddity is that when re-running failed jobs is used in a large
> workflow, the execution times for previously successful jobs get counted as
> if they have run.
> Here's an example:
> https://github.com/apache/pulsar/actions/runs/3003787409/usage
> The reported usage is about 3x than the actual usage.
> The assumption that I have is that the "fairness algorithm" that GitHub
> uses to provide all Apache projects about the same amount of GitHub Actions
> resources would take this flawed usage as the basis of it's decisions.
> The reason why we are getting hit by this now is that there is a high
> number of flaky test failures that cause almost every build to fail and we
> are re-running a lot of builds.
>
> Another problem there is that the GitHub Actions search doesn't always
> show all workflow runs that are running. This has happened before when the
> GitHub Actions workflow search index was corrupted. GitHub Support resolved
> that by rebuilding the search index with some manual admin operation behind
> the scenes.
>
> I'm proposing that we start shedding load from CI by cancelling build jobs
> and selecting which jobs to process so that we get the CI issue resolved.
> We might also have to disable required checks so that we have some way to
> get changes merged while CI doesn't work properly.
>
> I'm expecting lazy consensus on fixing CI unless someone proposes a better
> plan. Let's keep everyone informed in this mailing list thread.
>
> -Lari
>
>
> On 2022/09/06 14:41:07 Dave Fisher wrote:
> > We are going to need to take actions to fix our problems. See
> https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> >
> > Jarek has done a large amount of GitHub Action work with Apache Airflow
> and his suggestions might be helpful. One of his suggestions was Apache
> Yetus. I think he means using the Maven plugins -
> https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> >
> >
> > > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org> wrote:
> > >
> > > The Apache Infra ticket is
> https://issues.apache.org/jira/browse/INFRA-23633 .
> > >
> > > -Lari
> > >
> > > On 2022/09/06 11:36:46 Lari Hotari wrote:
> > >> I asked for an update on the Apache org GitHub Actions usage stats
> from Gavin McDonald on the-asf slack in this thread:
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8
> .
> > >>
> > >> I hope we get this issue resolved since it delays PR processing a lot.
> > >>
> > >> -Lari
> > >>
> > >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > >>> Pulsar CI continues to be congested, and the build queue [1] is very
> long at the moment. There are 147 build jobs in the queue and 16 jobs in
> progress at the moment.
> > >>>
> > >>> I would strongly advice everyone to use "personal CI" to mitigate
> the issue of the long delay of CI feedback. You can simply open a PR to
> your own personal fork of apache/pulsar to run the builds in your "personal
> CI". There's more details in the previous emails in this thread.
> > >>>
> > >>> -Lari
> > >>>
> > >>> [1] - build queue:
> https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > >>>
> > >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > >>>> Pulsar CI continues to be congested, and the build queue is long.
> > >>>>
> > >>>> I would strongly advice everyone to use "personal CI" to mitigate
> the issue of the long delay of CI feedback. You can simply open a PR to
> your own personal fork of apache/pulsar to run the builds in your "personal
> CI". There's more details in the previous email in this thread.
> > >>>>
> > >>>> Some updates:
> > >>>>
> > >>>> There has been a discussion with Gavin McDonald from ASF infra on
> the-asf slack about getting usage reports from GitHub to support the
> investigation. Slack thread is the same one mentioned in the previous
> email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> Gavin already requested the usage report in GitHub UI, but it produced
> invalid results.
> > >>>>
> > >>>> I made a change to mitigate a source of additional GitHub Actions
> overhead.
> > >>>> In the past, each cherry-picked commit to a maintenance branch of
> Pulsar has triggered a lot of workflow runs.
> > >>>>
> > >>>> The solution for cancelling duplicate builds automatically is to
> add this definition to the workflow definition:
> > >>>> concurrency:
> > >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> > >>>>  cancel-in-progress: true
> > >>>>
> > >>>> I added this to all maintenance branch GitHub Actions workflows:
> > >>>>
> > >>>> branch-2.10 change:
> > >>>>
> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > >>>> branch-2.9 change:
> > >>>>
> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > >>>> branch-2.8 change:
> > >>>>
> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > >>>> branch-2.7:
> > >>>>
> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > >>>>
> > >>>> branch-2.11 already contains the necessary config for cancelling
> duplicate builds.
> > >>>>
> > >>>> The benefit of the above change is that when multiple commits are
> cherry-picked to a branch at once, only the build of the last commit will
> get run eventually. The builds for the intermediate commits will get
> cancelled. Obviously there's a tradeoff here that we don't get the
> information if one of the earlier commits breaks the build. It's the cost
> that we need to pay. Nevertheless our build is so flaky that it's hard to
> determine whether a failed build result is only caused by bad flaky test or
> whether it's an actual failure. Because of this we don't lose anything by
> cancelling builds. It's more important to save build resources. In the
> maintenance branches for 2.10 and older, the average total build time
> consumed is around 20 hours which is a lot.
> > >>>>
> > >>>> At this time, the overhead of maintenance branch builds doesn't
> seem to be the source of the problems. There must be some other issue which
> is possibly related to exceeding a usage quota. Hopefully we get the CI
> slowness issue solved asap.
> > >>>>
> > >>>> BR,
> > >>>>
> > >>>> Lari
> > >>>>
> > >>>>
> > >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > >>>>> Hi,
> > >>>>>
> > >>>>> GitHub Actions builds have been piling up in the build queue in
> the last few days.
> > >>>>> I posted on builds@apache.org
> https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and
> created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633
> about this issue.
> > >>>>> There's also a thread on the-asf slack,
> https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> > >>>>>
> > >>>>> It seems that our build queue is finally getting picked up, but it
> would be great to see if we hit quota and whether that is the cause of
> pauses.
> > >>>>>
> > >>>>> Another issue is that the master branch broke after merging 2
> conflicting PRs.
> > >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 .
> > >>>>>
> > >>>>> Merging PRs will be slow until we have these 2 problems solved and
> existing PRs rebased over the changes. Let's prioritize merging #17300
> before pushing more changes.
> > >>>>>
> > >>>>> I'd like to point out that a good way to get build feedback before
> sending a PR, is to run builds on your personal GitHub Actions CI. The
> benefit of this is that it doesn't consume the shared quota and builds
> usually start instantly.
> > >>>>> There are instructions in the contributors guide about this.
> > >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > >>>>> You simply open PRs to your own fork of apache/pulsar to run
> builds on your personal GitHub Actions CI.
> > >>>>>
> > >>>>> BR,
> > >>>>>
> > >>>>> Lari
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>>
> > >>>>
> > >>>
> > >>
> >
> >
>

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
The problem with CI is becoming worse. The build queue is 235 jobs now and the queue time is over 7 hours.

We will need to start shedding load in the build queue and get some fixes in.
https://issues.apache.org/jira/browse/INFRA-23633 continues to track the details of the ongoing activities. I have created 2 GitHub Support tickets, but it usually takes up to a week to get a response.

I have some assumptions about the issue, but they are just assumptions.
One oddity is that when "Re-run failed jobs" is used in a large workflow, the execution times of the previously successful jobs get counted as if they had run again.
Here's an example: https://github.com/apache/pulsar/actions/runs/3003787409/usage
The reported usage is about 3x the actual usage.
My assumption is that the "fairness algorithm" that GitHub uses to give all Apache projects about the same amount of GitHub Actions resources takes this flawed usage figure as the basis of its decisions.
The reason why we are getting hit by this now is that a high number of flaky test failures causes almost every build to fail, so we are re-running a lot of builds.

Another problem is that the GitHub Actions search doesn't always show all workflow runs that are currently running. This has happened before, when the GitHub Actions workflow search index was corrupted; GitHub Support resolved it by rebuilding the search index with a manual admin operation behind the scenes.

I'm proposing that we start shedding load from CI by cancelling build jobs and selecting which jobs to process so that we get the CI issue resolved. We might also have to disable required checks so that we have some way to get changes merged while CI doesn't work properly.

I'm expecting lazy consensus on fixing CI unless someone proposes a better plan. Let's keep everyone informed in this mailing list thread.

-Lari


On 2022/09/06 14:41:07 Dave Fisher wrote:
> We are going to need to take actions to fix our problems. See https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> 
> Jarek has done a large amount of GitHub Action work with Apache Airflow and his suggestions might be helpful. One of his suggestions was Apache Yetus. I think he means using the Maven plugins - https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> 
> 
> > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org> wrote:
> > 
> > The Apache Infra ticket is https://issues.apache.org/jira/browse/INFRA-23633 . 
> > 
> > -Lari
> > 
> > On 2022/09/06 11:36:46 Lari Hotari wrote:
> >> I asked for an update on the Apache org GitHub Actions usage stats from Gavin McDonald on the-asf slack in this thread: https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
> >> 
> >> I hope we get this issue resolved since it delays PR processing a lot.
> >> 
> >> -Lari
> >> 
> >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> >>> Pulsar CI continues to be congested, and the build queue [1] is very long at the moment. There are 147 build jobs in the queue and 16 jobs in progress at the moment.
> >>> 
> >>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous emails in this thread.
> >>> 
> >>> -Lari
> >>> 
> >>> [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
> >>> 
> >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> >>>> Pulsar CI continues to be congested, and the build queue is long.
> >>>> 
> >>>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous email in this thread.
> >>>> 
> >>>> Some updates:
> >>>> 
> >>>> There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in GitHub UI, but it produced invalid results.
> >>>> 
> >>>> I made a change to mitigate a source of additional GitHub Actions overhead. 
> >>>> In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs. 
> >>>> 
> >>>> The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> >>>> concurrency:
> >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> >>>>  cancel-in-progress: true
> >>>> 
> >>>> I added this to all maintenance branch GitHub Actions workflows:
> >>>> 
> >>>> branch-2.10 change:
> >>>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> >>>> branch-2.9 change:
> >>>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> >>>> branch-2.8 change:
> >>>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> >>>> branch-2.7:
> >>>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> >>>> 
> >>>> branch-2.11 already contains the necessary config for cancelling duplicate builds.
> >>>> 
> >>>> The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will get run eventually. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff here that we don't get the information if one of the earlier commits breaks the build. It's the cost that we need to pay. Nevertheless our build is so flaky that it's hard to determine whether a failed build result is only caused by bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours which is a lot.
> >>>> 
> >>>> At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue which is possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
> >>>> 
> >>>> BR,
> >>>> 
> >>>> Lari
> >>>> 
> >>>> 
> >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> >>>>> Hi,
> >>>>> 
> >>>>> GitHub Actions builds have been piling up in the build queue in the last few days.
> >>>>> I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> >>>>> There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> >>>>> 
> >>>>> It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
> >>>>> 
> >>>>> Another issue is that the master branch broke after merging 2 conflicting PRs. 
> >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 . 
> >>>>> 
> >>>>> Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> >>>>> 
> >>>>> I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> >>>>> There are instructions in the contributors guide about this. 
> >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> >>>>> You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> >>>>> 
> >>>>> BR,
> >>>>> 
> >>>>> Lari
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>> 
> >>> 
> >> 
> 
> 

Re: Pulsar CI congested, master branch build broken

Posted by Liu Yu <li...@apache.org>.
Thanks Lari!

Does this issue cause the tests for PRs like https://github.com/apache/pulsar/pull/17198 to hang?

On 2022/09/06 14:41:07 Dave Fisher wrote:
> We are going to need to take actions to fix our problems. See https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749
> 
> Jarek has done a large amount of GitHub Action work with Apache Airflow and his suggestions might be helpful. One of his suggestions was Apache Yetus. I think he means using the Maven plugins - https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/
> 
> 
> > On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org> wrote:
> > 
> > The Apache Infra ticket is https://issues.apache.org/jira/browse/INFRA-23633 . 
> > 
> > -Lari
> > 
> > On 2022/09/06 11:36:46 Lari Hotari wrote:
> >> I asked for an update on the Apache org GitHub Actions usage stats from Gavin McDonald on the-asf slack in this thread: https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
> >> 
> >> I hope we get this issue resolved since it delays PR processing a lot.
> >> 
> >> -Lari
> >> 
> >> On 2022/09/06 11:16:07 Lari Hotari wrote:
> >>> Pulsar CI continues to be congested, and the build queue [1] is very long at the moment. There are 147 build jobs in the queue and 16 jobs in progress at the moment.
> >>> 
> >>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous emails in this thread.
> >>> 
> >>> -Lari
> >>> 
> >>> [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
> >>> 
> >>> On 2022/08/30 12:39:19 Lari Hotari wrote:
> >>>> Pulsar CI continues to be congested, and the build queue is long.
> >>>> 
> >>>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous email in this thread.
> >>>> 
> >>>> Some updates:
> >>>> 
> >>>> There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in GitHub UI, but it produced invalid results.
> >>>> 
> >>>> I made a change to mitigate a source of additional GitHub Actions overhead. 
> >>>> In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs. 
> >>>> 
> >>>> The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> >>>> concurrency:
> >>>>  group: ${{ github.workflow }}-${{ github.ref }}
> >>>>  cancel-in-progress: true
> >>>> 
> >>>> I added this to all maintenance branch GitHub Actions workflows:
> >>>> 
> >>>> branch-2.10 change:
> >>>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> >>>> branch-2.9 change:
> >>>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> >>>> branch-2.8 change:
> >>>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> >>>> branch-2.7:
> >>>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> >>>> 
> >>>> branch-2.11 already contains the necessary config for cancelling duplicate builds.
> >>>> 
> >>>> The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will get run eventually. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff here that we don't get the information if one of the earlier commits breaks the build. It's the cost that we need to pay. Nevertheless our build is so flaky that it's hard to determine whether a failed build result is only caused by bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours which is a lot.
> >>>> 
> >>>> At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue which is possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
> >>>> 
> >>>> BR,
> >>>> 
> >>>> Lari
> >>>> 
> >>>> 
> >>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
> >>>>> Hi,
> >>>>> 
> >>>>> GitHub Actions builds have been piling up in the build queue in the last few days.
> >>>>> I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> >>>>> There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> >>>>> 
> >>>>> It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
> >>>>> 
> >>>>> Another issue is that the master branch broke after merging 2 conflicting PRs. 
> >>>>> The fix is in https://github.com/apache/pulsar/pull/17300 . 
> >>>>> 
> >>>>> Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> >>>>> 
> >>>>> I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> >>>>> There are instructions in the contributors guide about this. 
> >>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> >>>>> You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> >>>>> 
> >>>>> BR,
> >>>>> 
> >>>>> Lari
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>>> 
> >>>> 
> >>> 
> >> 
> 
> 

Re: Pulsar CI congested, master branch build broken

Posted by Dave Fisher <wa...@apache.org>.
We are going to need to take action to fix our problems. See https://issues.apache.org/jira/browse/INFRA-23633?focusedCommentId=17600749&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17600749

Jarek has done a large amount of GitHub Actions work with Apache Airflow, and his suggestions might be helpful. One of his suggestions was Apache Yetus; I think he means using the Maven plugins - https://yetus.apache.org/documentation/0.14.0/yetus-maven-plugin/


> On Sep 6, 2022, at 4:48 AM, Lari Hotari <lh...@apache.org> wrote:
> 
> The Apache Infra ticket is https://issues.apache.org/jira/browse/INFRA-23633 . 
> 
> -Lari
> 
> On 2022/09/06 11:36:46 Lari Hotari wrote:
>> I asked for an update on the Apache org GitHub Actions usage stats from Gavin McDonald on the-asf slack in this thread: https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
>> 
>> I hope we get this issue resolved since it delays PR processing a lot.
>> 
>> -Lari
>> 
>> On 2022/09/06 11:16:07 Lari Hotari wrote:
>>> Pulsar CI continues to be congested, and the build queue [1] is very long at the moment. There are 147 build jobs in the queue and 16 jobs in progress at the moment.
>>> 
>>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous emails in this thread.
>>> 
>>> -Lari
>>> 
>>> [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
>>> 
>>> On 2022/08/30 12:39:19 Lari Hotari wrote:
>>>> Pulsar CI continues to be congested, and the build queue is long.
>>>> 
>>>> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous email in this thread.
>>>> 
>>>> Some updates:
>>>> 
>>>> There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in GitHub UI, but it produced invalid results.
>>>> 
>>>> I made a change to mitigate a source of additional GitHub Actions overhead. 
>>>> In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs. 
>>>> 
>>>> The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
>>>> concurrency:
>>>>  group: ${{ github.workflow }}-${{ github.ref }}
>>>>  cancel-in-progress: true
>>>> 
>>>> I added this to all maintenance branch GitHub Actions workflows:
>>>> 
>>>> branch-2.10 change:
>>>> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
>>>> branch-2.9 change:
>>>> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
>>>> branch-2.8 change:
>>>> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
>>>> branch-2.7:
>>>> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
>>>> 
>>>> branch-2.11 already contains the necessary config for cancelling duplicate builds.
>>>> 
>>>> The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will get run eventually. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff here that we don't get the information if one of the earlier commits breaks the build. It's the cost that we need to pay. Nevertheless our build is so flaky that it's hard to determine whether a failed build result is only caused by bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours which is a lot.
>>>> 
>>>> At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue which is possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
>>>> 
>>>> BR,
>>>> 
>>>> Lari
>>>> 
>>>> 
>>>> On 2022/08/26 12:00:20 Lari Hotari wrote:
>>>>> Hi,
>>>>> 
>>>>> GitHub Actions builds have been piling up in the build queue in the last few days.
>>>>> I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
>>>>> There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
>>>>> 
>>>>> It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
>>>>> 
>>>>> Another issue is that the master branch broke after merging 2 conflicting PRs. 
>>>>> The fix is in https://github.com/apache/pulsar/pull/17300 . 
>>>>> 
>>>>> Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
>>>>> 
>>>>> I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
>>>>> There are instructions in the contributors guide about this. 
>>>>> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
>>>>> You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
>>>>> 
>>>>> BR,
>>>>> 
>>>>> Lari
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>> 
>>> 
>> 


Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
The Apache Infra ticket is https://issues.apache.org/jira/browse/INFRA-23633 . 

-Lari

On 2022/09/06 11:36:46 Lari Hotari wrote:
> I asked for an update on the Apache org GitHub Actions usage stats from Gavin McDonald on the-asf slack in this thread: https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .
> 
> I hope we get this issue resolved since it delays PR processing a lot.
> 
> -Lari
> 
> On 2022/09/06 11:16:07 Lari Hotari wrote:
> > Pulsar CI continues to be congested, and the build queue [1] is very long at the moment. There are 147 build jobs in the queue and 16 jobs in progress at the moment.
> > 
> > I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous emails in this thread.
> > 
> > -Lari
> > 
> > [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
> > 
> > On 2022/08/30 12:39:19 Lari Hotari wrote:
> > > Pulsar CI continues to be congested, and the build queue is long.
> > > 
> > > I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous email in this thread.
> > > 
> > > Some updates:
> > > 
> > > There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in GitHub UI, but it produced invalid results.
> > > 
> > > I made a change to mitigate a source of additional GitHub Actions overhead. 
> > > In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs. 
> > > 
> > > The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> > > concurrency:
> > >   group: ${{ github.workflow }}-${{ github.ref }}
> > >   cancel-in-progress: true
> > > 
> > > I added this to all maintenance branch GitHub Actions workflows:
> > > 
> > > branch-2.10 change:
> > > https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > > branch-2.9 change:
> > > https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > > branch-2.8 change:
> > > https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > > branch-2.7:
> > > https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > > 
> > > branch-2.11 already contains the necessary config for cancelling duplicate builds.
> > > 
> > > The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will get run eventually. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff here that we don't get the information if one of the earlier commits breaks the build. It's the cost that we need to pay. Nevertheless our build is so flaky that it's hard to determine whether a failed build result is only caused by bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours which is a lot.
> > > 
> > > At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue which is possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
> > > 
> > > BR,
> > > 
> > > Lari
> > > 
> > > 
> > > On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > > Hi,
> > > > 
> > > > GitHub Actions builds have been piling up in the build queue in the last few days.
> > > > I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> > > > There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> > > > 
> > > > It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
> > > > 
> > > > Another issue is that the master branch broke after merging 2 conflicting PRs. 
> > > > The fix is in https://github.com/apache/pulsar/pull/17300 . 
> > > > 
> > > > Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> > > > 
> > > > I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> > > > There are instructions in the contributors guide about this. 
> > > > https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > > You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> > > > 
> > > > BR,
> > > > 
> > > > Lari
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > 
> 

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
I asked for an update on the Apache org GitHub Actions usage stats from Gavin McDonald on the-asf slack in this thread: https://the-asf.slack.com/archives/CBX4TSBQ8/p1662464113873539?thread_ts=1661512133.913279&cid=CBX4TSBQ8 .

I hope we get this issue resolved since it delays PR processing a lot.

-Lari

On 2022/09/06 11:16:07 Lari Hotari wrote:
> Pulsar CI continues to be congested, and the build queue [1] is very long at the moment. There are 147 build jobs in the queue and 16 jobs in progress at the moment.
> 
> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous emails in this thread.
> 
> -Lari
> 
> [1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued
> 
> On 2022/08/30 12:39:19 Lari Hotari wrote:
> > Pulsar CI continues to be congested, and the build queue is long.
> > 
> > I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous email in this thread.
> > 
> > Some updates:
> > 
> > There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in GitHub UI, but it produced invalid results.
> > 
> > I made a change to mitigate a source of additional GitHub Actions overhead. 
> > In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs. 
> > 
> > The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> > concurrency:
> >   group: ${{ github.workflow }}-${{ github.ref }}
> >   cancel-in-progress: true
> > 
> > I added this to all maintenance branch GitHub Actions workflows:
> > 
> > branch-2.10 change:
> > https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> > branch-2.9 change:
> > https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> > branch-2.8 change:
> > https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> > branch-2.7:
> > https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> > 
> > branch-2.11 already contains the necessary config for cancelling duplicate builds.
> > 
> > The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will get run eventually. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff here that we don't get the information if one of the earlier commits breaks the build. It's the cost that we need to pay. Nevertheless our build is so flaky that it's hard to determine whether a failed build result is only caused by bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours which is a lot.
> > 
> > At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue which is possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
> > 
> > BR,
> > 
> > Lari
> > 
> > 
> > On 2022/08/26 12:00:20 Lari Hotari wrote:
> > > Hi,
> > > 
> > > GitHub Actions builds have been piling up in the build queue in the last few days.
> > > I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> > > There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> > > 
> > > It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
> > > 
> > > Another issue is that the master branch broke after merging 2 conflicting PRs. 
> > > The fix is in https://github.com/apache/pulsar/pull/17300 . 
> > > 
> > > Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> > > 
> > > I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> > > There are instructions in the contributors guide about this. 
> > > https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > > You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> > > 
> > > BR,
> > > 
> > > Lari
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > 
> 

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
Pulsar CI continues to be congested, and the build queue [1] is very long: there are currently 147 build jobs queued and 16 jobs in progress.

I would strongly advise everyone to use "personal CI" to mitigate the long delay in CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There are more details in the previous emails in this thread.
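
The reason this works is that the CI workflows trigger on pull_request events, so a PR whose base repository is your fork runs the same workflows inside your fork, without consuming the shared apache org quota. A minimal illustrative trigger (the real workflow files under .github/workflows/ in apache/pulsar define their own names, filters and jobs) looks roughly like this:

# illustrative only, not the actual apache/pulsar workflow configuration
on:
  pull_request:
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: echo "running in the fork, on the fork's own GitHub Actions runners"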

-Lari

[1] - build queue: https://github.com/apache/pulsar/actions?query=is%3Aqueued

On 2022/08/30 12:39:19 Lari Hotari wrote:
> Pulsar CI continues to be congested, and the build queue is long.
> 
> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous email in this thread.
> 
> Some updates:
> 
> There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in GitHub UI, but it produced invalid results.
> 
> I made a change to mitigate a source of additional GitHub Actions overhead. 
> In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs. 
> 
> The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> concurrency:
>   group: ${{ github.workflow }}-${{ github.ref }}
>   cancel-in-progress: true
> 
> I added this to all maintenance branch GitHub Actions workflows:
> 
> branch-2.10 change:
> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> branch-2.9 change:
> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> branch-2.8 change:
> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> branch-2.7:
> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
> 
> branch-2.11 already contains the necessary config for cancelling duplicate builds.
> 
> The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will get run eventually. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff here that we don't get the information if one of the earlier commits breaks the build. It's the cost that we need to pay. Nevertheless our build is so flaky that it's hard to determine whether a failed build result is only caused by bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours which is a lot.
> 
> At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue which is possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
> 
> BR,
> 
> Lari
> 
> 
> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > Hi,
> > 
> > GitHub Actions builds have been piling up in the build queue in the last few days.
> > I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> > There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> > 
> > It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
> > 
> > Another issue is that the master branch broke after merging 2 conflicting PRs. 
> > The fix is in https://github.com/apache/pulsar/pull/17300 . 
> > 
> > Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> > 
> > I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> > There are instructions in the contributors guide about this. 
> > https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> > 
> > BR,
> > 
> > Lari
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> > 
> 

Re: Pulsar CI congested, master branch build broken

Posted by Enrico Olivelli <eo...@gmail.com>.
Lari,

On Tue, Aug 30, 2022 at 14:39 Lari Hotari
<lh...@apache.org> wrote:
>
> Pulsar CI continues to be congested, and the build queue is long.
>
> I would strongly advice everyone to use "personal CI" to mitigate the issue of the long delay of CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There's more details in the previous email in this thread.
>
> Some updates:
>
> There has been a discussion with Gavin McDonald from ASF infra on the-asf slack about getting usage reports from GitHub to support the investigation. Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin already requested the usage report in GitHub UI, but it produced invalid results.
>
> I made a change to mitigate a source of additional GitHub Actions overhead.
> In the past, each cherry-picked commit to a maintenance branch of Pulsar has triggered a lot of workflow runs.
>
> The solution for cancelling duplicate builds automatically is to add this definition to the workflow definition:
> concurrency:
>   group: ${{ github.workflow }}-${{ github.ref }}
>   cancel-in-progress: true
>
> I added this to all maintenance branch GitHub Actions workflows:
>
> branch-2.10 change:
> https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
> branch-2.9 change:
> https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
> branch-2.8 change:
> https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
> branch-2.7:
> https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630
>
> branch-2.11 already contains the necessary config for cancelling duplicate builds.
>
> The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will get run eventually. The builds for the intermediate commits will get cancelled. Obviously there's a tradeoff here that we don't get the information if one of the earlier commits breaks the build. It's the cost that we need to pay. Nevertheless our build is so flaky that it's hard to determine whether a failed build result is only caused by bad flaky test or whether it's an actual failure. Because of this we don't lose anything by cancelling builds. It's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours which is a lot.

Thanks!

Enrico

>
> At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue which is possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved asap.
>
> BR,
>
> Lari
>
>
> On 2022/08/26 12:00:20 Lari Hotari wrote:
> > Hi,
> >
> > GitHub Actions builds have been piling up in the build queue in the last few days.
> > I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> > There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 .
> >
> > It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses.
> >
> > Another issue is that the master branch broke after merging 2 conflicting PRs.
> > The fix is in https://github.com/apache/pulsar/pull/17300 .
> >
> > Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> >
> > I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> > There are instructions in the contributors guide about this.
> > https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> > You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> >
> > BR,
> >
> > Lari
> >
> >
> >
> >
> >
> >
> >
> >

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
Pulsar CI continues to be congested, and the build queue is long.

I would strongly advise everyone to use "personal CI" to mitigate the long delay in CI feedback. You can simply open a PR to your own personal fork of apache/pulsar to run the builds in your "personal CI". There are more details in the previous email in this thread.

Some updates:

There has been a discussion with Gavin McDonald from ASF Infra on the-asf Slack about getting usage reports from GitHub to support the investigation. The Slack thread is the same one mentioned in the previous email, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . Gavin has already requested the usage report in the GitHub UI, but it produced invalid results.

I made a change to mitigate a source of additional GitHub Actions overhead. 
In the past, each commit cherry-picked to a maintenance branch of Pulsar triggered a lot of workflow runs.

The solution for cancelling duplicate builds automatically is to add this concurrency definition to each workflow:
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
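
For context, this is roughly how the definition sits at the top level of a workflow file. The sketch below is illustrative only; the workflow name, trigger filters and the build step are assumptions, not the actual apache/pulsar configuration:

# illustrative sketch of a maintenance-branch workflow with duplicate-build cancellation
name: CI - maintenance branch (example)
on:
  push:
    branches:
      - branch-2.*
concurrency:
  group: ${{ github.workflow }}-${{ github.ref }}
  cancel-in-progress: true
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - run: mvn -B -ntp install -DskipTests   # placeholder for the real build jobs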

I added this to all maintenance branch GitHub Actions workflows:

branch-2.10 change:
https://github.com/apache/pulsar/commit/5d2c9851f4f4d70bfe74b1e683a41c5a040a6ca7
branch-2.9 change:
https://github.com/apache/pulsar/commit/3ea124924fecf636cc105de75c62b3a99050847b
branch-2.8 change:
https://github.com/apache/pulsar/commit/48187bb5d95e581f8322a019b61d986e18a31e54
branch-2.7 change:
https://github.com/apache/pulsar/commit/744b62c99344724eacdbe97c881311869d67f630

branch-2.11 already contains the necessary config for cancelling duplicate builds.

The benefit of the above change is that when multiple commits are cherry-picked to a branch at once, only the build of the last commit will eventually run; the builds for the intermediate commits get cancelled. Obviously there's a tradeoff: we don't find out if one of the earlier commits breaks the build. That is a cost we need to accept. Nevertheless, our build is so flaky that it's hard to tell whether a failed build is caused by a flaky test or by an actual regression, so we don't really lose anything by cancelling builds; it's more important to save build resources. In the maintenance branches for 2.10 and older, the average total build time consumed is around 20 hours, which is a lot.

At this time, the overhead of maintenance branch builds doesn't seem to be the source of the problems. There must be some other issue, possibly related to exceeding a usage quota. Hopefully we get the CI slowness issue solved as soon as possible.

BR,

Lari


On 2022/08/26 12:00:20 Lari Hotari wrote:
> Hi,
> 
> GitHub Actions builds have been piling up in the build queue in the last few days.
> I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> 
> It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
> 
> Another issue is that the master branch broke after merging 2 conflicting PRs. 
> The fix is in https://github.com/apache/pulsar/pull/17300 . 
> 
> Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> 
> I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> There are instructions in the contributors guide about this. 
> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> 
> BR,
> 
> Lari
> 
> 
> 
> 
> 
> 
> 
> 

Re: Pulsar CI congested, master branch build broken

Posted by Lari Hotari <lh...@apache.org>.
The master branch is broken once again. Here's the fix:
https://github.com/apache/pulsar/pull/17339

Please review and merge it.

-Lari

On 2022/08/26 12:00:20 Lari Hotari wrote:
> Hi,
> 
> GitHub Actions builds have been piling up in the build queue in the last few days.
> I posted on builds@apache.org https://lists.apache.org/thread/6lbqr0f6mqt9s8ggollp5kj2nv7rlo9s and created INFRA ticket https://issues.apache.org/jira/browse/INFRA-23633 about this issue.
> There's also a thread on the-asf slack, https://the-asf.slack.com/archives/CBX4TSBQ8/p1661512133913279 . 
> 
> It seems that our build queue is finally getting picked up, but it would be great to see if we hit quota and whether that is the cause of pauses. 
> 
> Another issue is that the master branch broke after merging 2 conflicting PRs. 
> The fix is in https://github.com/apache/pulsar/pull/17300 . 
> 
> Merging PRs will be slow until we have these 2 problems solved and existing PRs rebased over the changes. Let's prioritize merging #17300 before pushing more changes.
> 
> I'd like to point out that a good way to get build feedback before sending a PR, is to run builds on your personal GitHub Actions CI. The benefit of this is that it doesn't consume the shared quota and builds usually start instantly.
> There are instructions in the contributors guide about this. 
> https://pulsar.apache.org/contributing/#ci-testing-in-your-fork
> You simply open PRs to your own fork of apache/pulsar to run builds on your personal GitHub Actions CI.
> 
> BR,
> 
> Lari
> 
> 
> 
> 
> 
> 
> 
>