Posted to dev@pulsar.apache.org by Lari Hotari <lh...@apache.org> on 2022/04/01 08:38:54 UTC

Re-running failed flaky builds in refactored Pulsar CI GitHub Actions workflow

Hi all,

There's a small limitation in re-running failed jobs (builds that fail because of flaky tests) in the refactored Pulsar CI workflow which combines multiple jobs into a single workflow.

The limitation is that you need to wait for all jobs to complete before failed jobs can be re-run.
Yesterday there was some issue with GitHub Actions and the build queue was several hours long. When there's enough build capacity and no build queue, the new workflow finishes in about 1 hour 20 minutes.

Re-running failed jobs can be requested by commenting "/pulsarbot rerun-failure-checks" on the PR. The command does nothing while any job in the workflow is still executing.
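
The rule above can be sketched in Python. This is only an illustration, not the pulsarbot implementation: the function name rerunnable_runs is hypothetical, and the "status" and "conclusion" keys are assumed to mirror the workflow run objects returned by the GitHub REST API.

```python
def rerunnable_runs(runs):
    """Return the workflow runs that a rerun request could act on.

    Each run is a dict with "status" and "conclusion" keys, modeled on
    GitHub workflow run objects (an assumption made for this sketch).
    """
    return [
        run
        for run in runs
        # A run that is still queued or in progress cannot be rerun yet.
        if run["status"] == "completed"
        # Only failed or cancelled runs need a rerun.
        and run["conclusion"] in ("failure", "cancelled")
    ]


# Hypothetical sample data: one run still executing, one failed, one green.
runs = [
    {"id": 1, "status": "in_progress", "conclusion": None},
    {"id": 2, "status": "completed", "conclusion": "failure"},
    {"id": 3, "status": "completed", "conclusion": "success"},
]
print([run["id"] for run in rerunnable_runs(runs)])
```

With this sample data, only the completed and failed run is selected; the run that is still in progress is skipped, which matches the behavior described above.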

Another source of confusion has been the new test reporting, which shows all test results and test failures as checks and annotations in the GitHub UI.

Here's an example:
https://github.com/apache/pulsar/pull/14805/checks?check_run_id=5777139002

There's a limitation in GitHub Actions: when a PR triggers more than one workflow, the test reports get attached to the first workflow. We still have multiple workflows, and the test reports get attached to the "CI - CPP, Python Tests" workflow. Failed tests show up as red check marks. When a test is retried, it might have succeeded in a later attempt, but the check still shows as failed. This won't prevent merging the PR. Please keep this detail in mind when interpreting the build results.

The test reports are very verbose at the moment. This is a problem when checking the PR build results on the GitHub Mobile app. I have created a PR to reduce the test reporting in the GitHub Actions UI: https://github.com/apache/pulsar/pull/14959

Please let me know if there are any other questions or problems that have come up with the new refactored Pulsar CI GitHub Actions workflow.

-Lari

Re: Re-running failed flaky builds in refactored Pulsar CI GitHub Actions workflow

Posted by Lari Hotari <lh...@apache.org>.
Unfortunately, the previous change had issues restarting more than one build job.
The problem has been resolved now. The change was https://github.com/apache/pulsar-test-infra/pull/34 . I merged the change, so please do post-merge reviews.

"/pulsarbot rerun-failure-checks" should work now. I'm sorry for the inconvenience that it caused when it wasn't working for all cases. Please let me know if there are any remaining issues.

-Lari

On 2022/04/21 09:45:37 Lari Hotari wrote:
> I have made a fix to the problem described below.
> Please review https://github.com/apache/pulsar-test-infra/pull/33 .
> 
> After this change is merged, closing and reopening PRs can be used to pick up the most recent changes from the master branch, and "/pulsarbot rerun-failure-checks" will be able to rerun the failed jobs.
> 
> -Lari

Re: Re-running failed flaky builds in refactored Pulsar CI GitHub Actions workflow

Posted by Lari Hotari <lh...@apache.org>.
I have made a fix to the problem described below.
Please review https://github.com/apache/pulsar-test-infra/pull/33 .

After this change is merged, closing and reopening PRs can be used to pick up the most recent changes from the master branch, and "/pulsarbot rerun-failure-checks" will be able to rerun the failed jobs.

-Lari

On 2022/04/01 14:34:02 Lari Hotari wrote:
> I have now realized that my advice to close & reopen PRs to pick up master branch changes is problematic. It causes issues with "/pulsarbot rerun-failure-checks". The script currently looks up the builds to restart by the PR's head commit sha. If closing and reopening is used to start new PR build jobs, all build jobs will have the same head commit sha attached to them. When checking for failed builds, the script will also find old builds with the same head commit sha and restart them as well.
>
> Please rebase your PR (or merge master branch changes into it) to pick up changes from master. Don't close & reopen PRs as I had advised earlier, since it causes problems: the wrong builds get run, which adds to the build queue.
> 
> -Lari

Re: Re-running failed flaky builds in refactored Pulsar CI GitHub Actions workflow

Posted by Lari Hotari <lh...@apache.org>.
GitHub Actions currently has a problem, and the UI shows this warning:
"We are having problems searching workflow runs. The results may not be complete."
(I can see this warning on https://github.com/apache/pulsar/actions)

The impact of this is that "/pulsarbot rerun-failure-checks" doesn't work when it cannot find the failed or cancelled workflow runs.

-Lari

On 2022/04/08 07:01:33 Lari Hotari wrote:
> With the new GitHub Actions CI workflow there are cases where you see a red mark as a failure, but there's no need to rerun failed jobs since the red failure marks are a result of failed test reports (usually from failed flaky tests).
> 
> The new Pulsar CI workflow renders JUnit XML test reports and integrates them into the GitHub UI. This has multiple benefits; for example, test failures are shown directly in the PR review.
>
> You will see red failure marks without a failed job when flaky tests fail, but later pass in a retry. The failed test result will get recorded to a test report, but there's no need to rerun failed jobs.
>
> This doesn't block merging, but it will show up so that the failures can be inspected. This can be confusing at first, since everyone is used to rerunning jobs whenever a red failure mark is shown in the PR.
>
> It might appear that "/pulsarbot rerun-failure-checks" is broken. That's not the case. Usually the issue is that there's no failed job, or that the workflow where a job has failed is still executing. A failed job can only be rerun after the entire workflow has completed. That's explained in an earlier message in this thread.
>
> With test reports, there's an additional source of confusion: GitHub Actions has a bug where test reports get attached to a random workflow when multiple workflows are executing. It's a known issue and will be resolved once GitHub fixes the bug.
> (here's a link to one of the reports about the GitHub Actions bug: https://github.community/t/github-actions-status-checks-created-on-incorrect-check-suite-id/16685)
> 
> Please let me know if you have trouble with the new Pulsar CI GitHub Actions workflow and let's try to resolve the issues together.
> 
> I'll try to find a place to document the details that are mentioned in this email thread.
> 
> -Lari

Re: Re-running failed flaky builds in refactored Pulsar CI GitHub Actions workflow

Posted by Lari Hotari <lh...@apache.org>.
With the new GitHub Actions CI workflow there are cases where you see a red mark as a failure, but there's no need to rerun failed jobs since the red failure marks are a result of failed test reports (usually from failed flaky tests).

The new Pulsar CI workflow renders JUnit XML test reports and integrates them into the GitHub UI. This has multiple benefits; for example, test failures are shown directly in the PR review.

You will see red failure marks without a failed job when flaky tests fail, but later pass in a retry. The failed test result will get recorded to a test report, but there's no need to rerun failed jobs. 
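
The distinction between a red mark in the test report and a job that actually needs a rerun can be sketched in Python. This is a simplified model, not the reporting code used by the workflow: the function names classify_test and summarize are hypothetical, and it assumes each test's attempts are recorded in order, with the job passing whenever the last attempt of every test passes.

```python
def classify_test(attempts):
    """Classify a test from its ordered attempt results (True means passed)."""
    if not attempts:
        raise ValueError("a test needs at least one attempt")
    if all(attempts):
        return "passed"
    if attempts[-1]:
        # At least one earlier attempt failed, but the last retry passed.
        return "flaky"
    return "failed"


def summarize(results):
    """Tell whether the report shows a red mark and whether a rerun is needed."""
    classes = [classify_test(attempts) for attempts in results.values()]
    return {
        # Any recorded failure, even one fixed by a retry, shows up in the report.
        "red_mark_in_report": any(c != "passed" for c in classes),
        # The job itself only fails when a test never passed.
        "rerun_needed": any(c == "failed" for c in classes),
    }


# Hypothetical results: testB failed once and then passed in a retry.
results = {"ExampleTest.testA": [True], "ExampleTest.testB": [False, True]}
print(summarize(results))
```

In this sample the report carries a red mark because of the first failed attempt of testB, yet no rerun is needed since the retry passed.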

This doesn't block merging, but it will show up so that the failures can be inspected. This can be confusing at first, since everyone is used to rerunning jobs whenever a red failure mark is shown in the PR.

It might appear that "/pulsarbot rerun-failure-checks" is broken. That's not the case. Usually the issue is that there's no failed job, or that the workflow where a job has failed is still executing. A failed job can only be rerun after the entire workflow has completed. That's explained in an earlier message in this thread.

With test reports, there's an additional source of confusion: GitHub Actions has a bug where test reports get attached to a random workflow when multiple workflows are executing. It's a known issue and will be resolved once GitHub fixes the bug.
(here's a link to one of the reports about the GitHub Actions bug: https://github.community/t/github-actions-status-checks-created-on-incorrect-check-suite-id/16685)

Please let me know if you have trouble with the new Pulsar CI GitHub Actions workflow and let's try to resolve the issues together.

I'll try to find a place to document the details that are mentioned in this email thread.

-Lari


On 2022/04/01 14:34:02 Lari Hotari wrote:
> I have now realized that my advice to close & reopen PRs to pick up master branch changes is problematic. It causes issues with "/pulsarbot rerun-failure-checks". The script currently looks up the builds to restart by the PR's head commit sha. If closing and reopening is used to start new PR build jobs, all build jobs will have the same head commit sha attached to them. When checking for failed builds, the script will also find old builds with the same head commit sha and restart them as well.
>
> Please rebase your PR (or merge master branch changes into it) to pick up changes from master. Don't close & reopen PRs as I had advised earlier, since it causes problems: the wrong builds get run, which adds to the build queue.
> 
> -Lari

Re: Re-running failed flaky builds in refactored Pulsar CI GitHub Actions workflow

Posted by Lari Hotari <lh...@apache.org>.
I have now realized that my advice to close & reopen PRs to pick up master branch changes is problematic. It causes issues with "/pulsarbot rerun-failure-checks". The script currently looks up the builds to restart by the PR's head commit sha. If closing and reopening is used to start new PR build jobs, all build jobs will have the same head commit sha attached to them. When checking for failed builds, the script will also find old builds with the same head commit sha and restart them as well.
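
The problem can be illustrated with a short Python sketch. It is not the actual script: the sample data is made up, the "head_sha", "workflow_id" and "run_number" keys are assumed to mirror GitHub workflow run objects, and latest_run_per_workflow is a hypothetical helper showing one possible way to narrow the match, not necessarily how the script handles it.

```python
def runs_for_sha(runs, head_sha):
    """Select runs by head commit sha only, as described above."""
    return [run for run in runs if run["head_sha"] == head_sha]


def latest_run_per_workflow(runs, head_sha):
    """Keep only the newest run of each workflow for the given commit."""
    latest = {}
    for run in runs_for_sha(runs, head_sha):
        current = latest.get(run["workflow_id"])
        # A higher run number is assumed to mean a more recent run.
        if current is None or run["run_number"] > current["run_number"]:
            latest[run["workflow_id"]] = run
    return sorted(latest.values(), key=lambda run: run["workflow_id"])


# Hypothetical runs: 101 was started before the PR was closed and reopened,
# 102 after it. Both carry the same head commit sha.
runs = [
    {"id": 101, "workflow_id": 7, "run_number": 40, "head_sha": "abc123"},
    {"id": 102, "workflow_id": 7, "run_number": 41, "head_sha": "abc123"},
]
print([run["id"] for run in runs_for_sha(runs, "abc123")])
print([run["id"] for run in latest_run_per_workflow(runs, "abc123")])
```

Matching on the sha alone returns both the old and the new run, so both would be restarted. Keeping only the newest run per workflow returns just the run that was started after reopening.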

Please rebase your PR (or merge master branch changes into it) to pick up changes from master. Don't close & reopen PRs as I had advised earlier, since it causes problems: the wrong builds get run, which adds to the build queue.

-Lari



On 2022/04/01 08:38:54 Lari Hotari wrote:
> Hi all,
> 
> There's a small limitation in re-running failed jobs (builds that fail because of flaky tests) in the refactored Pulsar CI workflow which combines multiple jobs into a single workflow.
> 
> The limitation is that you need to wait for all jobs to complete before failed jobs can be re-run.
> Yesterday there was some issue with GitHub Actions and the build queue was several hours long. When there's enough build capacity and no build queue, the new workflow finishes in about 1 hour 20 minutes.
> 
> Re-running failed jobs can be requested by commenting "/pulsarbot rerun-failure-checks" on the PR. The command does nothing while any job in the workflow is still executing.
> 
> Another source of confusion has been the new test reporting, which shows all test results and test failures as checks and annotations in the GitHub UI.
> 
> Here's an example:
> https://github.com/apache/pulsar/pull/14805/checks?check_run_id=5777139002
> 
> There's a limitation in GitHub Actions: when a PR triggers more than one workflow, the test reports get attached to the first workflow. We still have multiple workflows, and the test reports get attached to the "CI - CPP, Python Tests" workflow. Failed tests show up as red check marks. When a test is retried, it might have succeeded in a later attempt, but the check still shows as failed. This won't prevent merging the PR. Please keep this detail in mind when interpreting the build results.
> 
> The test reports are very verbose at the moment. This is a problem when checking the PR build results on the GitHub Mobile app. I have created a PR to reduce the test reporting in the GitHub Actions UI: https://github.com/apache/pulsar/pull/14959
> 
> Please let me know if there are any other questions or problems that have come up with the new refactored Pulsar CI GitHub Actions workflow.
> 
> -Lari
>