You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@pulsar.apache.org by GitBox <gi...@apache.org> on 2021/04/06 08:32:29 UTC

[GitHub] [pulsar] lhotari opened a new pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

lhotari opened a new pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148


   ### Motivation
   
   There are a few tests that fail very often. This is blocking the merging of PRs currently.
   
   ### Modifications
   
   Move the problematic tests to "quarantine" test group. This test group will be run, but the test failures will be ignored.
   - Configure "excludedGroups" property with "quarantine" default value
   - Make modifications so that tests in the quarantine group can be run. It is required to make changes to "BeforeMethod" groups so that the method will get run for the quarantine group.
    


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] eolivelli commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
eolivelli commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814083612


   @codelipenghui the problem is that currently those tests are very flaky and we are ignoring them anyway.
   Having a test that is failing and if it fails we simply ignore it it is useless.
   In order to not lose them I suggest to create "blocker" issues for the next release and to fix them one at a time.
   
   If we have a panel that tells that a test failed with a disclaimer "do not worry about this test", then what is the meaning of that test ? it is simply a waste of resources to run it and more noise for the reviewers and for the contributors.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817642715


   I have removed all changes that aren't related to moving the most flaky tests to the "quarantine" test group. I'll create separate PRs for fixing resource leaks (memory leaks, thread leaks) and for other improvements to tests stability.
   @eolivelli @codelipenghui please review this PR again, hopefully we could get it merged since the scope is reduced a lot.
   
   These are the PRs that have been split out of this PR:
   * #10191
   * #10192
   * #10193
   * #10194
   * #10195
   * #10196
   * #10197
   * #10198
   * #10199
   I kept back a few remaining changes until these PRs have been processed.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814072886


   > I'm worried if we can't easily see the failed test, the PR will get merged, but the PR might introduce more failed tests (the change breaks some tests under the quarantine group).
   
   That's a valid concern. It's something that could be addressed after we have the initial solution in place.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-813962924


   > @lhotari How do we get the test result of the "quarantine" test group?
   
   @codelipenghui  There's no special reporting since we don't have a proper solution for reporting test reports.
   A warning will get added to the build summary page if one of tests in the "quarantine" group fails.
   This uses the GitHub Actions built-in feature. When there is `::warning::` as the prefix in console output, GitHub Actions adds a warning to the build summary page.
   
   One would have to check the logs to see what quarantined tests failed. The reporting can be improved later. Currently the CI pipeline is almost completely clogged because a few flaky tests make most builds to fail.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-813962924


   > @lhotari How do we get the test result of the "quarantine" test group?
   
   There's no special reporting since we don't have a proper solution for reporting test reports.
   A warning will get added to the build summary page if one of tests in the "quarantine" group fails.
   This uses the GitHub Actions built-in feature. When there is "::warning::" as the prefix in console output, GitHub Actions adds a warning to the build summary page.
   
   One would have to check the logs to see what quarantined tests failed. The reporting can be improved later. Currently the CI pipeline is almost completely clogged because a few flaky tests make most builds to fail.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] eolivelli commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
eolivelli commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817534354


   @lhotari good idea to split the patch
   
   you can set status to "draft" on this patch in order to keep the comments


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari removed a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari removed a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817801853






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817642715


   I have removed all changes that aren't related to moving the most flaky tests to the "quarantine" test group. I'll create separate PRs for fixing resource leaks (memory leaks, thread leaks) and for other improvements to tests stability.
   @eolivelli @codelipenghui please review this PR again, hopefully we could get it merged since the scope is reduced a lot


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari removed a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari removed a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-816865118


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-818233718


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] aahmed-se commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
aahmed-se commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814630438


   This pr shows the general direction that should be taken
   https://github.com/apache/pulsar/pull/10158


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817138225


   > I think current issue is also we're grouping too much flaky tests together which take a lot of time to execute, and if some of them time out then remaining tests don't even get a chance to be executed or retried, like normal flow usually have 3 retry but flaky group can barely finish 1 single run.
   
   @MarvinCai That is true before this PR goes in. :) 
   
   The root cause of the problems in CI has been the problems that a lot of resources haven't been properly released in tests and this caused memory and thread leaks. The shutdown sequence of the broker wasn't synchronous and that caused some tests to use a lot of resources since the shutdown of previous broker(s) could have been executing while the next test started. This together with quite a few PulsarClient and ExecutorService  leaks has been the root cause of many CI and test problems.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari removed a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari removed a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817334843


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari removed a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari removed a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-818473914


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari removed a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari removed a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815366880


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-818473914


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817332938


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814091933


   > the problem is that currently those tests are very flaky and we are ignoring them anyway.
   Having a test that is failing and if it fails we simply ignore it it is useless.
   In order to not lose them I suggest to create "blocker" issues for the next release and to fix them one at a time.
   
   If we have a panel that tells that a test failed with a disclaimer "do not worry about this test", then what is the meaning of that test ? it is simply a waste of resources to run it and more noise for the reviewers and for the contributors.
   
   The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the `flaky`. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.
   
   And if the test is very flaky, my point is we need to fix it. If it has not failed frequently before, but frequently now, this is most likely caused by some concurrent merge. A similar situation happened before, most of them are caused by the concurrent merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] MarvinCai commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
MarvinCai commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817124198


   I think current issue is also we're grouping too much flaky tests together which take a lot of time to execute, and if some of them time out then remaining tests don't even get a chance to be executed or retried, like normal flow usually have 3 retry but flaky group can barely finish 1 single run.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] aahmed-se commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
aahmed-se commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815149967


   Here is the updated pr https://github.com/apache/pulsar/pull/10158
   I have isolated all the unstable tests with extra group annotation, also there is template jos to run those tests on nightly cadence. This is temporary but suitable solution to unblock ci issues.
   
   ```When mixing class level @Test(group="...") definitions and method level @Test(group="...") definitions, it's necessary. My previous comment explains it. Perhaps you didn't have that type of use case?```
   
   I don't have that issue , we shouldn't be filtering things at the test method level, grouping should stay at the class level, mixing both is not a good idea.
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-813985376


   @lhotari How about adding a workflow to run the "quarantine" test group? By default, the new workflow is not required and the author and reviewers can get the error from the workflow, they can run locally first to make sure the failed test is not introduced by the new change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817138225


   > I think current issue is also we're grouping too much flaky tests together which take a lot of time to execute, and if some of them time out then remaining tests don't even get a chance to be executed or retried, like normal flow usually have 3 retry but flaky group can barely finish 1 single run.
   
   That is true before this PR goes in. :) 
   
   The root cause of the problems in CI has been the problems that a lot of resources haven't been properly released in tests and this caused memory and thread leaks. The shutdown sequence of the broker wasn't synchronous and that caused some tests to use a lot of resources since the shutdown of previous broker(s) could have been executing while the next test started. This together with quite a few PulsarClient and ExecutorService  leaks has been the root cause of many CI and test problems.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815177804


   > I don't have that issue , we shouldn't be filtering things at the test method level, grouping should stay at the class level, mixing both is not a good idea.
   
   @aahmed-se It seems to be pretty useful to be able to quarantine just a single test method instead of moving all tests of the test class to the quarantine. Why would it be a bad idea?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-818402308


   Sorry @lhotari, I want to check the test state of the tests under the quarantine group. But I'm not sure how to check them, Could you please point me how can I find the failed tests in the quarantine group for this PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815315686


   > The test failures are github specific issues, 
   
   @aahmed-se Please elaborate more.
   
   > we don't want to exclude things be default when devs are running things locally or in a separate ci env.
   
   Making `<excludedGroups>quarantine</excludedGroups>` the default is a very minor change. Why wouldn't this change be desired? Please provide some examples of the problems it could cause.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] aahmed-se commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
aahmed-se commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814621082


   We do need the "excludedGroups" we already have group separation of broker tests running in gitlab workflows. They do no conflict with each other.  I have used the extra group successfully in our internal branch to exclude tests from all runs, it does not require any changes in pom.xml. In regards to the "BeforeMethod" more most of the them extra in already in place if it's missing we can add them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] aahmed-se commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
aahmed-se commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815184453


   It will create confusion, user context in java is at a class level, this is enforced at file per class convention. Saying there are two executions of test one under a standard context one under a quarantine context will only confuse individuals.
   
   To run them you will need to create boolean set evaluation rules to isolate methods, testng does not have a clean abstraction to do so.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari removed a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari removed a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815266973


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814685496


   > We do need the "excludedGroups" we already have group separation of broker tests running in gitlab workflows. They do no conflict with each other. I have used the extra group successfully in our internal branch to exclude tests from all runs, it does not require any changes in pom.xml. In regards to the "BeforeMethod" more most of the them extra in already in place if it's missing we can add them.
   
   When mixing class level `@Test(group="...")` definitions and method level `@Test(group="...")` definitions, it's necessary. My previous comment explains it. Perhaps you didn't have that type of use case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814616764


   > This seems wrong, I already added this functionality, there is group "extra" available to segregate tests, we can just change the one we wish to exclude to that , they will not be run among the broker tests. We don't need all these changes.
   
   It might be wrong currently. I hope we get the changes eventually "right" together. 
   
   The "extra" group didn't really exist. It was only mentioned in the `group` parameter of a few `@BeforeMethod` annotations. 
   This isn't a sufficient solution. I learned by experimenting, that to properly exclude a group in TestNG, the group has to be listed in the pom.xml's default value for `excludedGroups`. This will resolve the issue with TestNG where a single test method is added to another group:
   
   for example:
   ```
   @Test(groups = "flaky")
   public class DeadLetterTopicTest extends ProducerConsumerBase {
   ...
       @Test
       public void testDeadLetterTopic() throws Exception {
       ....
       }
   ...
       @Test(groups = "quarantine")
       public void testDeadLetterTopicByCustomTopicName() throws Exception {
       ...
       }
   ...
   ...
   }
   ```
   In the above example, `testDeadLetterTopicByCustomTopicName` test case will belong to groups `flaky` and `quarantine` at runtime ([explanation in TestNG documentation](https://testng.org/doc/documentation-main.html#partial-groups)). If there's no `excludedGroups` setting, this method will also get run for the `flaky` group. This problem is resolved by adding `quarantine` to the `excludedGroups`. 
   
   Since it's necessary to add the `quarantine` group to the `excludedGroups`, that is the reason why it doesn't work as a solution to list all possible groups in `@BeforeMethod` annotations. 
   
   For example, when `excludedGroups` contains `quarantine`, this `BeforeMethod` doesn't get run at all, for any group:
   ```
       @BeforeMethod(groups = { "broker", "websocket", "broker-api", "broker-discovery", "broker-impl", "extra", "flaky", "quarantine" })
        public void beforeMethod(Method m) throws Exception {
            methodName = m.getName();
        }
   ```
   This problem is resolved by using
   ```
       @BeforeMethod(alwaysRun = true)
        public void beforeMethod(Method m) throws Exception {
            methodName = m.getName();
        }
   ```
   
   I hope this explanation clarifies why this PR contains necessary changes if we want to add a new TestNG group for quarantined tests. If our goal is something else, this PR isn't necessary at all. In that case, we could simply use TestNG's `org.testng.annotations.Ignore` annotation or `@Test(enabled=false)`. 
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari removed a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari removed a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817709292


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-818499916


   > Sorry @lhotari, I want to check the test state of the tests under the quarantine group. But I'm not sure how to check them, Could you please point me how can I find the failed tests in the quarantine group for this PR?
   
   @codelipenghui  The quarantined test reporting isn't fully working, but it's possible to navigate to the logs. Here are the quarantined test failures as part of "CI - Unit - Brokers - Other / unit-tests" check: https://github.com/apache/pulsar/runs/2327040599?check_suite_focus=true#step:8:133
   
   I know that it would be easier to see the results if it was a separate build job. That would add more overhead to our builds and I'd like to avoid that since the resource consumption is already really high for our builds.
   
   It seems that the intended reporting solution broke when I moved ReplicatorTests to run separately from other Quarantined tests. There are a few reasons why I had to do this, mainly because there are so many resource leakages that are fixed in PRs #10192 , #10195 , #10196 , #10197, #10198 and #10199 . I'd like to propose that we get these PRs merged and I can improve the quarantined test reporting after that. @codelipenghui Are you fine with that?
   
   It should always be a blocker issue to fix quarantined tests. Therefore the need for good reporting might not be so relevant if fixing quarantined tests is taken seriously. 
   
   Some of the problems with quarantined tests will get resolved after the PRs to fix resource leakages have been merged. For example, the stability of ReplicatorTests will improve significantly and it should be possible to move the test out of the quarantine group. There are 1 or 2 flaky test methods which need slight fixes, but the root cause has been the resource leaks and asynchronous shutdown of broker instances in tests (fixed by #10199).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815315686


   > The test failures are github specific issues, 
   
   Please elaborate more.
   
   > we don't want to exclude things be default when devs are running things locally or in a separate ci env.
   
   Making `<excludedGroups>quarantine</excludedGroups>` the default is a very minor change. Why wouldn't this change be desired? Please provide some examples of the problems it could cause.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817709292


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814196768


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815175202


   > The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the `flaky`. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.
   
   @codelipenghui I have added test reporting for quarantined tests in this PR. The [PR description contains examples](#pullrequest-609647815-permalink).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814067045


   > That's possible, but it adds yet another workflow. Each new workflow requires more build resources.
   I'll first focus on making this PR pass without adding a new workflow. Perhaps the new workflow could be added in a separate PR?
   
   I'm worried if we can't easily see the failed test, the PR will get merged, but the PR might introduce more failed tests (the change breaks some tests under the quarantine group). 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814007129


   > @lhotari How about adding a workflow to run the "quarantine" test group? By default, the new workflow is not required and the author and reviewers can get the error from the workflow, they can run locally first to make sure the failed test is not introduced by the new change.
   
   That's possible, but it adds yet another workflow. Each new workflow requires more build resources. 
   I'll first focus on making this PR pass without adding a new workflow. Perhaps the new workflow could be added in a separate PR?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-816865118


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817530032


   I'm thinking of splitting this PR into 15 or 16 PRs since it contains a lot of unrelated changes. They have a common goal to improve test stability, but the changes are unrelated technically. It will lead to better commit history although the merging of the PRs will be some overhead comparing to the way of merging this large PR at once.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817337953


   This PR is know ready for final review. Please review again @eolivelli @codelipenghui @merlimat 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] eolivelli commented on a change in pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#discussion_r611331123



##########
File path: buildtools/pom.xml
##########
@@ -26,6 +26,7 @@
     <groupId>org.apache</groupId>
     <artifactId>apache</artifactId>
     <version>23</version>
+    <relativePath></relativePath>

Review comment:
       Thanks 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui edited a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
codelipenghui edited a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814091933


   > the problem is that currently those tests are very flaky and we are ignoring them anyway.
   Having a test that is failing and if it fails we simply ignore it it is useless.
   In order to not lose them I suggest to create "blocker" issues for the next release and to fix them one at a time.
   
   > If we have a panel that tells that a test failed with a disclaimer "do not worry about this test", then what is the meaning of that test ? it is simply a waste of resources to run it and more noise for the reviewers and for the contributors.
   
   The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the `flaky`. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.
   
   And if the test is very flaky, my point is we need to fix it. If it has not failed frequently before, but frequently now, this is most likely caused by some concurrent merge. A similar situation happened before, most of them are caused by the concurrent merge.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815366880


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817883266


   /pulsarbot run-failure-check


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari removed a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari removed a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817954380






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814081083


   > There are still some problems with the way TestNG works. Passing "quarantine" in excludedGroups will prevent setup methods with annotation `@BeforeMethod(groups = {"broker", "flaky", "quarantine"})` from running. The alternative seems to be to use `@BeforeMethod(alwaysRun = true)` or use some other group in BeforeMethod ("setup"?)
   
   I resolved this issue by using `@BeforeMethod(alwaysRun = true)`. That seems to be the proper solution in this case.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814008544


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815175202


   > The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the `flaky`. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.
   
   @codelipenghui I have added test reporting for quarantined tests in this PR. The [PR description contains examples](https://github.com/apache/pulsar/pull/10148#pullrequest-609647815-permalink:~:text=Detailed%20quarantined%20test%20results).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817954380


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-818474620


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817954088


   /pulsarbot run-failure-check


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814106963


   > The flaky test does not mean it is a useless test, it can fail with A problem and B problem, A and B introduced the `flaky`. But the new change might cause C problem. If we always ignore the flaky test, the new break change might be merged into the master branch.
   
   I agree with you. There's just a special urgency now all PRs are blocked because of the very high flakiness of a few tests.
   
   > And if the test is very flaky, my point is we need to fix it. If it has not failed frequently before, but frequently now, this is most likely caused by some concurrent merge. A similar situation happened before, most of them are caused by the concurrent merge.
   
   Yes, the tests have to be fixed as top priority. However we need a way to unblock the CI. That is by moving the most flaky tests to a quarantine test group.
   
   The tests have been very flaky for a longer period of time. One fairly recent change in GitHub Actions has been the change in `ubuntu-latest` image. It now points to `ubuntu-20.04` instead of `ubuntu-18.04`. Another change that has increased the flakiness is simply the regrouping of tests. Since some tests are very fragile and flaky, a small change in the environment, such as regrouping the tests, can increase the flakiness. These 2 major changes seems to be the reason for the sudden increase in flakiness. However, the tests have been flaky even before the changes, so it's the tests that should be fixed instead of rolling back to the old environmental setup (test grouping and ubuntu-18.04 image).


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-818602100


   @lhotari Sounds good to me. We can improve the quarantined test reporting later. It's useful for the reviewers to check if the new change break some tests which is under the quarantined group.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814025029


   There are still some problems with the way TestNG works. Passing "quarantine" in excludedGroups will prevent setup methods with annotation `@BeforeMethod(groups = {"broker", "flaky", "quarantine"})` from running. The alternative seems to be to use `@BeforeMethod(alwaysRun = true)` or use some other group in BeforeMethod ("setup"?)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814682985


   > This pr shows the general direction that should be taken
   > #10158
   
   @aahmed-se Please elaborate since there is no explanation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari edited a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari edited a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817337953


   This PR is now ready for final review. Please review again @eolivelli @codelipenghui @merlimat 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815266973


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
codelipenghui commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-813953402


   @lhotari How do we get the test result of the "quarantine" test group?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814685496


   > We do need the "excludedGroups" we already have group separation of broker tests running in gitlab workflows. They do no conflict with each other. I have used the extra group successfully in our internal branch to exclude tests from all runs, it does not require any changes in pom.xml. In regards to the "BeforeMethod" more most of the them extra in already in place if it's missing we can add them.
   
   When mixing class level @Test group definitions and method level @Test group definitions, it's necessary. My previous comment explains it. Perhaps you didn't have that type of use case?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814237768


   This seems to be pretty hard to get right. TestNG has this particular detail: https://testng.org/doc/documentation-main.html#partial-groups
   If the test group is defined at the test class level, the group gets assigned to all test methods. It's not possible to override the group of a single test method, it will just add additional groups to the method. This is one reason why the current solution doesn't work. Tests get executed as part of the "quarantine" group in addition.
   I found some answer on Stackoverflow where adding a custom test listener can solve this problem. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] aahmed-se commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
aahmed-se commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815255288


   The test failures  are github specific issues, we don't want to exclude things be default when devs are running things locally or in a separate ci env.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] codelipenghui merged pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
codelipenghui merged pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] aahmed-se commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
aahmed-se commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814411853


   This seems wrong, I already added this functionality, there is group "extra" available to segregate tests, we can just change the one we wish to exclude to that , they will not be run among the broker tests. We don't need all these changes.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] aahmed-se edited a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
aahmed-se edited a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814621082


   We don't need the "excludedGroups" we already have group separation of broker tests running in gitlab workflows. They do no conflict with each other.  I have used the extra group successfully in our internal branch to exclude tests from all runs, it does not require any changes in pom.xml. In regards to the "BeforeMethod" more most of the them have extra in already in place if it's missing we can add them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-813983905


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on a change in pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on a change in pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#discussion_r611326376



##########
File path: buildtools/pom.xml
##########
@@ -26,6 +26,7 @@
     <groupId>org.apache</groupId>
     <artifactId>apache</artifactId>
     <version>23</version>
+    <relativePath></relativePath>

Review comment:
       There has been a similar problem (maven warning) as explained in this question: https://stackoverflow.com/questions/6003831/parent-relativepath-points-at-my-com-mycompanymyproject-instead-of-org-apache 
   




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] eolivelli commented on a change in pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
eolivelli commented on a change in pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#discussion_r611238825



##########
File path: buildtools/pom.xml
##########
@@ -26,6 +26,7 @@
     <groupId>org.apache</groupId>
     <artifactId>apache</artifactId>
     <version>23</version>
+    <relativePath></relativePath>

Review comment:
       Why?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-815221488


   > It will create confusion, user context in java is at a class level, this is enforced at file per class convention. Saying there are two executions of test one under a standard context one under a quarantine context will only confuse individuals.
   > 
   
   Things seem to work fine with the changes in this PR. Can you provide an example of a possible confusion? 
   
   > To run them you will need to create boolean set evaluation rules to isolate methods, testng does not have a clean abstraction to do so.
   
   The only requirement I found was to use `alwaysRun = true` for `@Before*` methods instead of listing groups. Besides that, the solution passes  `quarantine` in `excludedGroups` by default and when running quarantine tests, it passes `quarantine` in `groups` and an empty string to `excludedGroups`. Isn't this fine?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari removed a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari removed a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817332938


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817801853


   /pulsarbot run-failure-check


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-814241662


   Another problem seems to be the slowness of test execution, this is an issue at least when using `-DtestReuseFork=false`. Each test class takes remarkable time to process even when there aren't any tests for a a specific test group in the class. 
   This is one of the worst examples:
   ```
   [INFO] Tests run: 0, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 8.102 s - in org.apache.pulsar.common.naming.NamespaceBundleTest
   ```
   Usually the reported times are <0.5 s, but it's possible that it doesn't add the JVM startup time to the duration.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group and improve CI stability

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-817334843


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari commented on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari commented on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-818021703


   /pulsarbot run-failure-checks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [pulsar] lhotari removed a comment on pull request #10148: [CI] Move flaky tests that fail very often to "quarantine" test group

Posted by GitBox <gi...@apache.org>.
lhotari removed a comment on pull request #10148:
URL: https://github.com/apache/pulsar/pull/10148#issuecomment-813983905






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org