You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@geode.apache.org by Kirk Lund <kl...@apache.org> on 2021/06/08 16:33:22 UTC

[DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Our requirement for stress-new-test-openjdk11 to pass before allowing merge
doesn't really work as intended for fixing distributed tests that contain
multiple flaky test methods. In fact, I think it causes contributors to
avoid tackling flaky tests.

I've been working on GEODE-9103: CI Failure:
PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
<https://issues.apache.org/jira/browse/GEODE-9103> and was able to fix it.

However, stress-new-test-openjdk11 then continued to fail for other flaky
tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
which remains flaky.

Unfortunately, I cannot merge any of my fixes for
PutAllClientServerDistributedTest unless every single flaky test in it is
fixed by my PR. I think this is undesirable because it would be better to
merge the fix for 3 flaky test methods than none.

UPDATE: After running my precheckin more times, I did get
stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
loophole than anything because I didn't manage to fix GEODE-9242.

Despite having PR #6542 eventually pass, I would like to discuss removing
or relaxing our requirement that stress-new-test-openjdk11 must pass in
order to merge a PR...

PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
PutAllClientServerDistributedTest
<https://github.com/apache/geode/pull/6542>

Fixed in PR #6542:
* GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
testPartialKeyInPRSingleHopWithRedundancy
<https://issues.apache.org/jira/browse/GEODE-9296>
* GEODE-9103: CI Failure:
PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
<https://issues.apache.org/jira/browse/GEODE-9103>
* GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
fails due to missing disk store after server restart
<https://issues.apache.org/jira/browse/GEODE-8528>

Still flaky:
* GEODE-9242: CI failure in PutAllClientServerDistributedTest >
testEventIdOutOfOrderInPartitionRegionSingleHop
<https://issues.apache.org/jira/browse/GEODE-9242>

Thanks,
Kirk

Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Owen Nichols <on...@vmware.com>.
Thanks Kirk for tackling some of our flakiest tests!  I agree, we don't want to discourage anyone interested in paying down tech debt.

The Geode community has spoken clearly against bypassing or weakening required PR checks, so relaxing requirements in general might be a tough sell, but I'm curious what you had in mind.

It might be easier to try to get consensus on the dev list to approve a one-time exception, when a solid argument can be made that all reasonable alternatives have been exhausted (e.g. splitting up the tests or PR, re-triggering several times, running with increased timeout, etc) and the risk is low (e.g. test-only fixes or reverts).

On 6/8/21, 9:33 AM, "Kirk Lund" <kl...@apache.org> wrote:

    Our requirement for stress-new-test-openjdk11 to pass before allowing merge
    doesn't really work as intended for fixing distributed tests that contain
    multiple flaky test methods. In fact, I think it causes contributors to
    avoid tackling flaky tests.

    I've been working on GEODE-9103: CI Failure:
    PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=wxvXQmgGu%2FVL%2Bn84QzumhDCVJwVHIeXfH9CnuJTOoS0%3D&amp;reserved=0> and was able to fix it.

    However, stress-new-test-openjdk11 then continued to fail for other flaky
    tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
    GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
    which remains flaky.

    Unfortunately, I cannot merge any of my fixes for
    PutAllClientServerDistributedTest unless every single flaky test in it is
    fixed by my PR. I think this is undesirable because it would be better to
    merge the fix for 3 flaky test methods than none.

    UPDATE: After running my precheckin more times, I did get
    stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
    loophole than anything because I didn't manage to fix GEODE-9242.

    Despite having PR #6542 eventually pass, I would like to discuss removing
    or relaxing our requirement that stress-new-test-openjdk11 must pass in
    order to merge a PR...

    PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
    PutAllClientServerDistributedTest
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=PnThYY0f8z7kX3nGIT9x4gqlIU%2B3NIRMlh%2B3XNphw5c%3D&amp;reserved=0>

    Fixed in PR #6542:
    * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
    testPartialKeyInPRSingleHopWithRedundancy
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=7bFHlYbkJ%2B1NIVnasz5%2B9ykmX8f766S13s6or82tPNs%3D&amp;reserved=0>
    * GEODE-9103: CI Failure:
    PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=wxvXQmgGu%2FVL%2Bn84QzumhDCVJwVHIeXfH9CnuJTOoS0%3D&amp;reserved=0>
    * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
    fails due to missing disk store after server restart
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=31oBR7OYsFvJqSbO7mTFfGC7pA9fptCqBOcN8AoA9a0%3D&amp;reserved=0>

    Still flaky:
    * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
    testEventIdOutOfOrderInPartitionRegionSingleHop
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Conichols%40vmware.com%7Cb59c0fd23cc14802ff6708d92a9b29d4%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668177121955%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=GovuvljP%2FX%2BeR6GohGIhfwFBeAbfwUdjqeHtMKFOzes%3D&amp;reserved=0>

    Thanks,
    Kirk


Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Mark Hanson <ha...@vmware.com>.
I don't generally review code until it has passed required tests, because of the high possibility of re-review. I think that is a pretty common approach ( I could be wrong). I am a code owner and I think this is the least burden I can see. I am already reading every line of the changes. Maybe that is just me. I will let others chime in.

Thanks,
Mark

On 6/9/21, 11:15 AM, "Owen Nichols" <on...@vmware.com> wrote:

    This would substantially increase the burden on codeowners, because now in addition to looking at the code itself, they would have to wait for any running PR checks to complete, as well as remember to come back and look at the PR after any subsequent commits to make sure the checks are still passing.

    On 6/9/21, 10:56 AM, "Mark Hanson" <ha...@vmware.com> wrote:

        I think that process is a bit much. PRs are already a challenge. What do people think about code owners being the gate. We can accept by custom that there should be no stress-new-test failures. If there is a failure, a code owner can request a change or decide to let it go. I think that is sufficient all things considered.

        Thanks,
        Mark

        On 6/9/21, 10:43 AM, "Owen Nichols" <on...@vmware.com> wrote:

            I feel that a traditional [DISCUSS] and [VOTE] on the dev list would be sufficient and proper to grant approval for an override.  Any PR already needs approval from 1 codeowner per area to merge, so codeowners already have a little more say because they hold veto power over the PR.

            In terms of "practicalities of how this would actually work":
            Step 1: start a [DISCUSS] thread explaining the problem and why you think an override is justified
            Step 2: if there is concensus, [VOTE]
            Step 3: Myself (or whoever performs the override) must cite a link to the vote thread


            On 6/9/21, 10:16 AM, "Dale Emery" <de...@vmware.com> wrote:

                I too like #1 best for now… assuming it’s possible to give code owners this ability.

                Coincidentally, about option #3, II was reading the git release notes just now, and noticed there’s a new “trailers” feature. It gives git the ability to parse “key: value” pairs at the end of a commit message. We could potentially (with a sufficiently current version of git) use that to exclude a test from a PR stress test run.

                And, yeah, option #2 brings back the @FlakyTest annotation that we worked so hard to eliminate.

                As Mark said, none of this fixes the underlying problem, which I’d frame as: We have too many tests whose results we don’t trust.

                Dale

                From: Kirk Lund <kl...@apache.org>
                Date: Wednesday, June 9, 2021 at 9:59 AM
                To: dev@geode.apache.org <de...@geode.apache.org>
                Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
                I do like the suggestions offered up by Dale and would encourage (or even
                plead) with my fellow contributors to consider these:

                  *   Allow code owners to override the block, if they can be convinced the
                > override is justified.
                >   *   Exclude troublesome tests from stress test runs, either via
                > annotations or via an `assumeThat(…)` that can detect that it’s running as
                > a stress test. Whatever the mechanism for excluding, it would be in the
                > code, and therefore subject to code owner review. (This, too, feels overly
                > broad to me, as it would exclude the test from all stress test runs.)
                >   *   A way to exclude a specific test method from running in the stress
                > tests for a specific PR or commit. I don’t have any ideas for how to
                > declare such an exclusion, but if it could be done via a file it would be
                > subject to code owner review.


                1) Allow code owners to override the block, if they can be convinced the
                override is justified.

                After all, if we don't trust our code owners...

                2-3) Use a custom annotation to exclude the test method or test class only
                from stress-new-test.

                At first I really liked this idea, but then we end up with growing a
                collection of flaky tests that are excluded in some way from
                stress-new-test that still occasionally fail in distributedTest.

                #1 really sounds like the best option to me. I believe that leaving our
                stress-new-test process as-is will only discourage everyone from fixing one
                or two flaky tests in a large dunit test. However, I also believe that if
                we give code owners the authority to override stress-new-test, then we
                need to encourage them not to override this too often.

                On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

                > Maybe we can find a way to relax the requirement, or to allow addressing
                > specific situations like the tangle you find yourself in.
                >
                > Removing the requirement altogether feels overly broad. I fear it would
                > allow us to quietly disregard all intermittent test failures, and I think
                > we already quietly (or even actively) disregard way too many kinds of
                > failures.
                >
                > I would prefer some way to explicitly disregard only the specific test
                > failures that prevent us from merging, and only with some amount of
                > explicit justification.
                >
                > I’m not sure what that would look like. Some half-baked possibilities:
                >
                >   *   Allow code owners to override the block, if they can be convinced
                > the override is justified.
                >   *   Exclude troublesome tests from stress test runs, either via
                > annotations or via an `assumeThat(…)` that can detect that it’s running as
                > a stress test. Whatever the mechanism for excluding, it would be in the
                > code, and therefore subject to code owner review. (This, too, feels overly
                > broad to me, as it would exclude the test from all stress test runs.)
                >   *   A way to exclude a specific test method from running in the stress
                > tests for a specific PR or commit. I don’t have any ideas for how to
                > declare such an exclusion, but if it could be done via a file it would be
                > subject to code owner review.
                >
                > Dale
                >
                > From: Kirk Lund <kl...@apache.org>
                > Date: Tuesday, June 8, 2021 at 9:33 AM
                > To: dev@geode.apache.org <de...@geode.apache.org>
                > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
                > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
                > doesn't really work as intended for fixing distributed tests that contain
                > multiple flaky test methods. In fact, I think it causes contributors to
                > avoid tackling flaky tests.
                >
                > I've been working on GEODE-9103: CI Failure:
                > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd2625fb5b9484e6968b108d92b727ef3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593017487228%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=x%2BzRCnZb0LrEcHGKexboK52rOt4mPpcccr5g4UvL2QU%3D&amp;reserved=0> and was able to fix it.
                >
                > However, stress-new-test-openjdk11 then continued to fail for other flaky
                > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
                > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
                > which remains flaky.
                >
                > Unfortunately, I cannot merge any of my fixes for
                > PutAllClientServerDistributedTest unless every single flaky test in it is
                > fixed by my PR. I think this is undesirable because it would be better to
                > merge the fix for 3 flaky test methods than none.
                >
                > UPDATE: After running my precheckin more times, I did get
                > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
                > loophole than anything because I didn't manage to fix GEODE-9242.
                >
                > Despite having PR #6542 eventually pass, I would like to discuss removing
                > or relaxing our requirement that stress-new-test-openjdk11 must pass in
                > order to merge a PR...
                >
                > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
                > PutAllClientServerDistributedTest
                > <
                > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd2625fb5b9484e6968b108d92b727ef3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593017487228%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dSwQVAdaPXIwR%2FzOboes053p7N%2F3d%2FIVW9AMLQYPL70%3D&amp;reserved=0
                > >
                >
                > Fixed in PR #6542:
                > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
                > testPartialKeyInPRSingleHopWithRedundancy
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd2625fb5b9484e6968b108d92b727ef3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593017487228%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=AC1JVhArzmzoOKLnJDRFCWhcFsaCku%2BLK%2F4ANTx8FFU%3D&amp;reserved=0>
                > * GEODE-9103: CI Failure:
                > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd2625fb5b9484e6968b108d92b727ef3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593017487228%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=x%2BzRCnZb0LrEcHGKexboK52rOt4mPpcccr5g4UvL2QU%3D&amp;reserved=0>
                > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
                > fails due to missing disk store after server restart
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd2625fb5b9484e6968b108d92b727ef3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593017487228%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=WuLkCGtfp%2BhnDcvb435K6%2FPfCH2UtWaz6wu6hXeDZUU%3D&amp;reserved=0>
                >
                > Still flaky:
                > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
                > testEventIdOutOfOrderInPartitionRegionSingleHop
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd2625fb5b9484e6968b108d92b727ef3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593017497184%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=oVp2O4ZoTGFdcyaF3UBs6YcPOfBqtFT4JgKpiB3jz4s%3D&amp;reserved=0>
                >
                > Thanks,
                > Kirk
                >





Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Mark Hanson <ha...@vmware.com>.
I think this process sounds good. My only thought is how this works in github. I would 100% sign off on this, if it works. I know the way I proposed will work. I prefer your suggestion Dale, if we have a choice.

Thanks,
Mark

On 6/9/21, 11:22 AM, "Dale Emery" <de...@vmware.com> wrote:

    Here’s the kind of scenario I’m imagining:

      *   Code owners and other reviewers review the PR in the usual way (either before or after the tests finish).
      *   Stress new test fails, perhaps multiple times.
      *   The committer investigates and, upon concluding that the failed tests are not related to the change, requests an override from the code owners.
      *   Code owners review the failures and the committer’s justification, and decide whether to override the failures.

    In this scenario, the extra burden on code owners arises only at the committer’s request.

    Dale

    From: Owen Nichols <on...@vmware.com>
    Date: Wednesday, June 9, 2021 at 11:15 AM
    To: dev@geode.apache.org <de...@geode.apache.org>
    Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
    This would substantially increase the burden on codeowners, because now in addition to looking at the code itself, they would have to wait for any running PR checks to complete, as well as remember to come back and look at the PR after any subsequent commits to make sure the checks are still passing.

    On 6/9/21, 10:56 AM, "Mark Hanson" <ha...@vmware.com> wrote:

        I think that process is a bit much. PRs are already a challenge. What do people think about code owners being the gate. We can accept by custom that there should be no stress-new-test failures. If there is a failure, a code owner can request a change or decide to let it go. I think that is sufficient all things considered.

        Thanks,
        Mark

        On 6/9/21, 10:43 AM, "Owen Nichols" <on...@vmware.com> wrote:

            I feel that a traditional [DISCUSS] and [VOTE] on the dev list would be sufficient and proper to grant approval for an override.  Any PR already needs approval from 1 codeowner per area to merge, so codeowners already have a little more say because they hold veto power over the PR.

            In terms of "practicalities of how this would actually work":
            Step 1: start a [DISCUSS] thread explaining the problem and why you think an override is justified
            Step 2: if there is concensus, [VOTE]
            Step 3: Myself (or whoever performs the override) must cite a link to the vote thread


            On 6/9/21, 10:16 AM, "Dale Emery" <de...@vmware.com> wrote:

                I too like #1 best for now… assuming it’s possible to give code owners this ability.

                Coincidentally, about option #3, II was reading the git release notes just now, and noticed there’s a new “trailers” feature. It gives git the ability to parse “key: value” pairs at the end of a commit message. We could potentially (with a sufficiently current version of git) use that to exclude a test from a PR stress test run.

                And, yeah, option #2 brings back the @FlakyTest annotation that we worked so hard to eliminate.

                As Mark said, none of this fixes the underlying problem, which I’d frame as: We have too many tests whose results we don’t trust.

                Dale

                From: Kirk Lund <kl...@apache.org>
                Date: Wednesday, June 9, 2021 at 9:59 AM
                To: dev@geode.apache.org <de...@geode.apache.org>
                Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
                I do like the suggestions offered up by Dale and would encourage (or even
                plead) with my fellow contributors to consider these:

                  *   Allow code owners to override the block, if they can be convinced the
                > override is justified.
                >   *   Exclude troublesome tests from stress test runs, either via
                > annotations or via an `assumeThat(…)` that can detect that it’s running as
                > a stress test. Whatever the mechanism for excluding, it would be in the
                > code, and therefore subject to code owner review. (This, too, feels overly
                > broad to me, as it would exclude the test from all stress test runs.)
                >   *   A way to exclude a specific test method from running in the stress
                > tests for a specific PR or commit. I don’t have any ideas for how to
                > declare such an exclusion, but if it could be done via a file it would be
                > subject to code owner review.


                1) Allow code owners to override the block, if they can be convinced the
                override is justified.

                After all, if we don't trust our code owners...

                2-3) Use a custom annotation to exclude the test method or test class only
                from stress-new-test.

                At first I really liked this idea, but then we end up with growing a
                collection of flaky tests that are excluded in some way from
                stress-new-test that still occasionally fail in distributedTest.

                #1 really sounds like the best option to me. I believe that leaving our
                stress-new-test process as-is will only discourage everyone from fixing one
                or two flaky tests in a large dunit test. However, I also believe that if
                we give code owners the authority to override stress-new-test, then we
                need to encourage them not to override this too often.

                On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

                > Maybe we can find a way to relax the requirement, or to allow addressing
                > specific situations like the tangle you find yourself in.
                >
                > Removing the requirement altogether feels overly broad. I fear it would
                > allow us to quietly disregard all intermittent test failures, and I think
                > we already quietly (or even actively) disregard way too many kinds of
                > failures.
                >
                > I would prefer some way to explicitly disregard only the specific test
                > failures that prevent us from merging, and only with some amount of
                > explicit justification.
                >
                > I’m not sure what that would look like. Some half-baked possibilities:
                >
                >   *   Allow code owners to override the block, if they can be convinced
                > the override is justified.
                >   *   Exclude troublesome tests from stress test runs, either via
                > annotations or via an `assumeThat(…)` that can detect that it’s running as
                > a stress test. Whatever the mechanism for excluding, it would be in the
                > code, and therefore subject to code owner review. (This, too, feels overly
                > broad to me, as it would exclude the test from all stress test runs.)
                >   *   A way to exclude a specific test method from running in the stress
                > tests for a specific PR or commit. I don’t have any ideas for how to
                > declare such an exclusion, but if it could be done via a file it would be
                > subject to code owner review.
                >
                > Dale
                >
                > From: Kirk Lund <kl...@apache.org>
                > Date: Tuesday, June 8, 2021 at 9:33 AM
                > To: dev@geode.apache.org <de...@geode.apache.org>
                > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
                > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
                > doesn't really work as intended for fixing distributed tests that contain
                > multiple flaky test methods. In fact, I think it causes contributors to
                > avoid tackling flaky tests.
                >
                > I've been working on GEODE-9103: CI Failure:
                > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C9c57a46fb57c4be4493008d92b738fed%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597601235490%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=zZQJIni9QIZby%2F1EaaiCEPfA%2Ff1Xzz1yPCECj%2FM7zoI%3D&amp;reserved=0> and was able to fix it.
                >
                > However, stress-new-test-openjdk11 then continued to fail for other flaky
                > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
                > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
                > which remains flaky.
                >
                > Unfortunately, I cannot merge any of my fixes for
                > PutAllClientServerDistributedTest unless every single flaky test in it is
                > fixed by my PR. I think this is undesirable because it would be better to
                > merge the fix for 3 flaky test methods than none.
                >
                > UPDATE: After running my precheckin more times, I did get
                > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
                > loophole than anything because I didn't manage to fix GEODE-9242.
                >
                > Despite having PR #6542 eventually pass, I would like to discuss removing
                > or relaxing our requirement that stress-new-test-openjdk11 must pass in
                > order to merge a PR...
                >
                > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
                > PutAllClientServerDistributedTest
                > <
                > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7C9c57a46fb57c4be4493008d92b738fed%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597601235490%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gP8qUDczeXrN60jOiay7%2BYIyZ1BA0V4W9bBGHVrzC2g%3D&amp;reserved=0
                > >
                >
                > Fixed in PR #6542:
                > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
                > testPartialKeyInPRSingleHopWithRedundancy
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7C9c57a46fb57c4be4493008d92b738fed%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597601245443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=vSDojPOGoC8xSWpGaWkMA6YWqXbFH%2F%2BPsiQYVtrJmMc%3D&amp;reserved=0>
                > * GEODE-9103: CI Failure:
                > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C9c57a46fb57c4be4493008d92b738fed%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597601245443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ItZlv1UJGjcAVgvelHRWnP4fLDBSPteeMrtA2JrYEr8%3D&amp;reserved=0>
                > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
                > fails due to missing disk store after server restart
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7C9c57a46fb57c4be4493008d92b738fed%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597601245443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=zQ5U6NZL1%2FtKZyXiIj07wFdKas1Jo8D9OU4bB7n3ZZM%3D&amp;reserved=0>
                >
                > Still flaky:
                > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
                > testEventIdOutOfOrderInPartitionRegionSingleHop
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7C9c57a46fb57c4be4493008d92b738fed%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597601245443%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=khR%2FegjqfAAOZp3nLEw3PHus8Lv2fdI4qlGIUDjFCxY%3D&amp;reserved=0>
                >
                > Thanks,
                > Kirk
                >




Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Owen Nichols <on...@vmware.com>.
@Naba, two places to find recent flaky failures are:
* Latest mass-test report[1]
* Recently-sighted "CI Failure" Jira tickets[2]

[1] https://concourse.apachegeode-ci.info/teams/main/pipelines/apache-develop-mass-test-run/jobs/create-mass-test-run-report/builds/19
[2] https://issues.apache.org/jira/issues/?jql=project%20%3D%20GEODE%20AND%20summary%20~%20%22CI%20Failure%22%20ORDER%20BY%20updated%20DESC

On 6/10/21, 9:50 AM, "Nabarun Nag" <nn...@vmware.com> wrote:

    Hi all,


      *   We need to discuss how to prevent more flaky tests to be introduced now that stress-test is not mandatory for PRs to be merged? Reviewers checking the PR must check the tests failing in stress test and if it is a test that has been introduced or changed in the PR, the PR must be blocked with a change request or rejected.
      *   Also, in my opinion, we need to re-introduce the stress test as a mandatory check for PRs to be merged once the flaky test percentage has been reduced.

    Owen, will it be possible to put out a list of all the flaky tests and we can try to get these resolved collectively as a community. (results of the mass tests maybe). Thank you.


    Regards
    Nabarun Nag
    ________________________________
    From: Kirk Lund <kl...@apache.org>
    Sent: Thursday, June 10, 2021 9:37 AM
    To: dev@geode.apache.org <de...@geode.apache.org>
    Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

    Ok, I wanted to give this discussion another night and we still have
    consensus for making both stress-new-test non-required.

    I just filed PR #6602 <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6602&amp;data=04%7C01%7Conichols%40vmware.com%7C11c07f7f9a6d4bb1304408d92c2fc9ea%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637589406033359496%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ZvEa3p0O0DXBZEd8u9l8vwsr%2Bhb3FgzycgVxR%2FT08gs%3D&amp;reserved=0> to change
    stress-new-test from required to non-required. Please review!

    On Wed, Jun 9, 2021 at 2:11 PM Anthony Baker <ba...@vmware.com> wrote:

    > If we have consensus, no need to VOTE.
    >
    > > On Jun 9, 2021, at 12:52 PM, Owen Nichols <on...@vmware.com> wrote:
    > >
    > > Ok, I'm on board with changing stress-new-test from a required PR check
    > to non-required.  It's simple, codeowners still have the final say, and it
    > neatly avoids the whole topic of overrides.  Time to take a [VOTE]?
    > >
    >
    >


Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Kirk Lund <kl...@apache.org>.
My preference would be to have ALL test targets be mandatory to merge any
PR, but in my opinion that would require some sort of plan to fix all
flickering tests. Currently, we have a person assigned to monitor CI
everyday who then files Jira tickets against any test that intermittently
fails, but presently the vast majority of those Jira tickets remain
unassigned and are never fixed.

A couple of options right now. Make both stress-new-tests required for
merge:

1) but only for new tests

2) with zero tolerance for flickering tests -- which means all PRs not
directly associated with fixing flickering tests are frozen indefinitely
until ALL Jira tickets for flickering tests are fixed, and this process is
then repeated every time we have an intermittent failure in CI

-Kirk

On Thu, Jun 10, 2021 at 10:56 AM Dan Smith <da...@vmware.com> wrote:

> I'm also -0 on this one. I personally kinda like having the merge button
> disabled on my PRs until all the checks are passing so I don't have to
> think as much, and I'm not sure this problem is going to come up again
> soon. But I'm willing go with disabling it.
>
> -Dan
> ________________________________
> From: Mark Hanson <ha...@vmware.com>
> Sent: Thursday, June 10, 2021 10:15 AM
> To: dev@geode.apache.org <de...@geode.apache.org>
> Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from
> PRs
>
> +1 to new report from Owen
> +1 to re-introduce the stress-test
>
> Great ideas Naba!
>
> On 6/10/21, 9:50 AM, "Nabarun Nag" <nn...@vmware.com> wrote:
>
>     Hi all,
>
>
>       *   We need to discuss how to prevent more flaky tests to be
> introduced now that stress-test is not mandatory for PRs to be merged?
> Reviewers checking the PR must check the tests failing in stress test and
> if it is a test that has been introduced or changed in the PR, the PR must
> be blocked with a change request or rejected.
>       *   Also, in my opinion, we need to re-introduce the stress test as
> a mandatory check for PRs to be merged once the flaky test percentage has
> been reduced.
>
>     Owen, will it be possible to put out a list of all the flaky tests and
> we can try to get these resolved collectively as a community. (results of
> the mass tests maybe). Thank you.
>
>
>     Regards
>     Nabarun Nag
>     ________________________________
>     From: Kirk Lund <kl...@apache.org>
>     Sent: Thursday, June 10, 2021 9:37 AM
>     To: dev@geode.apache.org <de...@geode.apache.org>
>     Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement
> from PRs
>
>     Ok, I wanted to give this discussion another night and we still have
>     consensus for making both stress-new-test non-required.
>
>     I just filed PR #6602 <
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6602&amp;data=04%7C01%7Cdasmith%40vmware.com%7C58dd1b5852174343eebb08d92c335953%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637589421336670457%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2Fuc8WXORNjjYjFKNgGjqMdS%2FhK9FGFBepNCNqmfp1GQ%3D&amp;reserved=0>
> to change
>     stress-new-test from required to non-required. Please review!
>
>     On Wed, Jun 9, 2021 at 2:11 PM Anthony Baker <ba...@vmware.com>
> wrote:
>
>     > If we have consensus, no need to VOTE.
>     >
>     > > On Jun 9, 2021, at 12:52 PM, Owen Nichols <on...@vmware.com>
> wrote:
>     > >
>     > > Ok, I'm on board with changing stress-new-test from a required PR
> check
>     > to non-required.  It's simple, codeowners still have the final say,
> and it
>     > neatly avoids the whole topic of overrides.  Time to take a [VOTE]?
>     > >
>     >
>     >
>
>

Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Dan Smith <da...@vmware.com>.
I'm also -0 on this one. I personally kinda like having the merge button disabled on my PRs until all the checks are passing so I don't have to think as much, and I'm not sure this problem is going to come up again soon. But I'm willing go with disabling it.

-Dan
________________________________
From: Mark Hanson <ha...@vmware.com>
Sent: Thursday, June 10, 2021 10:15 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

+1 to new report from Owen
+1 to re-introduce the stress-test

Great ideas Naba!

On 6/10/21, 9:50 AM, "Nabarun Nag" <nn...@vmware.com> wrote:

    Hi all,


      *   We need to discuss how to prevent more flaky tests to be introduced now that stress-test is not mandatory for PRs to be merged? Reviewers checking the PR must check the tests failing in stress test and if it is a test that has been introduced or changed in the PR, the PR must be blocked with a change request or rejected.
      *   Also, in my opinion, we need to re-introduce the stress test as a mandatory check for PRs to be merged once the flaky test percentage has been reduced.

    Owen, will it be possible to put out a list of all the flaky tests and we can try to get these resolved collectively as a community. (results of the mass tests maybe). Thank you.


    Regards
    Nabarun Nag
    ________________________________
    From: Kirk Lund <kl...@apache.org>
    Sent: Thursday, June 10, 2021 9:37 AM
    To: dev@geode.apache.org <de...@geode.apache.org>
    Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

    Ok, I wanted to give this discussion another night and we still have
    consensus for making both stress-new-test non-required.

    I just filed PR #6602 <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6602&amp;data=04%7C01%7Cdasmith%40vmware.com%7C58dd1b5852174343eebb08d92c335953%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637589421336670457%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2Fuc8WXORNjjYjFKNgGjqMdS%2FhK9FGFBepNCNqmfp1GQ%3D&amp;reserved=0> to change
    stress-new-test from required to non-required. Please review!

    On Wed, Jun 9, 2021 at 2:11 PM Anthony Baker <ba...@vmware.com> wrote:

    > If we have consensus, no need to VOTE.
    >
    > > On Jun 9, 2021, at 12:52 PM, Owen Nichols <on...@vmware.com> wrote:
    > >
    > > Ok, I'm on board with changing stress-new-test from a required PR check
    > to non-required.  It's simple, codeowners still have the final say, and it
    > neatly avoids the whole topic of overrides.  Time to take a [VOTE]?
    > >
    >
    >


Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Mark Hanson <ha...@vmware.com>.
+1 to new report from Owen
+1 to re-introduce the stress-test

Great ideas Naba!

On 6/10/21, 9:50 AM, "Nabarun Nag" <nn...@vmware.com> wrote:

    Hi all,


      *   We need to discuss how to prevent more flaky tests to be introduced now that stress-test is not mandatory for PRs to be merged? Reviewers checking the PR must check the tests failing in stress test and if it is a test that has been introduced or changed in the PR, the PR must be blocked with a change request or rejected.
      *   Also, in my opinion, we need to re-introduce the stress test as a mandatory check for PRs to be merged once the flaky test percentage has been reduced.

    Owen, will it be possible to put out a list of all the flaky tests and we can try to get these resolved collectively as a community. (results of the mass tests maybe). Thank you.


    Regards
    Nabarun Nag
    ________________________________
    From: Kirk Lund <kl...@apache.org>
    Sent: Thursday, June 10, 2021 9:37 AM
    To: dev@geode.apache.org <de...@geode.apache.org>
    Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

    Ok, I wanted to give this discussion another night and we still have
    consensus for making both stress-new-test non-required.

    I just filed PR #6602 <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6602&amp;data=04%7C01%7Chansonm%40vmware.com%7C49c0edd9a83a4355b61d08d92c2fca78%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637589406063640277%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Hz77Wcq%2F44ybImp6qzx%2F4k4rCZ8eSBAS6adT5D0KvhM%3D&amp;reserved=0> to change
    stress-new-test from required to non-required. Please review!

    On Wed, Jun 9, 2021 at 2:11 PM Anthony Baker <ba...@vmware.com> wrote:

    > If we have consensus, no need to VOTE.
    >
    > > On Jun 9, 2021, at 12:52 PM, Owen Nichols <on...@vmware.com> wrote:
    > >
    > > Ok, I'm on board with changing stress-new-test from a required PR check
    > to non-required.  It's simple, codeowners still have the final say, and it
    > neatly avoids the whole topic of overrides.  Time to take a [VOTE]?
    > >
    >
    >


Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Nabarun Nag <nn...@vmware.com>.
Hi all,


  *   We need to discuss how to prevent more flaky tests to be introduced now that stress-test is not mandatory for PRs to be merged? Reviewers checking the PR must check the tests failing in stress test and if it is a test that has been introduced or changed in the PR, the PR must be blocked with a change request or rejected.
  *   Also, in my opinion, we need to re-introduce the stress test as a mandatory check for PRs to be merged once the flaky test percentage has been reduced.

Owen, will it be possible to put out a list of all the flaky tests and we can try to get these resolved collectively as a community. (results of the mass tests maybe). Thank you.


Regards
Nabarun Nag
________________________________
From: Kirk Lund <kl...@apache.org>
Sent: Thursday, June 10, 2021 9:37 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Ok, I wanted to give this discussion another night and we still have
consensus for making both stress-new-test non-required.

I just filed PR #6602 <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6602&amp;data=04%7C01%7Cnnag%40vmware.com%7C33d0e8c6a7d44c692dd808d92c2e1563%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637589398736131723%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=oxThcHojPNnLZmo8y2wJJDYnZzELNPN6cntY%2FMgvd%2Fw%3D&amp;reserved=0> to change
stress-new-test from required to non-required. Please review!

On Wed, Jun 9, 2021 at 2:11 PM Anthony Baker <ba...@vmware.com> wrote:

> If we have consensus, no need to VOTE.
>
> > On Jun 9, 2021, at 12:52 PM, Owen Nichols <on...@vmware.com> wrote:
> >
> > Ok, I'm on board with changing stress-new-test from a required PR check
> to non-required.  It's simple, codeowners still have the final say, and it
> neatly avoids the whole topic of overrides.  Time to take a [VOTE]?
> >
>
>

Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Kirk Lund <kl...@apache.org>.
Ok, I wanted to give this discussion another night and we still have
consensus for making both stress-new-test non-required.

I just filed PR #6602 <https://github.com/apache/geode/pull/6602> to change
stress-new-test from required to non-required. Please review!

On Wed, Jun 9, 2021 at 2:11 PM Anthony Baker <ba...@vmware.com> wrote:

> If we have consensus, no need to VOTE.
>
> > On Jun 9, 2021, at 12:52 PM, Owen Nichols <on...@vmware.com> wrote:
> >
> > Ok, I'm on board with changing stress-new-test from a required PR check
> to non-required.  It's simple, codeowners still have the final say, and it
> neatly avoids the whole topic of overrides.  Time to take a [VOTE]?
> >
>
>

Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Dale Emery <de...@vmware.com>.
Count me as -0. I have some concerns, but I’m okay trying this and seeing how it goes.

Dale

From: Owen Nichols <on...@vmware.com>
Date: Wednesday, June 9, 2021 at 2:25 PM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
Summarizing this thread so far:

In favor of making stress-new non-required:
Kirk
Mark
Myself

In favor of making all PR checks non-required:
Jake

In favor of hashing out a more nuanced balance between making it possible (but not too easy) to ignore stress-new failures:
Donal
Dale

Maybe that's rough concensus in terms of which option to recommend, but hardly a thread I would feel comfortable citing as evidence that the community is on board with striking down a longstanding protection that was enacted through a [VOTE].

On 6/9/21, 2:11 PM, "Anthony Baker" <ba...@vmware.com> wrote:

    If we have consensus, no need to VOTE.

    > On Jun 9, 2021, at 12:52 PM, Owen Nichols <on...@vmware.com> wrote:
    >
    > Ok, I'm on board with changing stress-new-test from a required PR check to non-required.  It's simple, codeowners still have the final say, and it neatly avoids the whole topic of overrides.  Time to take a [VOTE]?
    >


Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Owen Nichols <on...@vmware.com>.
Summarizing this thread so far:

In favor of making stress-new non-required:
Kirk
Mark
Myself

In favor of making all PR checks non-required:
Jake

In favor of hashing out a more nuanced balance between making it possible (but not too easy) to ignore stress-new failures:
Donal
Dale

Maybe that's rough concensus in terms of which option to recommend, but hardly a thread I would feel comfortable citing as evidence that the community is on board with striking down a longstanding protection that was enacted through a [VOTE]. 

On 6/9/21, 2:11 PM, "Anthony Baker" <ba...@vmware.com> wrote:

    If we have consensus, no need to VOTE.

    > On Jun 9, 2021, at 12:52 PM, Owen Nichols <on...@vmware.com> wrote:
    > 
    > Ok, I'm on board with changing stress-new-test from a required PR check to non-required.  It's simple, codeowners still have the final say, and it neatly avoids the whole topic of overrides.  Time to take a [VOTE]?
    > 



Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Anthony Baker <ba...@vmware.com>.
If we have consensus, no need to VOTE.

> On Jun 9, 2021, at 12:52 PM, Owen Nichols <on...@vmware.com> wrote:
> 
> Ok, I'm on board with changing stress-new-test from a required PR check to non-required.  It's simple, codeowners still have the final say, and it neatly avoids the whole topic of overrides.  Time to take a [VOTE]?
> 


Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Owen Nichols <on...@vmware.com>.
Ok, I'm on board with changing stress-new-test from a required PR check to non-required.  It's simple, codeowners still have the final say, and it neatly avoids the whole topic of overrides.  Time to take a [VOTE]?

On 6/9/21, 11:42 AM, "Mark Hanson" <ha...@vmware.com> wrote:

    I am agreeing with Kirk's original desire to remote stress-new-test as a requirement. All others would remain.

    Thanks,
    Mark

    On 6/9/21, 11:39 AM, "Owen Nichols" <on...@vmware.com> wrote:

        The concern of holding the discussion via PR comments, rather than the dev list, is just that many people will be excluded.  In the past the Geode community has viewed overrides as highly exceptional, and perhaps indicative of a larger problem, so I feel that any override situation is of interest to the entire community.  But at minimum, as long as there's a public recorder of decision *somewhere* that can be cited, the individual performing the override can prove that they are acting with concensus.

        Marks seems to be proposing something that goes far beyond that, essentially arguing that since we now have codeowners, we don't need gating PR checks at all, so there would be nothing to override.

        On 6/9/21, 11:22 AM, "Dale Emery" <de...@vmware.com> wrote:

            Here’s the kind of scenario I’m imagining:

              *   Code owners and other reviewers review the PR in the usual way (either before or after the tests finish).
              *   Stress new test fails, perhaps multiple times.
              *   The committer investigates and, upon concluding that the failed tests are not related to the change, requests an override from the code owners.
              *   Code owners review the failures and the committer’s justification, and decide whether to override the failures.

            In this scenario, the extra burden on code owners arises only at the committer’s request.

            Dale

            From: Owen Nichols <on...@vmware.com>
            Date: Wednesday, June 9, 2021 at 11:15 AM
            To: dev@geode.apache.org <de...@geode.apache.org>
            Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
            This would substantially increase the burden on codeowners, because now in addition to looking at the code itself, they would have to wait for any running PR checks to complete, as well as remember to come back and look at the PR after any subsequent commits to make sure the checks are still passing.

            On 6/9/21, 10:56 AM, "Mark Hanson" <ha...@vmware.com> wrote:

                I think that process is a bit much. PRs are already a challenge. What do people think about code owners being the gate. We can accept by custom that there should be no stress-new-test failures. If there is a failure, a code owner can request a change or decide to let it go. I think that is sufficient all things considered.

                Thanks,
                Mark

                On 6/9/21, 10:43 AM, "Owen Nichols" <on...@vmware.com> wrote:

                    I feel that a traditional [DISCUSS] and [VOTE] on the dev list would be sufficient and proper to grant approval for an override.  Any PR already needs approval from 1 codeowner per area to merge, so codeowners already have a little more say because they hold veto power over the PR.

                    In terms of "practicalities of how this would actually work":
                    Step 1: start a [DISCUSS] thread explaining the problem and why you think an override is justified
                    Step 2: if there is concensus, [VOTE]
                    Step 3: Myself (or whoever performs the override) must cite a link to the vote thread


                    On 6/9/21, 10:16 AM, "Dale Emery" <de...@vmware.com> wrote:

                        I too like #1 best for now… assuming it’s possible to give code owners this ability.

                        Coincidentally, about option #3, II was reading the git release notes just now, and noticed there’s a new “trailers” feature. It gives git the ability to parse “key: value” pairs at the end of a commit message. We could potentially (with a sufficiently current version of git) use that to exclude a test from a PR stress test run.

                        And, yeah, option #2 brings back the @FlakyTest annotation that we worked so hard to eliminate.

                        As Mark said, none of this fixes the underlying problem, which I’d frame as: We have too many tests whose results we don’t trust.

                        Dale

                        From: Kirk Lund <kl...@apache.org>
                        Date: Wednesday, June 9, 2021 at 9:59 AM
                        To: dev@geode.apache.org <de...@geode.apache.org>
                        Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
                        I do like the suggestions offered up by Dale and would encourage (or even
                        plead) with my fellow contributors to consider these:

                          *   Allow code owners to override the block, if they can be convinced the
                        > override is justified.
                        >   *   Exclude troublesome tests from stress test runs, either via
                        > annotations or via an `assumeThat(…)` that can detect that it’s running as
                        > a stress test. Whatever the mechanism for excluding, it would be in the
                        > code, and therefore subject to code owner review. (This, too, feels overly
                        > broad to me, as it would exclude the test from all stress test runs.)
                        >   *   A way to exclude a specific test method from running in the stress
                        > tests for a specific PR or commit. I don’t have any ideas for how to
                        > declare such an exclusion, but if it could be done via a file it would be
                        > subject to code owner review.


                        1) Allow code owners to override the block, if they can be convinced the
                        override is justified.

                        After all, if we don't trust our code owners...

                        2-3) Use a custom annotation to exclude the test method or test class only
                        from stress-new-test.

                        At first I really liked this idea, but then we end up with growing a
                        collection of flaky tests that are excluded in some way from
                        stress-new-test that still occasionally fail in distributedTest.

                        #1 really sounds like the best option to me. I believe that leaving our
                        stress-new-test process as-is will only discourage everyone from fixing one
                        or two flaky tests in a large dunit test. However, I also believe that if
                        we give code owners the authority to override stress-new-test, then we
                        need to encourage them not to override this too often.

                        On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

                        > Maybe we can find a way to relax the requirement, or to allow addressing
                        > specific situations like the tangle you find yourself in.
                        >
                        > Removing the requirement altogether feels overly broad. I fear it would
                        > allow us to quietly disregard all intermittent test failures, and I think
                        > we already quietly (or even actively) disregard way too many kinds of
                        > failures.
                        >
                        > I would prefer some way to explicitly disregard only the specific test
                        > failures that prevent us from merging, and only with some amount of
                        > explicit justification.
                        >
                        > I’m not sure what that would look like. Some half-baked possibilities:
                        >
                        >   *   Allow code owners to override the block, if they can be convinced
                        > the override is justified.
                        >   *   Exclude troublesome tests from stress test runs, either via
                        > annotations or via an `assumeThat(…)` that can detect that it’s running as
                        > a stress test. Whatever the mechanism for excluding, it would be in the
                        > code, and therefore subject to code owner review. (This, too, feels overly
                        > broad to me, as it would exclude the test from all stress test runs.)
                        >   *   A way to exclude a specific test method from running in the stress
                        > tests for a specific PR or commit. I don’t have any ideas for how to
                        > declare such an exclusion, but if it could be done via a file it would be
                        > subject to code owner review.
                        >
                        > Dale
                        >
                        > From: Kirk Lund <kl...@apache.org>
                        > Date: Tuesday, June 8, 2021 at 9:33 AM
                        > To: dev@geode.apache.org <de...@geode.apache.org>
                        > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
                        > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
                        > doesn't really work as intended for fixing distributed tests that contain
                        > multiple flaky test methods. In fact, I think it causes contributors to
                        > avoid tackling flaky tests.
                        >
                        > I've been working on GEODE-9103: CI Failure:
                        > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                        > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7Ce73adc22dae04071c7a608d92b76452b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588609238102419%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=HsyvSZnA%2BhQGHRlPl5DDn07VdKGfKIAGs%2B3iKPTX9O8%3D&amp;reserved=0> and was able to fix it.
                        >
                        > However, stress-new-test-openjdk11 then continued to fail for other flaky
                        > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
                        > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
                        > which remains flaky.
                        >
                        > Unfortunately, I cannot merge any of my fixes for
                        > PutAllClientServerDistributedTest unless every single flaky test in it is
                        > fixed by my PR. I think this is undesirable because it would be better to
                        > merge the fix for 3 flaky test methods than none.
                        >
                        > UPDATE: After running my precheckin more times, I did get
                        > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
                        > loophole than anything because I didn't manage to fix GEODE-9242.
                        >
                        > Despite having PR #6542 eventually pass, I would like to discuss removing
                        > or relaxing our requirement that stress-new-test-openjdk11 must pass in
                        > order to merge a PR...
                        >
                        > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
                        > PutAllClientServerDistributedTest
                        > <
                        > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Conichols%40vmware.com%7Ce73adc22dae04071c7a608d92b76452b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588609238102419%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=AmruTqVahgbfMXYvLg9lnTTW8JSerEMSXEpwfa6G3PE%3D&amp;reserved=0
                        > >
                        >
                        > Fixed in PR #6542:
                        > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
                        > testPartialKeyInPRSingleHopWithRedundancy
                        > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Conichols%40vmware.com%7Ce73adc22dae04071c7a608d92b76452b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588609238102419%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dC5Nj3R0tDqF7WjkFE1c25%2BBCVGF4mytHqDz6wxaijU%3D&amp;reserved=0>
                        > * GEODE-9103: CI Failure:
                        > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                        > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7Ce73adc22dae04071c7a608d92b76452b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588609238112379%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=q%2FHTvd%2FnuV5Ux0PRumrbULla89K%2FQ1ZO2or8HtErODs%3D&amp;reserved=0>
                        > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
                        > fails due to missing disk store after server restart
                        > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Conichols%40vmware.com%7Ce73adc22dae04071c7a608d92b76452b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588609238112379%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=qWEKdoH1%2BrOrg%2F5Rvu%2BUldRUKb53RC7kfkWMdT1DFa0%3D&amp;reserved=0>
                        >
                        > Still flaky:
                        > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
                        > testEventIdOutOfOrderInPartitionRegionSingleHop
                        > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Conichols%40vmware.com%7Ce73adc22dae04071c7a608d92b76452b%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588609238112379%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=XF65if3dFmsP653TERf4jDJxvxIyvEkianDwc%2BzEffY%3D&amp;reserved=0>
                        >
                        > Thanks,
                        > Kirk
                        >






Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Mark Hanson <ha...@vmware.com>.
I am agreeing with Kirk's original desire to remote stress-new-test as a requirement. All others would remain.

Thanks,
Mark

On 6/9/21, 11:39 AM, "Owen Nichols" <on...@vmware.com> wrote:

    The concern of holding the discussion via PR comments, rather than the dev list, is just that many people will be excluded.  In the past the Geode community has viewed overrides as highly exceptional, and perhaps indicative of a larger problem, so I feel that any override situation is of interest to the entire community.  But at minimum, as long as there's a public recorder of decision *somewhere* that can be cited, the individual performing the override can prove that they are acting with concensus.

    Marks seems to be proposing something that goes far beyond that, essentially arguing that since we now have codeowners, we don't need gating PR checks at all, so there would be nothing to override.

    On 6/9/21, 11:22 AM, "Dale Emery" <de...@vmware.com> wrote:

        Here’s the kind of scenario I’m imagining:

          *   Code owners and other reviewers review the PR in the usual way (either before or after the tests finish).
          *   Stress new test fails, perhaps multiple times.
          *   The committer investigates and, upon concluding that the failed tests are not related to the change, requests an override from the code owners.
          *   Code owners review the failures and the committer’s justification, and decide whether to override the failures.

        In this scenario, the extra burden on code owners arises only at the committer’s request.

        Dale

        From: Owen Nichols <on...@vmware.com>
        Date: Wednesday, June 9, 2021 at 11:15 AM
        To: dev@geode.apache.org <de...@geode.apache.org>
        Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
        This would substantially increase the burden on codeowners, because now in addition to looking at the code itself, they would have to wait for any running PR checks to complete, as well as remember to come back and look at the PR after any subsequent commits to make sure the checks are still passing.

        On 6/9/21, 10:56 AM, "Mark Hanson" <ha...@vmware.com> wrote:

            I think that process is a bit much. PRs are already a challenge. What do people think about code owners being the gate. We can accept by custom that there should be no stress-new-test failures. If there is a failure, a code owner can request a change or decide to let it go. I think that is sufficient all things considered.

            Thanks,
            Mark

            On 6/9/21, 10:43 AM, "Owen Nichols" <on...@vmware.com> wrote:

                I feel that a traditional [DISCUSS] and [VOTE] on the dev list would be sufficient and proper to grant approval for an override.  Any PR already needs approval from 1 codeowner per area to merge, so codeowners already have a little more say because they hold veto power over the PR.

                In terms of "practicalities of how this would actually work":
                Step 1: start a [DISCUSS] thread explaining the problem and why you think an override is justified
                Step 2: if there is concensus, [VOTE]
                Step 3: Myself (or whoever performs the override) must cite a link to the vote thread


                On 6/9/21, 10:16 AM, "Dale Emery" <de...@vmware.com> wrote:

                    I too like #1 best for now… assuming it’s possible to give code owners this ability.

                    Coincidentally, about option #3, II was reading the git release notes just now, and noticed there’s a new “trailers” feature. It gives git the ability to parse “key: value” pairs at the end of a commit message. We could potentially (with a sufficiently current version of git) use that to exclude a test from a PR stress test run.

                    And, yeah, option #2 brings back the @FlakyTest annotation that we worked so hard to eliminate.

                    As Mark said, none of this fixes the underlying problem, which I’d frame as: We have too many tests whose results we don’t trust.

                    Dale

                    From: Kirk Lund <kl...@apache.org>
                    Date: Wednesday, June 9, 2021 at 9:59 AM
                    To: dev@geode.apache.org <de...@geode.apache.org>
                    Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
                    I do like the suggestions offered up by Dale and would encourage (or even
                    plead) with my fellow contributors to consider these:

                      *   Allow code owners to override the block, if they can be convinced the
                    > override is justified.
                    >   *   Exclude troublesome tests from stress test runs, either via
                    > annotations or via an `assumeThat(…)` that can detect that it’s running as
                    > a stress test. Whatever the mechanism for excluding, it would be in the
                    > code, and therefore subject to code owner review. (This, too, feels overly
                    > broad to me, as it would exclude the test from all stress test runs.)
                    >   *   A way to exclude a specific test method from running in the stress
                    > tests for a specific PR or commit. I don’t have any ideas for how to
                    > declare such an exclusion, but if it could be done via a file it would be
                    > subject to code owner review.


                    1) Allow code owners to override the block, if they can be convinced the
                    override is justified.

                    After all, if we don't trust our code owners...

                    2-3) Use a custom annotation to exclude the test method or test class only
                    from stress-new-test.

                    At first I really liked this idea, but then we end up with growing a
                    collection of flaky tests that are excluded in some way from
                    stress-new-test that still occasionally fail in distributedTest.

                    #1 really sounds like the best option to me. I believe that leaving our
                    stress-new-test process as-is will only discourage everyone from fixing one
                    or two flaky tests in a large dunit test. However, I also believe that if
                    we give code owners the authority to override stress-new-test, then we
                    need to encourage them not to override this too often.

                    On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

                    > Maybe we can find a way to relax the requirement, or to allow addressing
                    > specific situations like the tangle you find yourself in.
                    >
                    > Removing the requirement altogether feels overly broad. I fear it would
                    > allow us to quietly disregard all intermittent test failures, and I think
                    > we already quietly (or even actively) disregard way too many kinds of
                    > failures.
                    >
                    > I would prefer some way to explicitly disregard only the specific test
                    > failures that prevent us from merging, and only with some amount of
                    > explicit justification.
                    >
                    > I’m not sure what that would look like. Some half-baked possibilities:
                    >
                    >   *   Allow code owners to override the block, if they can be convinced
                    > the override is justified.
                    >   *   Exclude troublesome tests from stress test runs, either via
                    > annotations or via an `assumeThat(…)` that can detect that it’s running as
                    > a stress test. Whatever the mechanism for excluding, it would be in the
                    > code, and therefore subject to code owner review. (This, too, feels overly
                    > broad to me, as it would exclude the test from all stress test runs.)
                    >   *   A way to exclude a specific test method from running in the stress
                    > tests for a specific PR or commit. I don’t have any ideas for how to
                    > declare such an exclusion, but if it could be done via a file it would be
                    > subject to code owner review.
                    >
                    > Dale
                    >
                    > From: Kirk Lund <kl...@apache.org>
                    > Date: Tuesday, June 8, 2021 at 9:33 AM
                    > To: dev@geode.apache.org <de...@geode.apache.org>
                    > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
                    > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
                    > doesn't really work as intended for fixing distributed tests that contain
                    > multiple flaky test methods. In fact, I think it causes contributors to
                    > avoid tackling flaky tests.
                    >
                    > I've been working on GEODE-9103: CI Failure:
                    > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                    > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622620801%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=r0xpSM5ftHCPKIjjTVhgJ9WuCWb0reaH4PhAviQWJNA%3D&amp;reserved=0> and was able to fix it.
                    >
                    > However, stress-new-test-openjdk11 then continued to fail for other flaky
                    > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
                    > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
                    > which remains flaky.
                    >
                    > Unfortunately, I cannot merge any of my fixes for
                    > PutAllClientServerDistributedTest unless every single flaky test in it is
                    > fixed by my PR. I think this is undesirable because it would be better to
                    > merge the fix for 3 flaky test methods than none.
                    >
                    > UPDATE: After running my precheckin more times, I did get
                    > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
                    > loophole than anything because I didn't manage to fix GEODE-9242.
                    >
                    > Despite having PR #6542 eventually pass, I would like to discuss removing
                    > or relaxing our requirement that stress-new-test-openjdk11 must pass in
                    > order to merge a PR...
                    >
                    > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
                    > PutAllClientServerDistributedTest
                    > <
                    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622620801%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=XO4mOvidSJz4pRuIYg2ryesZt1oQ9oBO2YZ6ysWfwjM%3D&amp;reserved=0
                    > >
                    >
                    > Fixed in PR #6542:
                    > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
                    > testPartialKeyInPRSingleHopWithRedundancy
                    > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622630755%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=4CNSpaZEOk1fPxglyJoN5vFb9kmW93rBkjcusxcv15s%3D&amp;reserved=0>
                    > * GEODE-9103: CI Failure:
                    > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                    > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622630755%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=V3%2FxEILHSzBiXmbYHsblLbHZ2fNE5Sz84P%2FE2Pj0qA4%3D&amp;reserved=0>
                    > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
                    > fails due to missing disk store after server restart
                    > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622630755%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=lMmrMDntB7ydwlbm6tEAO2liSX0sWIgwyudjyfQCpv0%3D&amp;reserved=0>
                    >
                    > Still flaky:
                    > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
                    > testEventIdOutOfOrderInPartitionRegionSingleHop
                    > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7C486b07d4c6e6454ed3b308d92b75e43d%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588607622630755%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=4%2FvQFkhPb1yn3PCW15wWWuYOqdL6KrQxB4bvF5giEd8%3D&amp;reserved=0>
                    >
                    > Thanks,
                    > Kirk
                    >





Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Owen Nichols <on...@vmware.com>.
The concern of holding the discussion via PR comments, rather than the dev list, is just that many people will be excluded.  In the past the Geode community has viewed overrides as highly exceptional, and perhaps indicative of a larger problem, so I feel that any override situation is of interest to the entire community.  But at minimum, as long as there's a public recorder of decision *somewhere* that can be cited, the individual performing the override can prove that they are acting with concensus.

Marks seems to be proposing something that goes far beyond that, essentially arguing that since we now have codeowners, we don't need gating PR checks at all, so there would be nothing to override.

On 6/9/21, 11:22 AM, "Dale Emery" <de...@vmware.com> wrote:

    Here’s the kind of scenario I’m imagining:

      *   Code owners and other reviewers review the PR in the usual way (either before or after the tests finish).
      *   Stress new test fails, perhaps multiple times.
      *   The committer investigates and, upon concluding that the failed tests are not related to the change, requests an override from the code owners.
      *   Code owners review the failures and the committer’s justification, and decide whether to override the failures.

    In this scenario, the extra burden on code owners arises only at the committer’s request.

    Dale

    From: Owen Nichols <on...@vmware.com>
    Date: Wednesday, June 9, 2021 at 11:15 AM
    To: dev@geode.apache.org <de...@geode.apache.org>
    Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
    This would substantially increase the burden on codeowners, because now in addition to looking at the code itself, they would have to wait for any running PR checks to complete, as well as remember to come back and look at the PR after any subsequent commits to make sure the checks are still passing.

    On 6/9/21, 10:56 AM, "Mark Hanson" <ha...@vmware.com> wrote:

        I think that process is a bit much. PRs are already a challenge. What do people think about code owners being the gate. We can accept by custom that there should be no stress-new-test failures. If there is a failure, a code owner can request a change or decide to let it go. I think that is sufficient all things considered.

        Thanks,
        Mark

        On 6/9/21, 10:43 AM, "Owen Nichols" <on...@vmware.com> wrote:

            I feel that a traditional [DISCUSS] and [VOTE] on the dev list would be sufficient and proper to grant approval for an override.  Any PR already needs approval from 1 codeowner per area to merge, so codeowners already have a little more say because they hold veto power over the PR.

            In terms of "practicalities of how this would actually work":
            Step 1: start a [DISCUSS] thread explaining the problem and why you think an override is justified
            Step 2: if there is concensus, [VOTE]
            Step 3: Myself (or whoever performs the override) must cite a link to the vote thread


            On 6/9/21, 10:16 AM, "Dale Emery" <de...@vmware.com> wrote:

                I too like #1 best for now… assuming it’s possible to give code owners this ability.

                Coincidentally, about option #3, II was reading the git release notes just now, and noticed there’s a new “trailers” feature. It gives git the ability to parse “key: value” pairs at the end of a commit message. We could potentially (with a sufficiently current version of git) use that to exclude a test from a PR stress test run.

                And, yeah, option #2 brings back the @FlakyTest annotation that we worked so hard to eliminate.

                As Mark said, none of this fixes the underlying problem, which I’d frame as: We have too many tests whose results we don’t trust.

                Dale

                From: Kirk Lund <kl...@apache.org>
                Date: Wednesday, June 9, 2021 at 9:59 AM
                To: dev@geode.apache.org <de...@geode.apache.org>
                Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
                I do like the suggestions offered up by Dale and would encourage (or even
                plead) with my fellow contributors to consider these:

                  *   Allow code owners to override the block, if they can be convinced the
                > override is justified.
                >   *   Exclude troublesome tests from stress test runs, either via
                > annotations or via an `assumeThat(…)` that can detect that it’s running as
                > a stress test. Whatever the mechanism for excluding, it would be in the
                > code, and therefore subject to code owner review. (This, too, feels overly
                > broad to me, as it would exclude the test from all stress test runs.)
                >   *   A way to exclude a specific test method from running in the stress
                > tests for a specific PR or commit. I don’t have any ideas for how to
                > declare such an exclusion, but if it could be done via a file it would be
                > subject to code owner review.


                1) Allow code owners to override the block, if they can be convinced the
                override is justified.

                After all, if we don't trust our code owners...

                2-3) Use a custom annotation to exclude the test method or test class only
                from stress-new-test.

                At first I really liked this idea, but then we end up with growing a
                collection of flaky tests that are excluded in some way from
                stress-new-test that still occasionally fail in distributedTest.

                #1 really sounds like the best option to me. I believe that leaving our
                stress-new-test process as-is will only discourage everyone from fixing one
                or two flaky tests in a large dunit test. However, I also believe that if
                we give code owners the authority to override stress-new-test, then we
                need to encourage them not to override this too often.

                On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

                > Maybe we can find a way to relax the requirement, or to allow addressing
                > specific situations like the tangle you find yourself in.
                >
                > Removing the requirement altogether feels overly broad. I fear it would
                > allow us to quietly disregard all intermittent test failures, and I think
                > we already quietly (or even actively) disregard way too many kinds of
                > failures.
                >
                > I would prefer some way to explicitly disregard only the specific test
                > failures that prevent us from merging, and only with some amount of
                > explicit justification.
                >
                > I’m not sure what that would look like. Some half-baked possibilities:
                >
                >   *   Allow code owners to override the block, if they can be convinced
                > the override is justified.
                >   *   Exclude troublesome tests from stress test runs, either via
                > annotations or via an `assumeThat(…)` that can detect that it’s running as
                > a stress test. Whatever the mechanism for excluding, it would be in the
                > code, and therefore subject to code owner review. (This, too, feels overly
                > broad to me, as it would exclude the test from all stress test runs.)
                >   *   A way to exclude a specific test method from running in the stress
                > tests for a specific PR or commit. I don’t have any ideas for how to
                > declare such an exclusion, but if it could be done via a file it would be
                > subject to code owner review.
                >
                > Dale
                >
                > From: Kirk Lund <kl...@apache.org>
                > Date: Tuesday, June 8, 2021 at 9:33 AM
                > To: dev@geode.apache.org <de...@geode.apache.org>
                > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
                > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
                > doesn't really work as intended for fixing distributed tests that contain
                > multiple flaky test methods. In fact, I think it causes contributors to
                > avoid tackling flaky tests.
                >
                > I've been working on GEODE-9103: CI Failure:
                > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7C7d38b98f0b9846d8560108d92b7390f2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597621190796%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=H4UDEoU21g0RbHFyPSkBHRRtqQiMolWYSckhnUS5z7Q%3D&amp;reserved=0> and was able to fix it.
                >
                > However, stress-new-test-openjdk11 then continued to fail for other flaky
                > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
                > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
                > which remains flaky.
                >
                > Unfortunately, I cannot merge any of my fixes for
                > PutAllClientServerDistributedTest unless every single flaky test in it is
                > fixed by my PR. I think this is undesirable because it would be better to
                > merge the fix for 3 flaky test methods than none.
                >
                > UPDATE: After running my precheckin more times, I did get
                > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
                > loophole than anything because I didn't manage to fix GEODE-9242.
                >
                > Despite having PR #6542 eventually pass, I would like to discuss removing
                > or relaxing our requirement that stress-new-test-openjdk11 must pass in
                > order to merge a PR...
                >
                > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
                > PutAllClientServerDistributedTest
                > <
                > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Conichols%40vmware.com%7C7d38b98f0b9846d8560108d92b7390f2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597621190796%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2B0QBhlQdTvhwlYUCd12rBevzRdbHlVXZfSlxaalWn6Y%3D&amp;reserved=0
                > >
                >
                > Fixed in PR #6542:
                > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
                > testPartialKeyInPRSingleHopWithRedundancy
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Conichols%40vmware.com%7C7d38b98f0b9846d8560108d92b7390f2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597621190796%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=e7QVm8H1k49VWvO97U9c8GTY9qfzUlKHsy92UukNBdk%3D&amp;reserved=0>
                > * GEODE-9103: CI Failure:
                > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7C7d38b98f0b9846d8560108d92b7390f2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597621190796%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=H4UDEoU21g0RbHFyPSkBHRRtqQiMolWYSckhnUS5z7Q%3D&amp;reserved=0>
                > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
                > fails due to missing disk store after server restart
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Conichols%40vmware.com%7C7d38b98f0b9846d8560108d92b7390f2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597621190796%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Ig8ut8ZBoKJziFrNbQVydOyOZB3zLzADj%2BmmeeOwnss%3D&amp;reserved=0>
                >
                > Still flaky:
                > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
                > testEventIdOutOfOrderInPartitionRegionSingleHop
                > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Conichols%40vmware.com%7C7d38b98f0b9846d8560108d92b7390f2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588597621190796%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ImqDe%2FwLzgsTGCbAO9G1Gk0jjPLPezhoJKsZAgbIins%3D&amp;reserved=0>
                >
                > Thanks,
                > Kirk
                >




Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Dale Emery <de...@vmware.com>.
Here’s the kind of scenario I’m imagining:

  *   Code owners and other reviewers review the PR in the usual way (either before or after the tests finish).
  *   Stress new test fails, perhaps multiple times.
  *   The committer investigates and, upon concluding that the failed tests are not related to the change, requests an override from the code owners.
  *   Code owners review the failures and the committer’s justification, and decide whether to override the failures.

In this scenario, the extra burden on code owners arises only at the committer’s request.

Dale

From: Owen Nichols <on...@vmware.com>
Date: Wednesday, June 9, 2021 at 11:15 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
This would substantially increase the burden on codeowners, because now in addition to looking at the code itself, they would have to wait for any running PR checks to complete, as well as remember to come back and look at the PR after any subsequent commits to make sure the checks are still passing.

On 6/9/21, 10:56 AM, "Mark Hanson" <ha...@vmware.com> wrote:

    I think that process is a bit much. PRs are already a challenge. What do people think about code owners being the gate. We can accept by custom that there should be no stress-new-test failures. If there is a failure, a code owner can request a change or decide to let it go. I think that is sufficient all things considered.

    Thanks,
    Mark

    On 6/9/21, 10:43 AM, "Owen Nichols" <on...@vmware.com> wrote:

        I feel that a traditional [DISCUSS] and [VOTE] on the dev list would be sufficient and proper to grant approval for an override.  Any PR already needs approval from 1 codeowner per area to merge, so codeowners already have a little more say because they hold veto power over the PR.

        In terms of "practicalities of how this would actually work":
        Step 1: start a [DISCUSS] thread explaining the problem and why you think an override is justified
        Step 2: if there is concensus, [VOTE]
        Step 3: Myself (or whoever performs the override) must cite a link to the vote thread


        On 6/9/21, 10:16 AM, "Dale Emery" <de...@vmware.com> wrote:

            I too like #1 best for now… assuming it’s possible to give code owners this ability.

            Coincidentally, about option #3, II was reading the git release notes just now, and noticed there’s a new “trailers” feature. It gives git the ability to parse “key: value” pairs at the end of a commit message. We could potentially (with a sufficiently current version of git) use that to exclude a test from a PR stress test run.

            And, yeah, option #2 brings back the @FlakyTest annotation that we worked so hard to eliminate.

            As Mark said, none of this fixes the underlying problem, which I’d frame as: We have too many tests whose results we don’t trust.

            Dale

            From: Kirk Lund <kl...@apache.org>
            Date: Wednesday, June 9, 2021 at 9:59 AM
            To: dev@geode.apache.org <de...@geode.apache.org>
            Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
            I do like the suggestions offered up by Dale and would encourage (or even
            plead) with my fellow contributors to consider these:

              *   Allow code owners to override the block, if they can be convinced the
            > override is justified.
            >   *   Exclude troublesome tests from stress test runs, either via
            > annotations or via an `assumeThat(…)` that can detect that it’s running as
            > a stress test. Whatever the mechanism for excluding, it would be in the
            > code, and therefore subject to code owner review. (This, too, feels overly
            > broad to me, as it would exclude the test from all stress test runs.)
            >   *   A way to exclude a specific test method from running in the stress
            > tests for a specific PR or commit. I don’t have any ideas for how to
            > declare such an exclusion, but if it could be done via a file it would be
            > subject to code owner review.


            1) Allow code owners to override the block, if they can be convinced the
            override is justified.

            After all, if we don't trust our code owners...

            2-3) Use a custom annotation to exclude the test method or test class only
            from stress-new-test.

            At first I really liked this idea, but then we end up with growing a
            collection of flaky tests that are excluded in some way from
            stress-new-test that still occasionally fail in distributedTest.

            #1 really sounds like the best option to me. I believe that leaving our
            stress-new-test process as-is will only discourage everyone from fixing one
            or two flaky tests in a large dunit test. However, I also believe that if
            we give code owners the authority to override stress-new-test, then we
            need to encourage them not to override this too often.

            On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

            > Maybe we can find a way to relax the requirement, or to allow addressing
            > specific situations like the tangle you find yourself in.
            >
            > Removing the requirement altogether feels overly broad. I fear it would
            > allow us to quietly disregard all intermittent test failures, and I think
            > we already quietly (or even actively) disregard way too many kinds of
            > failures.
            >
            > I would prefer some way to explicitly disregard only the specific test
            > failures that prevent us from merging, and only with some amount of
            > explicit justification.
            >
            > I’m not sure what that would look like. Some half-baked possibilities:
            >
            >   *   Allow code owners to override the block, if they can be convinced
            > the override is justified.
            >   *   Exclude troublesome tests from stress test runs, either via
            > annotations or via an `assumeThat(…)` that can detect that it’s running as
            > a stress test. Whatever the mechanism for excluding, it would be in the
            > code, and therefore subject to code owner review. (This, too, feels overly
            > broad to me, as it would exclude the test from all stress test runs.)
            >   *   A way to exclude a specific test method from running in the stress
            > tests for a specific PR or commit. I don’t have any ideas for how to
            > declare such an exclusion, but if it could be done via a file it would be
            > subject to code owner review.
            >
            > Dale
            >
            > From: Kirk Lund <kl...@apache.org>
            > Date: Tuesday, June 8, 2021 at 9:33 AM
            > To: dev@geode.apache.org <de...@geode.apache.org>
            > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
            > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
            > doesn't really work as intended for fixing distributed tests that contain
            > multiple flaky test methods. In fact, I think it causes contributors to
            > avoid tackling flaky tests.
            >
            > I've been working on GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
            > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Cdemery%40vmware.com%7C8fc35334579b46342ba108d92b727d8f%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593008265840%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=memT7rIVg5MH%2BVE4H5BsgaqF03VxJySmsm4ALjcxdE0%3D&amp;reserved=0> and was able to fix it.
            >
            > However, stress-new-test-openjdk11 then continued to fail for other flaky
            > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
            > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
            > which remains flaky.
            >
            > Unfortunately, I cannot merge any of my fixes for
            > PutAllClientServerDistributedTest unless every single flaky test in it is
            > fixed by my PR. I think this is undesirable because it would be better to
            > merge the fix for 3 flaky test methods than none.
            >
            > UPDATE: After running my precheckin more times, I did get
            > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
            > loophole than anything because I didn't manage to fix GEODE-9242.
            >
            > Despite having PR #6542 eventually pass, I would like to discuss removing
            > or relaxing our requirement that stress-new-test-openjdk11 must pass in
            > order to merge a PR...
            >
            > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
            > PutAllClientServerDistributedTest
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cdemery%40vmware.com%7C8fc35334579b46342ba108d92b727d8f%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593008275838%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=wAabvn1mSAqfME7gLOYXDW8v8uDJKpb2EprQgzdNhEM%3D&amp;reserved=0
            > >
            >
            > Fixed in PR #6542:
            > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
            > testPartialKeyInPRSingleHopWithRedundancy
            > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Cdemery%40vmware.com%7C8fc35334579b46342ba108d92b727d8f%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593008275838%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=ReDQVflxfbd5RKL8J8Tj1o27uzW0tJ2TCHfPaeNbPpg%3D&amp;reserved=0>
            > * GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
            > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Cdemery%40vmware.com%7C8fc35334579b46342ba108d92b727d8f%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593008275838%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=EfPYOhE242WN41MGqjSsZqXPEGccEi%2FeWmNVp7JuO10%3D&amp;reserved=0>
            > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
            > fails due to missing disk store after server restart
            > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Cdemery%40vmware.com%7C8fc35334579b46342ba108d92b727d8f%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593008275838%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=3YLbXf5pk47yoIpMRji175mow%2FC6zMRO4IKp%2FY84m3g%3D&amp;reserved=0>
            >
            > Still flaky:
            > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
            > testEventIdOutOfOrderInPartitionRegionSingleHop
            > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Cdemery%40vmware.com%7C8fc35334579b46342ba108d92b727d8f%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588593008275838%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2BYHeogNcjh3Z6yfrs7EDowWzVhlNcnPO2%2FmrbA0X3Vw%3D&amp;reserved=0>
            >
            > Thanks,
            > Kirk
            >



Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Owen Nichols <on...@vmware.com>.
This would substantially increase the burden on codeowners, because now in addition to looking at the code itself, they would have to wait for any running PR checks to complete, as well as remember to come back and look at the PR after any subsequent commits to make sure the checks are still passing.

On 6/9/21, 10:56 AM, "Mark Hanson" <ha...@vmware.com> wrote:

    I think that process is a bit much. PRs are already a challenge. What do people think about code owners being the gate. We can accept by custom that there should be no stress-new-test failures. If there is a failure, a code owner can request a change or decide to let it go. I think that is sufficient all things considered.

    Thanks,
    Mark

    On 6/9/21, 10:43 AM, "Owen Nichols" <on...@vmware.com> wrote:

        I feel that a traditional [DISCUSS] and [VOTE] on the dev list would be sufficient and proper to grant approval for an override.  Any PR already needs approval from 1 codeowner per area to merge, so codeowners already have a little more say because they hold veto power over the PR.

        In terms of "practicalities of how this would actually work":
        Step 1: start a [DISCUSS] thread explaining the problem and why you think an override is justified
        Step 2: if there is concensus, [VOTE]
        Step 3: Myself (or whoever performs the override) must cite a link to the vote thread


        On 6/9/21, 10:16 AM, "Dale Emery" <de...@vmware.com> wrote:

            I too like #1 best for now… assuming it’s possible to give code owners this ability.

            Coincidentally, about option #3, II was reading the git release notes just now, and noticed there’s a new “trailers” feature. It gives git the ability to parse “key: value” pairs at the end of a commit message. We could potentially (with a sufficiently current version of git) use that to exclude a test from a PR stress test run.

            And, yeah, option #2 brings back the @FlakyTest annotation that we worked so hard to eliminate.

            As Mark said, none of this fixes the underlying problem, which I’d frame as: We have too many tests whose results we don’t trust.

            Dale

            From: Kirk Lund <kl...@apache.org>
            Date: Wednesday, June 9, 2021 at 9:59 AM
            To: dev@geode.apache.org <de...@geode.apache.org>
            Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
            I do like the suggestions offered up by Dale and would encourage (or even
            plead) with my fellow contributors to consider these:

              *   Allow code owners to override the block, if they can be convinced the
            > override is justified.
            >   *   Exclude troublesome tests from stress test runs, either via
            > annotations or via an `assumeThat(…)` that can detect that it’s running as
            > a stress test. Whatever the mechanism for excluding, it would be in the
            > code, and therefore subject to code owner review. (This, too, feels overly
            > broad to me, as it would exclude the test from all stress test runs.)
            >   *   A way to exclude a specific test method from running in the stress
            > tests for a specific PR or commit. I don’t have any ideas for how to
            > declare such an exclusion, but if it could be done via a file it would be
            > subject to code owner review.


            1) Allow code owners to override the block, if they can be convinced the
            override is justified.

            After all, if we don't trust our code owners...

            2-3) Use a custom annotation to exclude the test method or test class only
            from stress-new-test.

            At first I really liked this idea, but then we end up with growing a
            collection of flaky tests that are excluded in some way from
            stress-new-test that still occasionally fail in distributedTest.

            #1 really sounds like the best option to me. I believe that leaving our
            stress-new-test process as-is will only discourage everyone from fixing one
            or two flaky tests in a large dunit test. However, I also believe that if
            we give code owners the authority to override stress-new-test, then we
            need to encourage them not to override this too often.

            On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

            > Maybe we can find a way to relax the requirement, or to allow addressing
            > specific situations like the tangle you find yourself in.
            >
            > Removing the requirement altogether feels overly broad. I fear it would
            > allow us to quietly disregard all intermittent test failures, and I think
            > we already quietly (or even actively) disregard way too many kinds of
            > failures.
            >
            > I would prefer some way to explicitly disregard only the specific test
            > failures that prevent us from merging, and only with some amount of
            > explicit justification.
            >
            > I’m not sure what that would look like. Some half-baked possibilities:
            >
            >   *   Allow code owners to override the block, if they can be convinced
            > the override is justified.
            >   *   Exclude troublesome tests from stress test runs, either via
            > annotations or via an `assumeThat(…)` that can detect that it’s running as
            > a stress test. Whatever the mechanism for excluding, it would be in the
            > code, and therefore subject to code owner review. (This, too, feels overly
            > broad to me, as it would exclude the test from all stress test runs.)
            >   *   A way to exclude a specific test method from running in the stress
            > tests for a specific PR or commit. I don’t have any ideas for how to
            > declare such an exclusion, but if it could be done via a file it would be
            > subject to code owner review.
            >
            > Dale
            >
            > From: Kirk Lund <kl...@apache.org>
            > Date: Tuesday, June 8, 2021 at 9:33 AM
            > To: dev@geode.apache.org <de...@geode.apache.org>
            > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
            > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
            > doesn't really work as intended for fixing distributed tests that contain
            > multiple flaky test methods. In fact, I think it causes contributors to
            > avoid tackling flaky tests.
            >
            > I've been working on GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
            > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189619389%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=6HhCuHXEJiWvGcjKr3UKyarforIrRHnPjhfcU4CCUIo%3D&amp;reserved=0> and was able to fix it.
            >
            > However, stress-new-test-openjdk11 then continued to fail for other flaky
            > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
            > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
            > which remains flaky.
            >
            > Unfortunately, I cannot merge any of my fixes for
            > PutAllClientServerDistributedTest unless every single flaky test in it is
            > fixed by my PR. I think this is undesirable because it would be better to
            > merge the fix for 3 flaky test methods than none.
            >
            > UPDATE: After running my precheckin more times, I did get
            > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
            > loophole than anything because I didn't manage to fix GEODE-9242.
            >
            > Despite having PR #6542 eventually pass, I would like to discuss removing
            > or relaxing our requirement that stress-new-test-openjdk11 must pass in
            > order to merge a PR...
            >
            > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
            > PutAllClientServerDistributedTest
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189619389%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=dDGAwAuQQiTT9gwARFsccHEPhbW2nlX40O0XqWOpPmw%3D&amp;reserved=0
            > >
            >
            > Fixed in PR #6542:
            > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
            > testPartialKeyInPRSingleHopWithRedundancy
            > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189619389%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=nvNvfHOeFuURlTPHeoKK9zv5YsCtH71wyrUU%2BMabexQ%3D&amp;reserved=0>
            > * GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
            > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189619389%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=6HhCuHXEJiWvGcjKr3UKyarforIrRHnPjhfcU4CCUIo%3D&amp;reserved=0>
            > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
            > fails due to missing disk store after server restart
            > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189629337%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=7TQOe44Mb%2BwM2f6JC94znuQoo53OplqDF4k4C6jkRBE%3D&amp;reserved=0>
            >
            > Still flaky:
            > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
            > testEventIdOutOfOrderInPartitionRegionSingleHop
            > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Conichols%40vmware.com%7C04ab8ffbe5824149f3c408d92b6ff8b3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588582189629337%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=qmFx1X78eoGaomBiukube64kInwYjiY%2BBa3hoyt%2B2BM%3D&amp;reserved=0>
            >
            > Thanks,
            > Kirk
            >




Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Mark Hanson <ha...@vmware.com>.
I think that process is a bit much. PRs are already a challenge. What do people think about code owners being the gate. We can accept by custom that there should be no stress-new-test failures. If there is a failure, a code owner can request a change or decide to let it go. I think that is sufficient all things considered.

Thanks,
Mark

On 6/9/21, 10:43 AM, "Owen Nichols" <on...@vmware.com> wrote:

    I feel that a traditional [DISCUSS] and [VOTE] on the dev list would be sufficient and proper to grant approval for an override.  Any PR already needs approval from 1 codeowner per area to merge, so codeowners already have a little more say because they hold veto power over the PR.

    In terms of "practicalities of how this would actually work":
    Step 1: start a [DISCUSS] thread explaining the problem and why you think an override is justified
    Step 2: if there is concensus, [VOTE]
    Step 3: Myself (or whoever performs the override) must cite a link to the vote thread


    On 6/9/21, 10:16 AM, "Dale Emery" <de...@vmware.com> wrote:

        I too like #1 best for now… assuming it’s possible to give code owners this ability.

        Coincidentally, about option #3, II was reading the git release notes just now, and noticed there’s a new “trailers” feature. It gives git the ability to parse “key: value” pairs at the end of a commit message. We could potentially (with a sufficiently current version of git) use that to exclude a test from a PR stress test run.

        And, yeah, option #2 brings back the @FlakyTest annotation that we worked so hard to eliminate.

        As Mark said, none of this fixes the underlying problem, which I’d frame as: We have too many tests whose results we don’t trust.

        Dale

        From: Kirk Lund <kl...@apache.org>
        Date: Wednesday, June 9, 2021 at 9:59 AM
        To: dev@geode.apache.org <de...@geode.apache.org>
        Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
        I do like the suggestions offered up by Dale and would encourage (or even
        plead) with my fellow contributors to consider these:

          *   Allow code owners to override the block, if they can be convinced the
        > override is justified.
        >   *   Exclude troublesome tests from stress test runs, either via
        > annotations or via an `assumeThat(…)` that can detect that it’s running as
        > a stress test. Whatever the mechanism for excluding, it would be in the
        > code, and therefore subject to code owner review. (This, too, feels overly
        > broad to me, as it would exclude the test from all stress test runs.)
        >   *   A way to exclude a specific test method from running in the stress
        > tests for a specific PR or commit. I don’t have any ideas for how to
        > declare such an exclusion, but if it could be done via a file it would be
        > subject to code owner review.


        1) Allow code owners to override the block, if they can be convinced the
        override is justified.

        After all, if we don't trust our code owners...

        2-3) Use a custom annotation to exclude the test method or test class only
        from stress-new-test.

        At first I really liked this idea, but then we end up with growing a
        collection of flaky tests that are excluded in some way from
        stress-new-test that still occasionally fail in distributedTest.

        #1 really sounds like the best option to me. I believe that leaving our
        stress-new-test process as-is will only discourage everyone from fixing one
        or two flaky tests in a large dunit test. However, I also believe that if
        we give code owners the authority to override stress-new-test, then we
        need to encourage them not to override this too often.

        On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

        > Maybe we can find a way to relax the requirement, or to allow addressing
        > specific situations like the tangle you find yourself in.
        >
        > Removing the requirement altogether feels overly broad. I fear it would
        > allow us to quietly disregard all intermittent test failures, and I think
        > we already quietly (or even actively) disregard way too many kinds of
        > failures.
        >
        > I would prefer some way to explicitly disregard only the specific test
        > failures that prevent us from merging, and only with some amount of
        > explicit justification.
        >
        > I’m not sure what that would look like. Some half-baked possibilities:
        >
        >   *   Allow code owners to override the block, if they can be convinced
        > the override is justified.
        >   *   Exclude troublesome tests from stress test runs, either via
        > annotations or via an `assumeThat(…)` that can detect that it’s running as
        > a stress test. Whatever the mechanism for excluding, it would be in the
        > code, and therefore subject to code owner review. (This, too, feels overly
        > broad to me, as it would exclude the test from all stress test runs.)
        >   *   A way to exclude a specific test method from running in the stress
        > tests for a specific PR or commit. I don’t have any ideas for how to
        > declare such an exclusion, but if it could be done via a file it would be
        > subject to code owner review.
        >
        > Dale
        >
        > From: Kirk Lund <kl...@apache.org>
        > Date: Tuesday, June 8, 2021 at 9:33 AM
        > To: dev@geode.apache.org <de...@geode.apache.org>
        > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
        > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
        > doesn't really work as intended for fixing distributed tests that contain
        > multiple flaky test methods. In fact, I think it causes contributors to
        > avoid tackling flaky tests.
        >
        > I've been working on GEODE-9103: CI Failure:
        > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
        > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7Ca12bbaae61754cc97aaf08d92b6e06f3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588573832729185%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=rytjjO4WYMyWVEFdgB6iiofdqnWfpfOsSbV5bYU2y7s%3D&amp;reserved=0> and was able to fix it.
        >
        > However, stress-new-test-openjdk11 then continued to fail for other flaky
        > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
        > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
        > which remains flaky.
        >
        > Unfortunately, I cannot merge any of my fixes for
        > PutAllClientServerDistributedTest unless every single flaky test in it is
        > fixed by my PR. I think this is undesirable because it would be better to
        > merge the fix for 3 flaky test methods than none.
        >
        > UPDATE: After running my precheckin more times, I did get
        > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
        > loophole than anything because I didn't manage to fix GEODE-9242.
        >
        > Despite having PR #6542 eventually pass, I would like to discuss removing
        > or relaxing our requirement that stress-new-test-openjdk11 must pass in
        > order to merge a PR...
        >
        > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
        > PutAllClientServerDistributedTest
        > <
        > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7Ca12bbaae61754cc97aaf08d92b6e06f3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588573832729185%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=XyLDbqZHAfdn%2Fy4v9vzvyTHQ3sUdoeTmnR5x01XPG7I%3D&amp;reserved=0
        > >
        >
        > Fixed in PR #6542:
        > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
        > testPartialKeyInPRSingleHopWithRedundancy
        > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7Ca12bbaae61754cc97aaf08d92b6e06f3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588573832739140%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=YLHji46Vn2V%2BEFN6zhXD16w6zFM80qW8PX8TvcQcNlU%3D&amp;reserved=0>
        > * GEODE-9103: CI Failure:
        > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
        > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7Ca12bbaae61754cc97aaf08d92b6e06f3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588573832739140%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=EIXQjyBK3IHowFbLzrc%2FP0Xp%2Bdd88nT1wI0pJ42ZVds%3D&amp;reserved=0>
        > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
        > fails due to missing disk store after server restart
        > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7Ca12bbaae61754cc97aaf08d92b6e06f3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588573832739140%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=wYBThzd%2BWXC7SOhv6USn890EctcJd8oo%2FLhTnq8c9u0%3D&amp;reserved=0>
        >
        > Still flaky:
        > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
        > testEventIdOutOfOrderInPartitionRegionSingleHop
        > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7Ca12bbaae61754cc97aaf08d92b6e06f3%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588573832739140%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=I7UR7JtFaIUsjFBcwGyA0Le%2FsDeq09gRS2QIcIBRsVU%3D&amp;reserved=0>
        >
        > Thanks,
        > Kirk
        >



Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Owen Nichols <on...@vmware.com>.
I feel that a traditional [DISCUSS] and [VOTE] on the dev list would be sufficient and proper to grant approval for an override.  Any PR already needs approval from 1 codeowner per area to merge, so codeowners already have a little more say because they hold veto power over the PR.

In terms of "practicalities of how this would actually work":
Step 1: start a [DISCUSS] thread explaining the problem and why you think an override is justified
Step 2: if there is concensus, [VOTE]
Step 3: Myself (or whoever performs the override) must cite a link to the vote thread


On 6/9/21, 10:16 AM, "Dale Emery" <de...@vmware.com> wrote:

    I too like #1 best for now… assuming it’s possible to give code owners this ability.

    Coincidentally, about option #3, II was reading the git release notes just now, and noticed there’s a new “trailers” feature. It gives git the ability to parse “key: value” pairs at the end of a commit message. We could potentially (with a sufficiently current version of git) use that to exclude a test from a PR stress test run.

    And, yeah, option #2 brings back the @FlakyTest annotation that we worked so hard to eliminate.

    As Mark said, none of this fixes the underlying problem, which I’d frame as: We have too many tests whose results we don’t trust.

    Dale

    From: Kirk Lund <kl...@apache.org>
    Date: Wednesday, June 9, 2021 at 9:59 AM
    To: dev@geode.apache.org <de...@geode.apache.org>
    Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
    I do like the suggestions offered up by Dale and would encourage (or even
    plead) with my fellow contributors to consider these:

      *   Allow code owners to override the block, if they can be convinced the
    > override is justified.
    >   *   Exclude troublesome tests from stress test runs, either via
    > annotations or via an `assumeThat(…)` that can detect that it’s running as
    > a stress test. Whatever the mechanism for excluding, it would be in the
    > code, and therefore subject to code owner review. (This, too, feels overly
    > broad to me, as it would exclude the test from all stress test runs.)
    >   *   A way to exclude a specific test method from running in the stress
    > tests for a specific PR or commit. I don’t have any ideas for how to
    > declare such an exclusion, but if it could be done via a file it would be
    > subject to code owner review.


    1) Allow code owners to override the block, if they can be convinced the
    override is justified.

    After all, if we don't trust our code owners...

    2-3) Use a custom annotation to exclude the test method or test class only
    from stress-new-test.

    At first I really liked this idea, but then we end up with growing a
    collection of flaky tests that are excluded in some way from
    stress-new-test that still occasionally fail in distributedTest.

    #1 really sounds like the best option to me. I believe that leaving our
    stress-new-test process as-is will only discourage everyone from fixing one
    or two flaky tests in a large dunit test. However, I also believe that if
    we give code owners the authority to override stress-new-test, then we
    need to encourage them not to override this too often.

    On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

    > Maybe we can find a way to relax the requirement, or to allow addressing
    > specific situations like the tangle you find yourself in.
    >
    > Removing the requirement altogether feels overly broad. I fear it would
    > allow us to quietly disregard all intermittent test failures, and I think
    > we already quietly (or even actively) disregard way too many kinds of
    > failures.
    >
    > I would prefer some way to explicitly disregard only the specific test
    > failures that prevent us from merging, and only with some amount of
    > explicit justification.
    >
    > I’m not sure what that would look like. Some half-baked possibilities:
    >
    >   *   Allow code owners to override the block, if they can be convinced
    > the override is justified.
    >   *   Exclude troublesome tests from stress test runs, either via
    > annotations or via an `assumeThat(…)` that can detect that it’s running as
    > a stress test. Whatever the mechanism for excluding, it would be in the
    > code, and therefore subject to code owner review. (This, too, feels overly
    > broad to me, as it would exclude the test from all stress test runs.)
    >   *   A way to exclude a specific test method from running in the stress
    > tests for a specific PR or commit. I don’t have any ideas for how to
    > declare such an exclusion, but if it could be done via a file it would be
    > subject to code owner review.
    >
    > Dale
    >
    > From: Kirk Lund <kl...@apache.org>
    > Date: Tuesday, June 8, 2021 at 9:33 AM
    > To: dev@geode.apache.org <de...@geode.apache.org>
    > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
    > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
    > doesn't really work as intended for fixing distributed tests that contain
    > multiple flaky test methods. In fact, I think it causes contributors to
    > avoid tackling flaky tests.
    >
    > I've been working on GEODE-9103: CI Failure:
    > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7C14d67e61cd184b8eb8ab08d92b6a46f8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588557725694117%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QcySjOaczA5DJFCJYZVjju%2FKzqJshAWPFvYS%2BfF8HDg%3D&amp;reserved=0> and was able to fix it.
    >
    > However, stress-new-test-openjdk11 then continued to fail for other flaky
    > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
    > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
    > which remains flaky.
    >
    > Unfortunately, I cannot merge any of my fixes for
    > PutAllClientServerDistributedTest unless every single flaky test in it is
    > fixed by my PR. I think this is undesirable because it would be better to
    > merge the fix for 3 flaky test methods than none.
    >
    > UPDATE: After running my precheckin more times, I did get
    > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
    > loophole than anything because I didn't manage to fix GEODE-9242.
    >
    > Despite having PR #6542 eventually pass, I would like to discuss removing
    > or relaxing our requirement that stress-new-test-openjdk11 must pass in
    > order to merge a PR...
    >
    > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
    > PutAllClientServerDistributedTest
    > <
    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Conichols%40vmware.com%7C14d67e61cd184b8eb8ab08d92b6a46f8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588557725694117%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=1aAUOMAR9SlIeJxnE5uGa1n3HHVfi1LtP2veAt0Lreo%3D&amp;reserved=0
    > >
    >
    > Fixed in PR #6542:
    > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
    > testPartialKeyInPRSingleHopWithRedundancy
    > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Conichols%40vmware.com%7C14d67e61cd184b8eb8ab08d92b6a46f8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588557725694117%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=w%2F3Df%2FmWrm00OVk3TmdiX7xMgUTZ0sq9peoTHz45eT4%3D&amp;reserved=0>
    > * GEODE-9103: CI Failure:
    > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Conichols%40vmware.com%7C14d67e61cd184b8eb8ab08d92b6a46f8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588557725694117%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QcySjOaczA5DJFCJYZVjju%2FKzqJshAWPFvYS%2BfF8HDg%3D&amp;reserved=0>
    > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
    > fails due to missing disk store after server restart
    > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Conichols%40vmware.com%7C14d67e61cd184b8eb8ab08d92b6a46f8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588557725694117%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=z22vXRxOobWZsF9sDhfYgphT4nMtgXJcXjmn23KoIu0%3D&amp;reserved=0>
    >
    > Still flaky:
    > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
    > testEventIdOutOfOrderInPartitionRegionSingleHop
    > <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Conichols%40vmware.com%7C14d67e61cd184b8eb8ab08d92b6a46f8%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588557725694117%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=jEhNfcqiIEgLKm3xZpdGw686%2FTGKJ3Hxi5nfW2vZDRQ%3D&amp;reserved=0>
    >
    > Thanks,
    > Kirk
    >


Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Dale Emery <de...@vmware.com>.
I too like #1 best for now… assuming it’s possible to give code owners this ability.

Coincidentally, about option #3, II was reading the git release notes just now, and noticed there’s a new “trailers” feature. It gives git the ability to parse “key: value” pairs at the end of a commit message. We could potentially (with a sufficiently current version of git) use that to exclude a test from a PR stress test run.

And, yeah, option #2 brings back the @FlakyTest annotation that we worked so hard to eliminate.

As Mark said, none of this fixes the underlying problem, which I’d frame as: We have too many tests whose results we don’t trust.

Dale

From: Kirk Lund <kl...@apache.org>
Date: Wednesday, June 9, 2021 at 9:59 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
I do like the suggestions offered up by Dale and would encourage (or even
plead) with my fellow contributors to consider these:

  *   Allow code owners to override the block, if they can be convinced the
> override is justified.
>   *   Exclude troublesome tests from stress test runs, either via
> annotations or via an `assumeThat(…)` that can detect that it’s running as
> a stress test. Whatever the mechanism for excluding, it would be in the
> code, and therefore subject to code owner review. (This, too, feels overly
> broad to me, as it would exclude the test from all stress test runs.)
>   *   A way to exclude a specific test method from running in the stress
> tests for a specific PR or commit. I don’t have any ideas for how to
> declare such an exclusion, but if it could be done via a file it would be
> subject to code owner review.


1) Allow code owners to override the block, if they can be convinced the
override is justified.

After all, if we don't trust our code owners...

2-3) Use a custom annotation to exclude the test method or test class only
from stress-new-test.

At first I really liked this idea, but then we end up with growing a
collection of flaky tests that are excluded in some way from
stress-new-test that still occasionally fail in distributedTest.

#1 really sounds like the best option to me. I believe that leaving our
stress-new-test process as-is will only discourage everyone from fixing one
or two flaky tests in a large dunit test. However, I also believe that if
we give code owners the authority to override stress-new-test, then we
need to encourage them not to override this too often.

On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

> Maybe we can find a way to relax the requirement, or to allow addressing
> specific situations like the tangle you find yourself in.
>
> Removing the requirement altogether feels overly broad. I fear it would
> allow us to quietly disregard all intermittent test failures, and I think
> we already quietly (or even actively) disregard way too many kinds of
> failures.
>
> I would prefer some way to explicitly disregard only the specific test
> failures that prevent us from merging, and only with some amount of
> explicit justification.
>
> I’m not sure what that would look like. Some half-baked possibilities:
>
>   *   Allow code owners to override the block, if they can be convinced
> the override is justified.
>   *   Exclude troublesome tests from stress test runs, either via
> annotations or via an `assumeThat(…)` that can detect that it’s running as
> a stress test. Whatever the mechanism for excluding, it would be in the
> code, and therefore subject to code owner review. (This, too, feels overly
> broad to me, as it would exclude the test from all stress test runs.)
>   *   A way to exclude a specific test method from running in the stress
> tests for a specific PR or commit. I don’t have any ideas for how to
> declare such an exclusion, but if it could be done via a file it would be
> subject to code owner review.
>
> Dale
>
> From: Kirk Lund <kl...@apache.org>
> Date: Tuesday, June 8, 2021 at 9:33 AM
> To: dev@geode.apache.org <de...@geode.apache.org>
> Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
> Our requirement for stress-new-test-openjdk11 to pass before allowing merge
> doesn't really work as intended for fixing distributed tests that contain
> multiple flaky test methods. In fact, I think it causes contributors to
> avoid tackling flaky tests.
>
> I've been working on GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <https://issues.apache.org/jira/browse/GEODE-9103> and was able to fix it.
>
> However, stress-new-test-openjdk11 then continued to fail for other flaky
> tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
> GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
> which remains flaky.
>
> Unfortunately, I cannot merge any of my fixes for
> PutAllClientServerDistributedTest unless every single flaky test in it is
> fixed by my PR. I think this is undesirable because it would be better to
> merge the fix for 3 flaky test methods than none.
>
> UPDATE: After running my precheckin more times, I did get
> stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
> loophole than anything because I didn't manage to fix GEODE-9242.
>
> Despite having PR #6542 eventually pass, I would like to discuss removing
> or relaxing our requirement that stress-new-test-openjdk11 must pass in
> order to merge a PR...
>
> PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
> PutAllClientServerDistributedTest
> <
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cdemery%40vmware.com%7C2952fe3bb6af4bdeccb408d92b67f465%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588547766811809%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=cOS5Z8%2BlgjLFW%2FVdzix8bfGVsth4BnCl73HD6dvV8Y8%3D&amp;reserved=0
> >
>
> Fixed in PR #6542:
> * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
> testPartialKeyInPRSingleHopWithRedundancy
> <https://issues.apache.org/jira/browse/GEODE-9296>
> * GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <https://issues.apache.org/jira/browse/GEODE-9103>
> * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
> fails due to missing disk store after server restart
> <https://issues.apache.org/jira/browse/GEODE-8528>
>
> Still flaky:
> * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
> testEventIdOutOfOrderInPartitionRegionSingleHop
> <https://issues.apache.org/jira/browse/GEODE-9242>
>
> Thanks,
> Kirk
>

Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Kirk Lund <kl...@apache.org>.
I do like the suggestions offered up by Dale and would encourage (or even
plead) with my fellow contributors to consider these:

  *   Allow code owners to override the block, if they can be convinced the
> override is justified.
>   *   Exclude troublesome tests from stress test runs, either via
> annotations or via an `assumeThat(…)` that can detect that it’s running as
> a stress test. Whatever the mechanism for excluding, it would be in the
> code, and therefore subject to code owner review. (This, too, feels overly
> broad to me, as it would exclude the test from all stress test runs.)
>   *   A way to exclude a specific test method from running in the stress
> tests for a specific PR or commit. I don’t have any ideas for how to
> declare such an exclusion, but if it could be done via a file it would be
> subject to code owner review.


1) Allow code owners to override the block, if they can be convinced the
override is justified.

After all, if we don't trust our code owners...

2-3) Use a custom annotation to exclude the test method or test class only
from stress-new-test.

At first I really liked this idea, but then we end up with growing a
collection of flaky tests that are excluded in some way from
stress-new-test that still occasionally fail in distributedTest.

#1 really sounds like the best option to me. I believe that leaving our
stress-new-test process as-is will only discourage everyone from fixing one
or two flaky tests in a large dunit test. However, I also believe that if
we give code owners the authority to override stress-new-test, then we
need to encourage them not to override this too often.

On Tue, Jun 8, 2021 at 11:11 AM Dale Emery <de...@vmware.com> wrote:

> Maybe we can find a way to relax the requirement, or to allow addressing
> specific situations like the tangle you find yourself in.
>
> Removing the requirement altogether feels overly broad. I fear it would
> allow us to quietly disregard all intermittent test failures, and I think
> we already quietly (or even actively) disregard way too many kinds of
> failures.
>
> I would prefer some way to explicitly disregard only the specific test
> failures that prevent us from merging, and only with some amount of
> explicit justification.
>
> I’m not sure what that would look like. Some half-baked possibilities:
>
>   *   Allow code owners to override the block, if they can be convinced
> the override is justified.
>   *   Exclude troublesome tests from stress test runs, either via
> annotations or via an `assumeThat(…)` that can detect that it’s running as
> a stress test. Whatever the mechanism for excluding, it would be in the
> code, and therefore subject to code owner review. (This, too, feels overly
> broad to me, as it would exclude the test from all stress test runs.)
>   *   A way to exclude a specific test method from running in the stress
> tests for a specific PR or commit. I don’t have any ideas for how to
> declare such an exclusion, but if it could be done via a file it would be
> subject to code owner review.
>
> Dale
>
> From: Kirk Lund <kl...@apache.org>
> Date: Tuesday, June 8, 2021 at 9:33 AM
> To: dev@geode.apache.org <de...@geode.apache.org>
> Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
> Our requirement for stress-new-test-openjdk11 to pass before allowing merge
> doesn't really work as intended for fixing distributed tests that contain
> multiple flaky test methods. In fact, I think it causes contributors to
> avoid tackling flaky tests.
>
> I've been working on GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <https://issues.apache.org/jira/browse/GEODE-9103> and was able to fix it.
>
> However, stress-new-test-openjdk11 then continued to fail for other flaky
> tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
> GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
> which remains flaky.
>
> Unfortunately, I cannot merge any of my fixes for
> PutAllClientServerDistributedTest unless every single flaky test in it is
> fixed by my PR. I think this is undesirable because it would be better to
> merge the fix for 3 flaky test methods than none.
>
> UPDATE: After running my precheckin more times, I did get
> stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
> loophole than anything because I didn't manage to fix GEODE-9242.
>
> Despite having PR #6542 eventually pass, I would like to discuss removing
> or relaxing our requirement that stress-new-test-openjdk11 must pass in
> order to merge a PR...
>
> PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
> PutAllClientServerDistributedTest
> <
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cdemery%40vmware.com%7Ca1f2e89aed9a4cb0940c08d92a9b2ac6%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668190431746%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=CEeIvbWkOY%2B00EEVMUIqcsil%2BYUXLcIiyhSiVUxlD%2Fs%3D&amp;reserved=0
> >
>
> Fixed in PR #6542:
> * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
> testPartialKeyInPRSingleHopWithRedundancy
> <https://issues.apache.org/jira/browse/GEODE-9296>
> * GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <https://issues.apache.org/jira/browse/GEODE-9103>
> * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
> fails due to missing disk store after server restart
> <https://issues.apache.org/jira/browse/GEODE-8528>
>
> Still flaky:
> * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
> testEventIdOutOfOrderInPartitionRegionSingleHop
> <https://issues.apache.org/jira/browse/GEODE-9242>
>
> Thanks,
> Kirk
>

Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Dale Emery <de...@vmware.com>.
Maybe we can find a way to relax the requirement, or to allow addressing specific situations like the tangle you find yourself in.

Removing the requirement altogether feels overly broad. I fear it would allow us to quietly disregard all intermittent test failures, and I think we already quietly (or even actively) disregard way too many kinds of failures.

I would prefer some way to explicitly disregard only the specific test failures that prevent us from merging, and only with some amount of explicit justification.

I’m not sure what that would look like. Some half-baked possibilities:

  *   Allow code owners to override the block, if they can be convinced the override is justified.
  *   Exclude troublesome tests from stress test runs, either via annotations or via an `assumeThat(…)` that can detect that it’s running as a stress test. Whatever the mechanism for excluding, it would be in the code, and therefore subject to code owner review. (This, too, feels overly broad to me, as it would exclude the test from all stress test runs.)
  *   A way to exclude a specific test method from running in the stress tests for a specific PR or commit. I don’t have any ideas for how to declare such an exclusion, but if it could be done via a file it would be subject to code owner review.

Dale

From: Kirk Lund <kl...@apache.org>
Date: Tuesday, June 8, 2021 at 9:33 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
Our requirement for stress-new-test-openjdk11 to pass before allowing merge
doesn't really work as intended for fixing distributed tests that contain
multiple flaky test methods. In fact, I think it causes contributors to
avoid tackling flaky tests.

I've been working on GEODE-9103: CI Failure:
PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
<https://issues.apache.org/jira/browse/GEODE-9103> and was able to fix it.

However, stress-new-test-openjdk11 then continued to fail for other flaky
tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
which remains flaky.

Unfortunately, I cannot merge any of my fixes for
PutAllClientServerDistributedTest unless every single flaky test in it is
fixed by my PR. I think this is undesirable because it would be better to
merge the fix for 3 flaky test methods than none.

UPDATE: After running my precheckin more times, I did get
stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
loophole than anything because I didn't manage to fix GEODE-9242.

Despite having PR #6542 eventually pass, I would like to discuss removing
or relaxing our requirement that stress-new-test-openjdk11 must pass in
order to merge a PR...

PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
PutAllClientServerDistributedTest
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cdemery%40vmware.com%7Ca1f2e89aed9a4cb0940c08d92a9b2ac6%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668190431746%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=CEeIvbWkOY%2B00EEVMUIqcsil%2BYUXLcIiyhSiVUxlD%2Fs%3D&amp;reserved=0>

Fixed in PR #6542:
* GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
testPartialKeyInPRSingleHopWithRedundancy
<https://issues.apache.org/jira/browse/GEODE-9296>
* GEODE-9103: CI Failure:
PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
<https://issues.apache.org/jira/browse/GEODE-9103>
* GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
fails due to missing disk store after server restart
<https://issues.apache.org/jira/browse/GEODE-8528>

Still flaky:
* GEODE-9242: CI failure in PutAllClientServerDistributedTest >
testEventIdOutOfOrderInPartitionRegionSingleHop
<https://issues.apache.org/jira/browse/GEODE-9242>

Thanks,
Kirk

Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Jacob Barrett <ja...@vmware.com>.
I think repeated tests shouldn’t be a blocker to merging for the reasons outlined below. A committer that is a good steward for the project should be allowed to make the judgement call when merging a PR. We have placed too many rigid processes in place that eliminated the good judgement of committers. If we think committers lack good judgment and need these rigid checks then perhaps we should be reconsidering who we allow as committers.

-Jake


> On Jun 8, 2021, at 9:33 AM, Kirk Lund <kl...@apache.org> wrote:
> 
> Our requirement for stress-new-test-openjdk11 to pass before allowing merge
> doesn't really work as intended for fixing distributed tests that contain
> multiple flaky test methods. In fact, I think it causes contributors to
> avoid tackling flaky tests.
> 
> I've been working on GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <https://issues.apache.org/jira/browse/GEODE-9103> and was able to fix it.
> 
> However, stress-new-test-openjdk11 then continued to fail for other flaky
> tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
> GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
> which remains flaky.
> 
> Unfortunately, I cannot merge any of my fixes for
> PutAllClientServerDistributedTest unless every single flaky test in it is
> fixed by my PR. I think this is undesirable because it would be better to
> merge the fix for 3 flaky test methods than none.
> 
> UPDATE: After running my precheckin more times, I did get
> stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
> loophole than anything because I didn't manage to fix GEODE-9242.
> 
> Despite having PR #6542 eventually pass, I would like to discuss removing
> or relaxing our requirement that stress-new-test-openjdk11 must pass in
> order to merge a PR...
> 
> PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
> PutAllClientServerDistributedTest
> <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cjabarrett%40vmware.com%7C010c8c9b76a84e44ee2f08d92a9b2a33%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668189086895%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=GlEh8TaK8eQH69NcD8bUKYe7hTl9LCdgc%2BVxwDFeE6U%3D&amp;reserved=0>
> 
> Fixed in PR #6542:
> * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
> testPartialKeyInPRSingleHopWithRedundancy
> <https://issues.apache.org/jira/browse/GEODE-9296>
> * GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <https://issues.apache.org/jira/browse/GEODE-9103>
> * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
> fails due to missing disk store after server restart
> <https://issues.apache.org/jira/browse/GEODE-8528>
> 
> Still flaky:
> * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
> testEventIdOutOfOrderInPartitionRegionSingleHop
> <https://issues.apache.org/jira/browse/GEODE-9242>
> 
> Thanks,
> Kirk


Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Donal Evans <do...@vmware.com>.
I think that there should be room for developers to bypass the stress-new-test jobs requirement in cases like this, where the test that's failing is an existing flaky test and that failure is blocking other flaky tests/bugs from being fixed. Splitting the class up won't save us, because stress-new-test runs on any changed or newly created test class, so no matter what you try and do, you'll have to run with the remaining flaky test at some point, which will block a PR. In Kirk's case, short of re-running the job until you get lucky and don't hit the existing flaky failure, there is no way to merge the fix for the other flaky tests other than bypassing stress-new-test or @Ignore-ing the existing flaky test (and if we're going to allow that approach, we may as well just ignore all of them until they're fixed).

We already allow stress-new-test to be skipped in situations where the number of modified test classes is large (like the recent change to use curly brackets on all control flow statements, which almost certainly would have hit flaky failures in pre-checkin if stress-new-test had run despite not changing any logic) so there's some precedence for bypassing the job when the value of the changes (or the cost of running the job) outweighs the benefits of ensuring that new flakiness isn't being introduced. However, I do think that in the majority of cases, stress-new-test should run, and should block PRs if a test fails, as there is real value being provided by it in terms of preventing new flaky tests being added.

I like the idea of allowing code owners to bypass stress-new-test failures, but I'm a little unclear on the practicalities of how this would actually work. How would this bypassing be achieved? If a PR has multiple code owners spanning multiple areas of code, would all of them be able to bypass stress-new-test, or just the code owners for the class showing the flaky failure? If the latter, what would happen if two classes with different code owners were modified and both had existing flaky tests in them? Would a separate override be needed from code owners for each class?

I think we need to strike a balance between making it *possible* to override stress-new-test failures and making it *easy* to override them, because I don't think it should be something that's ever done casually and without careful thought. With that in mind, I would suggest that it might be best to bring these rare cases to the dev list and make it a community decision (or at least not just a code-owner decision), even if that can be a slow and bureaucratic process at times.

Also, big +1 to the idea of having a continued, concerted effort to tackle flaky tests. Such efforts have been made in the past and were really successful, both in terms of reducing overall test flakiness and providing valuable insight into how to avoid creating flaky tests in the first place.
________________________________
From: Mark Hanson <ha...@vmware.com>
Sent: Wednesday, June 9, 2021 10:16 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

One other thing to think about is perhaps having a rotating team to deal with flaky tests,  a small team commissioned every three or 6 months to clear out flaky tests for 1 month. It is good experience

Thanks,
Mark

On 6/9/21, 10:04 AM, "Mark Hanson" <ha...@vmware.com> wrote:

    One other thing is that we have code owners. If the PR submitter decides to ignore the stress new test, the code owner can still request a fix. That is probably a sufficient process.

    On 6/9/21, 10:02 AM, "Mark Hanson" <ha...@vmware.com> wrote:

        I agree, I am willing to concede this discussion, as long as we are judicious and specifically only use it when it is not our test that is failing. It is a real problem.

        Thanks,
        Mark
        On 6/9/21, 9:46 AM, "Kirk Lund" <kl...@apache.org> wrote:

            I did think about splitting up dunit tests, but I believe
            testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky even if I
            move it to a new dunit test. No matter how you dice it up, we end up with a
            PR that cannot be merged to develop unless you get lucky after running
            stress-new-test many times.

            One could try being tricky by marking it with @ignore or deleting the flaky
            test in one PR, and then re-add it in a 2nd PR. But, even then that 2nd PR
            is very unlikely to pass stress-new-test unless someone fixes the cause of
            that test's flakiness.

            As it stands, stress-new-test just ends up being a dead-end or an endless
            time-sync for fixing multiple flaky tests in one dunit.

            On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <da...@vmware.com> wrote:

            > Would it be possible to just split that test up into multiple classes? It
            > sounds like the issue is that there is so many flaky tests in that class
            > that you can't fix them all in one PR, which might indicate it's too big.
            >
            > If we can't get StressNewTest to pass - that means our builds are failing
            > more than 2% of the time due to this one test failure. Yikes!
            >
            > -Dan
            > ________________________________
            > From: Kirk Lund <kl...@apache.org>
            > Sent: Tuesday, June 8, 2021 9:33 AM
            > To: dev@geode.apache.org <de...@geode.apache.org>
            > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
            >
            > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
            > doesn't really work as intended for fixing distributed tests that contain
            > multiple flaky test methods. In fact, I think it causes contributors to
            > avoid tackling flaky tests.
            >
            > I've been working on GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067757752%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=DBtwYMtKxKq7j1HZJLtkk4mE7PR7aIkuEPelAJzdb7w%3D&amp;reserved=0>
            > and was able to fix it.
            >
            > However, stress-new-test-openjdk11 then continued to fail for other flaky
            > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
            > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
            > which remains flaky.
            >
            > Unfortunately, I cannot merge any of my fixes for
            > PutAllClientServerDistributedTest unless every single flaky test in it is
            > fixed by my PR. I think this is undesirable because it would be better to
            > merge the fix for 3 flaky test methods than none.
            >
            > UPDATE: After running my precheckin more times, I did get
            > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
            > loophole than anything because I didn't manage to fix GEODE-9242.
            >
            > Despite having PR #6542 eventually pass, I would like to discuss removing
            > or relaxing our requirement that stress-new-test-openjdk11 must pass in
            > order to merge a PR...
            >
            > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
            > PutAllClientServerDistributedTest
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067757752%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=e%2Bi8qHJJ55q8EjoEVDdjlSx0nF5MEsaIzybLFIOdbbA%3D&amp;reserved=0
            > >
            >
            > Fixed in PR #6542:
            > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
            > testPartialKeyInPRSingleHopWithRedundancy
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067757752%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=O06rBnGzNbBobYGLbnBrbCv%2FhqRUyYzvAkod6XKjjEM%3D&amp;reserved=0
            > >
            > * GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067767717%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=zzuLonsCrI5T5PgFdp1a0bC2GzBk4Va17Ga5uRWWdsQ%3D&amp;reserved=0
            > >
            > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
            > fails due to missing disk store after server restart
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067767717%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=EX0QSFrj%2FvZXhGY8EuUEJUG5IMQH639C1ATfYF86mlo%3D&amp;reserved=0
            > >
            >
            > Still flaky:
            > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
            > testEventIdOutOfOrderInPartitionRegionSingleHop
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Cdoevans%40vmware.com%7C7a86cc46b6254f6a1fc808d92b6a5b68%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588558067767717%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=b%2B8g00GsW8H8RxmgWrvgDZU6HsOPzTGLZCahhCvhhsI%3D&amp;reserved=0
            > >
            >
            > Thanks,
            > Kirk
            >




Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Mark Hanson <ha...@vmware.com>.
One other thing to think about is perhaps having a rotating team to deal with flaky tests,  a small team commissioned every three or 6 months to clear out flaky tests for 1 month. It is good experience

Thanks,
Mark

On 6/9/21, 10:04 AM, "Mark Hanson" <ha...@vmware.com> wrote:

    One other thing is that we have code owners. If the PR submitter decides to ignore the stress new test, the code owner can still request a fix. That is probably a sufficient process.

    On 6/9/21, 10:02 AM, "Mark Hanson" <ha...@vmware.com> wrote:

        I agree, I am willing to concede this discussion, as long as we are judicious and specifically only use it when it is not our test that is failing. It is a real problem. 

        Thanks,
        Mark
        On 6/9/21, 9:46 AM, "Kirk Lund" <kl...@apache.org> wrote:

            I did think about splitting up dunit tests, but I believe
            testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky even if I
            move it to a new dunit test. No matter how you dice it up, we end up with a
            PR that cannot be merged to develop unless you get lucky after running
            stress-new-test many times.

            One could try being tricky by marking it with @ignore or deleting the flaky
            test in one PR, and then re-add it in a 2nd PR. But, even then that 2nd PR
            is very unlikely to pass stress-new-test unless someone fixes the cause of
            that test's flakiness.

            As it stands, stress-new-test just ends up being a dead-end or an endless
            time-sync for fixing multiple flaky tests in one dunit.

            On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <da...@vmware.com> wrote:

            > Would it be possible to just split that test up into multiple classes? It
            > sounds like the issue is that there is so many flaky tests in that class
            > that you can't fix them all in one PR, which might indicate it's too big.
            >
            > If we can't get StressNewTest to pass - that means our builds are failing
            > more than 2% of the time due to this one test failure. Yikes!
            >
            > -Dan
            > ________________________________
            > From: Kirk Lund <kl...@apache.org>
            > Sent: Tuesday, June 8, 2021 9:33 AM
            > To: dev@geode.apache.org <de...@geode.apache.org>
            > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
            >
            > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
            > doesn't really work as intended for fixing distributed tests that contain
            > multiple flaky test methods. In fact, I think it causes contributors to
            > avoid tackling flaky tests.
            >
            > I've been working on GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SJNMBA%2Fdu0w1%2BFINgw8tTVJhbro%2BIYQ3rGx6dgToWkk%3D&amp;reserved=0>
            > and was able to fix it.
            >
            > However, stress-new-test-openjdk11 then continued to fail for other flaky
            > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
            > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
            > which remains flaky.
            >
            > Unfortunately, I cannot merge any of my fixes for
            > PutAllClientServerDistributedTest unless every single flaky test in it is
            > fixed by my PR. I think this is undesirable because it would be better to
            > merge the fix for 3 flaky test methods than none.
            >
            > UPDATE: After running my precheckin more times, I did get
            > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
            > loophole than anything because I didn't manage to fix GEODE-9242.
            >
            > Despite having PR #6542 eventually pass, I would like to discuss removing
            > or relaxing our requirement that stress-new-test-openjdk11 must pass in
            > order to merge a PR...
            >
            > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
            > PutAllClientServerDistributedTest
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SNHVauGZeJaNr%2FexPEfcjZU8ci%2FJ6UKXlh8Jb8F0nqQ%3D&amp;reserved=0
            > >
            >
            > Fixed in PR #6542:
            > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
            > testPartialKeyInPRSingleHopWithRedundancy
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=yuk0ixdDUeqI%2BJgB1vmafUh2KrlYLhb1mtgzbklBXTE%3D&amp;reserved=0
            > >
            > * GEODE-9103: CI Failure:
            > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805554080%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SJNMBA%2Fdu0w1%2BFINgw8tTVJhbro%2BIYQ3rGx6dgToWkk%3D&amp;reserved=0
            > >
            > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
            > fails due to missing disk store after server restart
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805564035%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=2BrAfkArsXmTtjE1cwak41WhTVsiZjYYnVjcK6TrJS8%3D&amp;reserved=0
            > >
            >
            > Still flaky:
            > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
            > testEventIdOutOfOrderInPartitionRegionSingleHop
            > <
            > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7C800517c549824631351108d92b68aaa2%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588550805564035%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8nszzc01SnwsXR9MxNZ461LB0CYcdKlBdX6A7ihC74c%3D&amp;reserved=0
            > >
            >
            > Thanks,
            > Kirk
            >




Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Mark Hanson <ha...@vmware.com>.
One other thing is that we have code owners. If the PR submitter decides to ignore the stress new test, the code owner can still request a fix. That is probably a sufficient process.

On 6/9/21, 10:02 AM, "Mark Hanson" <ha...@vmware.com> wrote:

    I agree, I am willing to concede this discussion, as long as we are judicious and specifically only use it when it is not our test that is failing. It is a real problem. 

    Thanks,
    Mark
    On 6/9/21, 9:46 AM, "Kirk Lund" <kl...@apache.org> wrote:

        I did think about splitting up dunit tests, but I believe
        testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky even if I
        move it to a new dunit test. No matter how you dice it up, we end up with a
        PR that cannot be merged to develop unless you get lucky after running
        stress-new-test many times.

        One could try being tricky by marking it with @ignore or deleting the flaky
        test in one PR, and then re-add it in a 2nd PR. But, even then that 2nd PR
        is very unlikely to pass stress-new-test unless someone fixes the cause of
        that test's flakiness.

        As it stands, stress-new-test just ends up being a dead-end or an endless
        time-sync for fixing multiple flaky tests in one dunit.

        On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <da...@vmware.com> wrote:

        > Would it be possible to just split that test up into multiple classes? It
        > sounds like the issue is that there is so many flaky tests in that class
        > that you can't fix them all in one PR, which might indicate it's too big.
        >
        > If we can't get StressNewTest to pass - that means our builds are failing
        > more than 2% of the time due to this one test failure. Yikes!
        >
        > -Dan
        > ________________________________
        > From: Kirk Lund <kl...@apache.org>
        > Sent: Tuesday, June 8, 2021 9:33 AM
        > To: dev@geode.apache.org <de...@geode.apache.org>
        > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
        >
        > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
        > doesn't really work as intended for fixing distributed tests that contain
        > multiple flaky test methods. In fact, I think it causes contributors to
        > avoid tackling flaky tests.
        >
        > I've been working on GEODE-9103: CI Failure:
        > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
        > <
        > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393164021%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=HJd0ZQcLPldhtt5%2Bw%2FxJSDcTOUxd658Um45esI33BNQ%3D&amp;reserved=0>
        > and was able to fix it.
        >
        > However, stress-new-test-openjdk11 then continued to fail for other flaky
        > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
        > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
        > which remains flaky.
        >
        > Unfortunately, I cannot merge any of my fixes for
        > PutAllClientServerDistributedTest unless every single flaky test in it is
        > fixed by my PR. I think this is undesirable because it would be better to
        > merge the fix for 3 flaky test methods than none.
        >
        > UPDATE: After running my precheckin more times, I did get
        > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
        > loophole than anything because I didn't manage to fix GEODE-9242.
        >
        > Despite having PR #6542 eventually pass, I would like to discuss removing
        > or relaxing our requirement that stress-new-test-openjdk11 must pass in
        > order to merge a PR...
        >
        > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
        > PutAllClientServerDistributedTest
        > <
        > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393164021%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=l8OHw3XqetJVrK3Ltp9ZjSUfxvRd5y1087m62eemskA%3D&amp;reserved=0
        > >
        >
        > Fixed in PR #6542:
        > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
        > testPartialKeyInPRSingleHopWithRedundancy
        > <
        > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393164021%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=SbcGFpxr9KJwnpFflJyP5nKLBqzPE%2BN2bnTTyHFjl7Q%3D&amp;reserved=0
        > >
        > * GEODE-9103: CI Failure:
        > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
        > <
        > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393173973%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=4XtlblZ%2FhLCGRZunt1o%2BpwRhtnQZ%2BFLweZ5%2B%2FdRR%2BlA%3D&amp;reserved=0
        > >
        > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
        > fails due to missing disk store after server restart
        > <
        > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393173973%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=LYOOFXczCGemBLV%2FdREukaexvV93%2FmnlUzZamdsWzqI%3D&amp;reserved=0
        > >
        >
        > Still flaky:
        > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
        > testEventIdOutOfOrderInPartitionRegionSingleHop
        > <
        > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7Cbf65c71249894b18853f08d92b6834f9%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588549393173973%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=voQKjCBsArlWaAvLVSpwRSLdp0yqcOjo6NAGBF9sP%2B4%3D&amp;reserved=0
        > >
        >
        > Thanks,
        > Kirk
        >



Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Mark Hanson <ha...@vmware.com>.
I agree, I am willing to concede this discussion, as long as we are judicious and specifically only use it when it is not our test that is failing. It is a real problem. 

Thanks,
Mark
On 6/9/21, 9:46 AM, "Kirk Lund" <kl...@apache.org> wrote:

    I did think about splitting up dunit tests, but I believe
    testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky even if I
    move it to a new dunit test. No matter how you dice it up, we end up with a
    PR that cannot be merged to develop unless you get lucky after running
    stress-new-test many times.

    One could try being tricky by marking it with @ignore or deleting the flaky
    test in one PR, and then re-add it in a 2nd PR. But, even then that 2nd PR
    is very unlikely to pass stress-new-test unless someone fixes the cause of
    that test's flakiness.

    As it stands, stress-new-test just ends up being a dead-end or an endless
    time-sync for fixing multiple flaky tests in one dunit.

    On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <da...@vmware.com> wrote:

    > Would it be possible to just split that test up into multiple classes? It
    > sounds like the issue is that there is so many flaky tests in that class
    > that you can't fix them all in one PR, which might indicate it's too big.
    >
    > If we can't get StressNewTest to pass - that means our builds are failing
    > more than 2% of the time due to this one test failure. Yikes!
    >
    > -Dan
    > ________________________________
    > From: Kirk Lund <kl...@apache.org>
    > Sent: Tuesday, June 8, 2021 9:33 AM
    > To: dev@geode.apache.org <de...@geode.apache.org>
    > Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
    >
    > Our requirement for stress-new-test-openjdk11 to pass before allowing merge
    > doesn't really work as intended for fixing distributed tests that contain
    > multiple flaky test methods. In fact, I think it causes contributors to
    > avoid tackling flaky tests.
    >
    > I've been working on GEODE-9103: CI Failure:
    > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    > <
    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=XNyeNQmLHIIyRgpnXRwbHn%2F2lph6S%2BhiO%2BdcOOgeoSc%3D&amp;reserved=0>
    > and was able to fix it.
    >
    > However, stress-new-test-openjdk11 then continued to fail for other flaky
    > tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
    > GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
    > which remains flaky.
    >
    > Unfortunately, I cannot merge any of my fixes for
    > PutAllClientServerDistributedTest unless every single flaky test in it is
    > fixed by my PR. I think this is undesirable because it would be better to
    > merge the fix for 3 flaky test methods than none.
    >
    > UPDATE: After running my precheckin more times, I did get
    > stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
    > loophole than anything because I didn't manage to fix GEODE-9242.
    >
    > Despite having PR #6542 eventually pass, I would like to discuss removing
    > or relaxing our requirement that stress-new-test-openjdk11 must pass in
    > order to merge a PR...
    >
    > PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
    > PutAllClientServerDistributedTest
    > <
    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=gRAwmAsrHsH4p9jkKTuC63nJPRGBW6TGsYG41xpLF4w%3D&amp;reserved=0
    > >
    >
    > Fixed in PR #6542:
    > * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
    > testPartialKeyInPRSingleHopWithRedundancy
    > <
    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=Q3ogdEG4omEyoc7Wrg2yMZEC0HWDTKhVqKMCJ8ZqTzc%3D&amp;reserved=0
    > >
    > * GEODE-9103: CI Failure:
    > PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    > <
    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=XNyeNQmLHIIyRgpnXRwbHn%2F2lph6S%2BhiO%2BdcOOgeoSc%3D&amp;reserved=0
    > >
    > * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
    > fails due to missing disk store after server restart
    > <
    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=2zkPkNsCFNbXHre2ekltWWrckA74YKD2SqUjp5Tehbg%3D&amp;reserved=0
    > >
    >
    > Still flaky:
    > * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
    > testEventIdOutOfOrderInPartitionRegionSingleHop
    > <
    > https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7C27b2978d53d64c9c886708d92b661269%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637588539667053065%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=fINGTY1FLXuJkr14LZNasMed6%2B1aJw9nIldCzzSf6rM%3D&amp;reserved=0
    > >
    >
    > Thanks,
    > Kirk
    >


Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Kirk Lund <kl...@apache.org>.
I did think about splitting up dunit tests, but I believe
testEventIdOutOfOrderInPartitionRegionSingleHop will remain flaky even if I
move it to a new dunit test. No matter how you dice it up, we end up with a
PR that cannot be merged to develop unless you get lucky after running
stress-new-test many times.

One could try being tricky by marking it with @ignore or deleting the flaky
test in one PR, and then re-add it in a 2nd PR. But, even then that 2nd PR
is very unlikely to pass stress-new-test unless someone fixes the cause of
that test's flakiness.

As it stands, stress-new-test just ends up being a dead-end or an endless
time-sync for fixing multiple flaky tests in one dunit.

On Tue, Jun 8, 2021 at 12:09 PM Dan Smith <da...@vmware.com> wrote:

> Would it be possible to just split that test up into multiple classes? It
> sounds like the issue is that there is so many flaky tests in that class
> that you can't fix them all in one PR, which might indicate it's too big.
>
> If we can't get StressNewTest to pass - that means our builds are failing
> more than 2% of the time due to this one test failure. Yikes!
>
> -Dan
> ________________________________
> From: Kirk Lund <kl...@apache.org>
> Sent: Tuesday, June 8, 2021 9:33 AM
> To: dev@geode.apache.org <de...@geode.apache.org>
> Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs
>
> Our requirement for stress-new-test-openjdk11 to pass before allowing merge
> doesn't really work as intended for fixing distributed tests that contain
> multiple flaky test methods. In fact, I think it causes contributors to
> avoid tackling flaky tests.
>
> I've been working on GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391258409%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=9EmtMx3BhTe2zoLdUMOjCEblldw0VigUMnK3O2Ia%2FRY%3D&amp;reserved=0>
> and was able to fix it.
>
> However, stress-new-test-openjdk11 then continued to fail for other flaky
> tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
> GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
> which remains flaky.
>
> Unfortunately, I cannot merge any of my fixes for
> PutAllClientServerDistributedTest unless every single flaky test in it is
> fixed by my PR. I think this is undesirable because it would be better to
> merge the fix for 3 flaky test methods than none.
>
> UPDATE: After running my precheckin more times, I did get
> stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
> loophole than anything because I didn't manage to fix GEODE-9242.
>
> Despite having PR #6542 eventually pass, I would like to discuss removing
> or relaxing our requirement that stress-new-test-openjdk11 must pass in
> order to merge a PR...
>
> PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
> PutAllClientServerDistributedTest
> <
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391258409%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=Z4WRZBNikFsIEDQuDmEpKwsZO2WqLATudaMix%2BDrfMs%3D&amp;reserved=0
> >
>
> Fixed in PR #6542:
> * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
> testPartialKeyInPRSingleHopWithRedundancy
> <
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391263407%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=%2FiNd3TORn9Al4Y%2BTwbzMmfy3jB7%2F5XxNibhtWtiCOfM%3D&amp;reserved=0
> >
> * GEODE-9103: CI Failure:
> PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
> <
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391263407%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=9LiNnczBArXP%2FoNjyyhRRgDnpJqWgMuQtYRwzscQ2TQ%3D&amp;reserved=0
> >
> * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
> fails due to missing disk store after server restart
> <
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391263407%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=Cq1PCAjkoR3OoClV3NQTFrzyDpL6FXDs0LrI%2B4vyiac%3D&amp;reserved=0
> >
>
> Still flaky:
> * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
> testEventIdOutOfOrderInPartitionRegionSingleHop
> <
> https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391263407%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=%2B20jJatRlhxWoYyihBD8yI%2FMaEVWJ0E17VgdjHODmvY%3D&amp;reserved=0
> >
>
> Thanks,
> Kirk
>

Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Mark Hanson <ha...@vmware.com>.
I think the basic problem is that we have too much tech debt in the form of dunit tests. We are proposing all of these "workarounds" to avoid dealing with the core problem.

On 6/8/21, 12:09 PM, "Dan Smith" <da...@vmware.com> wrote:

    Would it be possible to just split that test up into multiple classes? It sounds like the issue is that there is so many flaky tests in that class that you can't fix them all in one PR, which might indicate it's too big.

    If we can't get StressNewTest to pass - that means our builds are failing more than 2% of the time due to this one test failure. Yikes!

    -Dan
    ________________________________
    From: Kirk Lund <kl...@apache.org>
    Sent: Tuesday, June 8, 2021 9:33 AM
    To: dev@geode.apache.org <de...@geode.apache.org>
    Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

    Our requirement for stress-new-test-openjdk11 to pass before allowing merge
    doesn't really work as intended for fixing distributed tests that contain
    multiple flaky test methods. In fact, I think it causes contributors to
    avoid tackling flaky tests.

    I've been working on GEODE-9103: CI Failure:
    PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd125737262af4ab5bcdf08d92ab0f019%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637587761714858048%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=8AiHXLVBvN1woMC4Ow5VHBgkkmtsh1BJu7tpTyIu8KE%3D&amp;reserved=0> and was able to fix it.

    However, stress-new-test-openjdk11 then continued to fail for other flaky
    tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
    GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
    which remains flaky.

    Unfortunately, I cannot merge any of my fixes for
    PutAllClientServerDistributedTest unless every single flaky test in it is
    fixed by my PR. I think this is undesirable because it would be better to
    merge the fix for 3 flaky test methods than none.

    UPDATE: After running my precheckin more times, I did get
    stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
    loophole than anything because I didn't manage to fix GEODE-9242.

    Despite having PR #6542 eventually pass, I would like to discuss removing
    or relaxing our requirement that stress-new-test-openjdk11 must pass in
    order to merge a PR...

    PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
    PutAllClientServerDistributedTest
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd125737262af4ab5bcdf08d92ab0f019%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637587761714858048%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=n1lGURJOJcObxOUdULjQpMY44OdXOYmULmlZLzxIeQc%3D&amp;reserved=0>

    Fixed in PR #6542:
    * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
    testPartialKeyInPRSingleHopWithRedundancy
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd125737262af4ab5bcdf08d92ab0f019%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637587761714868006%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=%2B%2Baskz2pZv9tlhwyHCN6hzpMvGJQIB25uRSPmpU0W84%3D&amp;reserved=0>
    * GEODE-9103: CI Failure:
    PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd125737262af4ab5bcdf08d92ab0f019%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637587761714868006%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=MPM8Dl02u9nRN7rKIDGah8MAWhIPEymtrRcPtAAa9Fk%3D&amp;reserved=0>
    * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
    fails due to missing disk store after server restart
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd125737262af4ab5bcdf08d92ab0f019%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637587761714868006%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=G9KG9Cndd7I4D1iVjDuK8%2FVVRgkFtVRpGopQFI2zmr4%3D&amp;reserved=0>

    Still flaky:
    * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
    testEventIdOutOfOrderInPartitionRegionSingleHop
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7Cd125737262af4ab5bcdf08d92ab0f019%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C0%7C637587761714868006%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=QYE5f1dG9L5u7Nylw5az%2BMrnMOS%2Bdai0Q2FQeTUFZvw%3D&amp;reserved=0>

    Thanks,
    Kirk


Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Dan Smith <da...@vmware.com>.
Would it be possible to just split that test up into multiple classes? It sounds like the issue is that there is so many flaky tests in that class that you can't fix them all in one PR, which might indicate it's too big.

If we can't get StressNewTest to pass - that means our builds are failing more than 2% of the time due to this one test failure. Yikes!

-Dan
________________________________
From: Kirk Lund <kl...@apache.org>
Sent: Tuesday, June 8, 2021 9:33 AM
To: dev@geode.apache.org <de...@geode.apache.org>
Subject: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Our requirement for stress-new-test-openjdk11 to pass before allowing merge
doesn't really work as intended for fixing distributed tests that contain
multiple flaky test methods. In fact, I think it causes contributors to
avoid tackling flaky tests.

I've been working on GEODE-9103: CI Failure:
PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391258409%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=9EmtMx3BhTe2zoLdUMOjCEblldw0VigUMnK3O2Ia%2FRY%3D&amp;reserved=0> and was able to fix it.

However, stress-new-test-openjdk11 then continued to fail for other flaky
tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
which remains flaky.

Unfortunately, I cannot merge any of my fixes for
PutAllClientServerDistributedTest unless every single flaky test in it is
fixed by my PR. I think this is undesirable because it would be better to
merge the fix for 3 flaky test methods than none.

UPDATE: After running my precheckin more times, I did get
stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
loophole than anything because I didn't manage to fix GEODE-9242.

Despite having PR #6542 eventually pass, I would like to discuss removing
or relaxing our requirement that stress-new-test-openjdk11 must pass in
order to merge a PR...

PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
PutAllClientServerDistributedTest
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391258409%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=Z4WRZBNikFsIEDQuDmEpKwsZO2WqLATudaMix%2BDrfMs%3D&amp;reserved=0>

Fixed in PR #6542:
* GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
testPartialKeyInPRSingleHopWithRedundancy
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391263407%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=%2FiNd3TORn9Al4Y%2BTwbzMmfy3jB7%2F5XxNibhtWtiCOfM%3D&amp;reserved=0>
* GEODE-9103: CI Failure:
PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391263407%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=9LiNnczBArXP%2FoNjyyhRRgDnpJqWgMuQtYRwzscQ2TQ%3D&amp;reserved=0>
* GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
fails due to missing disk store after server restart
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391263407%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=Cq1PCAjkoR3OoClV3NQTFrzyDpL6FXDs0LrI%2B4vyiac%3D&amp;reserved=0>

Still flaky:
* GEODE-9242: CI failure in PutAllClientServerDistributedTest >
testEventIdOutOfOrderInPartitionRegionSingleHop
<https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Cdasmith%40vmware.com%7Cb7b22c48e3584f306fa408d92a9b36bd%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668391263407%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=%2B20jJatRlhxWoYyihBD8yI%2FMaEVWJ0E17VgdjHODmvY%3D&amp;reserved=0>

Thanks,
Kirk

Re: [DISCUSS] Remove stress-new-test-openjdk11 requirement from PRs

Posted by Mark Hanson <ha...@vmware.com>.
I understand the challenge, but I disagree. It is only through requirement that we keep new flakey tests out. While I don't think one should have to fix all the flaky tests to get their unrelated change in, I think it serves a purpose.

IMHO, the problems that you are seeing are indications that we are not keeping on top of fixing flakey tests, as recently indicated by our 15% pass rate on mass test runs. ( I have not seen  the results recently, I assume they are better, but the fact that it got to there is a sign). It seems we are not currently paying the full cost of changes, as indicated by decreasing pass rates.

Thanks,
Mark

On 6/8/21, 9:33 AM, "Kirk Lund" <kl...@apache.org> wrote:

    Our requirement for stress-new-test-openjdk11 to pass before allowing merge
    doesn't really work as intended for fixing distributed tests that contain
    multiple flaky test methods. In fact, I think it causes contributors to
    avoid tackling flaky tests.

    I've been working on GEODE-9103: CI Failure:
    PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C129111e6619d41283fbf08d92a9b2c67%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668225634197%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=iCEpdxkzLjNyEMRfHY5n05Nro3pg%2FgBS2Y3KN%2Frk3mQ%3D&amp;reserved=0> and was able to fix it.

    However, stress-new-test-openjdk11 then continued to fail for other flaky
    tests in PutAllClientServerDistributedTest. I managed to fix GEODE-9296 and
    GEODE-8528 as well. I also tried but have not been able to fix GEODE-9242
    which remains flaky.

    Unfortunately, I cannot merge any of my fixes for
    PutAllClientServerDistributedTest unless every single flaky test in it is
    fixed by my PR. I think this is undesirable because it would be better to
    merge the fix for 3 flaky test methods than none.

    UPDATE: After running my precheckin more times, I did get
    stress-new-test-openjdk11 to pass once so I can merge, but that's more of a
    loophole than anything because I didn't manage to fix GEODE-9242.

    Despite having PR #6542 eventually pass, I would like to discuss removing
    or relaxing our requirement that stress-new-test-openjdk11 must pass in
    order to merge a PR...

    PR #6542: GEODE-9103: Fix ServerConnectivityExceptions in
    PutAllClientServerDistributedTest
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fgithub.com%2Fapache%2Fgeode%2Fpull%2F6542&amp;data=04%7C01%7Chansonm%40vmware.com%7C129111e6619d41283fbf08d92a9b2c67%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668225634197%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=WzdY%2F9xs5%2FayYzqbwxc89cfWC02GMFELQf971drfr2s%3D&amp;reserved=0>

    Fixed in PR #6542:
    * GEODE-9296: CI Failure: PutAllClientServerDistributedTest >
    testPartialKeyInPRSingleHopWithRedundancy
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9296&amp;data=04%7C01%7Chansonm%40vmware.com%7C129111e6619d41283fbf08d92a9b2c67%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668225634197%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=HJ3x3lMUCT%2F2mQsHu%2FymfX%2FKDHb9CBQRxuA9Pp0q%2BJg%3D&amp;reserved=0>
    * GEODE-9103: CI Failure:
    PutAllClientServerDistributedTest.testPutAllReturnsExceptions FAILED
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9103&amp;data=04%7C01%7Chansonm%40vmware.com%7C129111e6619d41283fbf08d92a9b2c67%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668225634197%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=iCEpdxkzLjNyEMRfHY5n05Nro3pg%2FgBS2Y3KN%2Frk3mQ%3D&amp;reserved=0>
    * GEODE-8528: PutAllClientServerDistributedTest.testPartialKeyInPRSingleHop
    fails due to missing disk store after server restart
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-8528&amp;data=04%7C01%7Chansonm%40vmware.com%7C129111e6619d41283fbf08d92a9b2c67%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668225634197%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=7GreRNbawjMuVz0Ue%2ByE4ftlz1EO9XmTYEOWx0srix0%3D&amp;reserved=0>

    Still flaky:
    * GEODE-9242: CI failure in PutAllClientServerDistributedTest >
    testEventIdOutOfOrderInPartitionRegionSingleHop
    <https://nam04.safelinks.protection.outlook.com/?url=https%3A%2F%2Fissues.apache.org%2Fjira%2Fbrowse%2FGEODE-9242&amp;data=04%7C01%7Chansonm%40vmware.com%7C129111e6619d41283fbf08d92a9b2c67%7Cb39138ca3cee4b4aa4d6cd83d9dd62f0%7C0%7C1%7C637587668225634197%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C3000&amp;sdata=YnQiHN6FzLdkPdS3JSARx1JHC2%2Bu6eQ4Aq44dduyTL8%3D&amp;reserved=0>

    Thanks,
    Kirk