Posted to dev@beam.apache.org by Sam Rohde <sr...@google.com> on 2019/01/07 23:19:20 UTC

Add all tests to release validation

Hi All,

There are a number of tests in our system that are either flaky or
permanently red. I am suggesting to add, if not all, then most of the tests
(style, unit, integration, etc) to the release validation step. In this
way, we will add a regular cadence to ensuring greenness and no flaky tests
in Beam.

There are a number of ways of implementing this, but what I think might
work the best is to set up a process that either manually or automatically
creates a JIRA for the failing test and assigns it to a component tagged
with the release number. The release can then continue when all JIRAs are
closed by either fixing the failure or manually testing to ensure no
adverse side effects (this is in case there are environmental issues in the
testing infrastructure or otherwise).
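
For concreteness, a rough sketch of what the automated half could look like,
using the third-party Python "jira" client. The credentials, helper name, and
field values here are illustrative assumptions, not an existing Beam script:

    from jira import JIRA  # pip install jira

    jira = JIRA("https://issues.apache.org/jira", basic_auth=("me", "secret"))

    def file_test_failure(test_name, job_url, release="2.11.0"):
        # One tracking bug per failing test, tagged with the upcoming release.
        return jira.create_issue(
            project="BEAM",
            issuetype={"name": "Bug"},
            summary="Failing test: " + test_name,
            description="Seen failing in " + job_url,
            components=[{"name": "test-failures"}],
            fixVersions=[{"name": release}],
        )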

Thanks for reading, what do you think?
- Is there another, easier way to ensure that no test failures go unfixed?
- Can the process be automated?
- What am I missing?

Regards,
Sam

Re: Add all tests to release validation

Posted by Kenneth Knowles <kl...@google.com>.
Good points. I wasn't tuned in to those nuances of how the jobs are run. I
think we *could* cause a postcommit job to run against exactly that commit
hash instead of origin/master, but I won't advocate for that. My suggestion
of the "find a green commit" approach is a holdover from continuously
shipping services. It isn't terrifically important when you have a release
branch process.

The bit I feel strongly about is that we should not wait on commits to
master except in catastrophic circumstances. I'm happy with
cut-then-verify/triage.

Kenn

On Wed, Jan 16, 2019 at 9:11 AM Scott Wegner <sc...@apache.org> wrote:

> I like the idea of using test greenness to choose a release commit.
> There are a couple of challenges with our current setup:
>
> 1) Post-commits don't run at every commit. The Jenkins jobs are configured
> to run on pushes to master, but (at least some jobs) are serialized to run
> a single Jenkins job instance at a time, and the next run will be at the
> current HEAD, skipping pushes it didn't get to. So it may be
> hard/impossible to find a commit which had all Post-Commit jobs run against
> it.
>
> 2) I don't see an easy way in Jenkins or GitHub to find overall test
> status for a commit across all test jobs which ran. The GitHub history [1]
> seems to only show badges from PR test runs. Perhaps we're missing
> something in our Jenkins job config to publish the status back to GitHub.
> Or, we could import the necessary data into our Community Metrics DB [2]
> and build our own dashboard [3].
>
> So assuming I'm not missing something, +1 to Sam's proposal to
> cut-then-validate since that seems much easier to get started with today.
>
> [1] https://github.com/apache/beam/commits/master
> [2]
> https://github.com/apache/beam/blob/6c2fe17cfdea1be1fdcfb02267894f0d37a671b3/.test-infra/metrics/sync/jenkins/syncjenkins.py#L38
> [3] https://s.apache.org/beam-community-metrics
>
> On Tue, Jan 15, 2019 at 2:47 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>> Since you brought up the entirety of the process, I would suggest to move
>> the release branch cut up like so:
>>
>>  - Decide to release
>>  - Create a new version in JIRA
>>  - Find a recent green commit (according to postcommit)
>>  - Create a release branch from that commit
>>  - Bump the version on master (green PR w/ parent at the green commit)
>>  - Triage release-blocking JIRAs
>>  - ...
>>
>> Notes:
>>
>>  - Choosing postcommit signal to cut means we already have the signal and
>> we aren't tempted to wait on master
>>  - Cutting before triage starts stabilization process ASAP and gives
>> clear signal on the burndown
>>
>> Kenn
>>
>>
>> On Tue, Jan 15, 2019 at 1:25 PM Sam Rohde <sr...@google.com> wrote:
>>
>>> +Boyuan Zhang <bo...@google.com> who is modifying the rc validation
>>> script
>>>
>>> I'm thinking of a small change to the proposed process brought to my
> attention by Boyuan.
>>>
>>> Instead of running the additional validation tests during the rc
>>> validation, run the tests and the proposed process after the release branch
>>> has been cut. A couple of reasons why:
>>>
>>>    - The additional validation tests (PostCommit and PreCommit) don't
>>>    run against the RC and are instead run against the branch. This is
>>>    confusing considering the other tests in the RC validation step are per RC.
>>>    - The additional validation tests are expensive.
>>>
>>> The final release process would look like:
>>>
>>>    - Decide to release
>>>    - Create a new version in JIRA
>>>    - Triage release-blocking issues in JIRA
>>>    - Review release notes in JIRA
>>>    - Create a release branch
>>>    - Verify that a release builds
>>>    - >>> Verify that a release passes its tests <<< (this is where the
>>>    new process would be added)
>>>    - Build/test/fix RCs
>>>    - >>> Fix any issues <<< (all JIRAs created during the new process
>>>    will have to be closed by here)
>>>    - Finalize the release
>>>    - Promote the release
>>>
>>>
>>>
>>>
>>> On Thu, Jan 10, 2019 at 4:32 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>
>>>> What do you think about crowd-sourcing?
>>>>
>>>> 1. Fix Version = 2.10.0
>>>> 2. If assigned, ping ticket and maybe assignee, unassign if unresponsive
>>>> 3. If unassigned, assign it to yourself while thinking about it
>>>> 4. If you can route it a bit closer to someone who might know, great
>>>> 5. If it doesn't look like a blocker (after routing best you can), Fix
>>>> Version = 2.11.0
>>>>
>>>> I think this has enough mutexes that there should be no duplicated work
>>>> if it is followed. And every step is a standard use of Fix Version and
>>>> Assignee, so there's not really any special policy needed.
>>>>
>>>> Kenn
>>>>
>>>> On Thu, Jan 10, 2019 at 4:25 PM Mikhail Gryzykhin <mi...@google.com>
>>>> wrote:
>>>>
>>>>> +1
>>>>>
>>>>> We should be cautious when enabling this policy, though. We have a
>>>>> decent backlog of bugs that we need to plumb through.
>>>>>
>>>>> --Mikhail
>>>>>
>>>>> Have feedback <http://go/migryz-feedback>?
>>>>>
>>>>>
>>>>> On Thu, Jan 10, 2019 at 11:44 AM Scott Wegner <sc...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> +1, this sounds good to me.
>>>>>>
>>>>>> I believe the next step would be to open a PR to add this to the
>>>>>> release guide:
>>>>>> https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md
>>>>>>
>>>>>> On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <sr...@google.com> wrote:
>>>>>>
>>>>>>> Cool, thanks for all of the replies. Does this summary sound
>>>>>>> reasonable?
>>>>>>>
>>>>>>> *Problem:* there are a number of failing tests (including flaky ones)
>>>>>>> that don't get looked at, and the test suites aren't necessarily green
>>>>>>> upon cutting a new Beam release.
>>>>>>>
>>>>>>> *Proposed Solution:*
>>>>>>>
>>>>>>>    - Add all tests to the release validation
>>>>>>>    - For all failing tests (including flaky) create a JIRA attached
>>>>>>>    to the Beam release and add to the "test-failures" component*
>>>>>>>    - If a test is continuously failing
>>>>>>>          - fix it
>>>>>>>          - add fix to release
>>>>>>>          - close out JIRA
>>>>>>>       - If a test is flaky
>>>>>>>          - try and fix it
>>>>>>>          - If fixed
>>>>>>>             - add fix to release
>>>>>>>             - close out JIRA
>>>>>>>          - else
>>>>>>>             - manually test it
>>>>>>>             - modify "Fix Version" to next release
>>>>>>>          - The release validation can continue when all JIRAs are
>>>>>>>    closed out.
>>>>>>>
>>>>>>> *Why this is an improvement:*
>>>>>>>
>>>>>>>    - Ensures that every test is a valid signal (as opposed to
>>>>>>>    disabling failing tests)
>>>>>>>    - Creates an incentive to automate tests (no longer on the hook
>>>>>>>    to manually test)
>>>>>>>    - Creates a forcing-function to fix flaky tests (once fixed, no
>>>>>>>    longer needs to be manually tested)
>>>>>>>    - Ensures that every failing test gets looked at
>>>>>>>
>>>>>>> *Why this may not be an improvement:*
>>>>>>>
>>>>>>>    - More effort for release validation
>>>>>>>    - May slow down release velocity
>>>>>>>
>>>>>>> * for brevity, it might be better to create a JIRA per component
>>>>>>> containing a summary of failing tests
>>>>>>>
>>>>>>>
>>>>>>> -Sam
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <al...@google.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <ke...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <sc...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> For reference, there are currently 34 unresolved JIRA issues
>>>>>>>>>> under the test-failures component [1].
>>>>>>>>>>
>>>>>>>>>> [1]
>>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> And there are 19 labeled with flake or sickbay:
>>>>>>>>> https://issues.apache.org/jira/issues/?filter=12343195
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> This is a good idea. Some suggestions:
>>>>>>>>>>> - It would be nice if we could figure out a process to act on flaky
>>>>>>>>>>> tests more frequently than releases.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Any ideas? We could just have some cadence and try to establish
>>>>>>>>> the practice of having a deflake thread every couple of weeks? How about we
>>>>>>>>> add it to release verification as a first step and then continue to discuss?
>>>>>>>>>
>>>>>>>>
>>>>>>>> Sounds great. I do not know JIRA well enough, but I am hoping that a
>>>>>>>> solution can come in the form of tooling. If we could configure JIRA with
>>>>>>>> SLOs per issue type, we could have customized reports on which issues are
>>>>>>>> not getting enough attention and then balance the load among us.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> - Another improvement in the process would be having actual owners
>>>>>>>>>>> of issues rather than auto assigned component owners. A few folks have 100+
>>>>>>>>>>> assigned issues. Unassigning those issues, and finding owners who would
>>>>>>>>>>> have time to work on identified flaky tests would be helpful.
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> Yikes. Two issues here:
>>>>>>>>>
>>>>>>>>>  - sounds like Jira component owners aren't really working for us
>>>>>>>>> as a first point of contact for triage
>>>>>>>>>  - a person shouldn't really have more than 5 Jira assigned, or if
>>>>>>>>> you get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>>>>>>>>
>>>>>>>>> Maybe this is one or two separate threads?
>>>>>>>>>
>>>>>>>>
>>>>>>>> I can fork this to another thread. I think both issues are related
>>>>>>>> because component owners are more likely to be in this situation. I agree
>>>>>>>> with the assessment of the two issues.
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I love this idea. It can easily feel like bugs filed for
>>>>>>>>>>>> Jenkins flakes/failures just get lost if there is no process for looking
>>>>>>>>>>>> them over regularly.
>>>>>>>>>>>>
>>>>>>>>>>>> I would suggest that test failures / flakes all get filed with
>>>>>>>>>>>> Fix Version = whatever release is next. Then at release time we can triage
>>>>>>>>>>>> the list, making sure none might be a symptom of something that should
>>>>>>>>>>>> block the release. One modification to your proposal is that after manual
>>>>>>>>>>>> verification that it is safe to release I would move Fix Version to the
>>>>>>>>>>>> next release instead of closing, unless the issue really is fixed or
>>>>>>>>>>>> otherwise not reproducible.
>>>>>>>>>>>>
>>>>>>>>>>>> For automation, I wonder if there's something automatic already
>>>>>>>>>>>> available somewhere that would:
>>>>>>>>>>>>
>>>>>>>>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>>>>>>>>  - be *very* careful to try to find an existing bug, else it
>>>>>>>>>>>> will be spam
>>>>>>>>>>>>  - file bugs to "test-failures" component
>>>>>>>>>>>>  - set Fix Version to the "next" - right now we have 2.7.1
>>>>>>>>>>>> (LTS), 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) so need
>>>>>>>>>>>> the smarts to choose 2.11.0
>>>>>>>>>>>>
>>>>>>>>>>>> If not, I think doing this stuff manually is not that bad,
>>>>>>>>>>>> assuming we can stay fairly green.
>>>>>>>>>>>>
>>>>>>>>>>>> Kenn
>>>>>>>>>>>>
>>>>>>>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are a number of tests in our system that are either
>>>>>>>>>>>>> flaky or permanently red. I am suggesting to add, if not all, then most of
>>>>>>>>>>>>> the tests (style, unit, integration, etc) to the release validation step.
>>>>>>>>>>>>> In this way, we will add a regular cadence to ensuring greenness and no
>>>>>>>>>>>>> flaky tests in Beam.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There are a number of ways of implementing this, but what I
>>>>>>>>>>>>> think might work the best is to set up a process that either manually or
>>>>>>>>>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>>>>>>>>>> component tagged with the release number. The release can then continue
>>>>>>>>>>>>> when all JIRAs are closed by either fixing the failure or manually testing
>>>>>>>>>>>>> to ensure no adverse side effects (this is in case there are environmental
>>>>>>>>>>>>> issues in the testing infrastructure or otherwise).
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks for reading, what do you think?
>>>>>>>>>>>>> - Is there another, easier way to ensure that no test failures
>>>>>>>>>>>>> go unfixed?
>>>>>>>>>>>>> - Can the process be automated?
>>>>>>>>>>>>> - What am I missing?
>>>>>>>>>>>>>
>>>>>>>>>>>>> Regards,
>>>>>>>>>>>>> Sam
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>>>
>>>>>
>
> --
>
>
>
>
> Got feedback? tinyurl.com/swegner-feedback
>

Re: Add all tests to release validation

Posted by Scott Wegner <sc...@apache.org>.
I like the idea of using test greenness to choose a release commit. There
are a couple of challenges with our current setup:

1) Post-commits don't run at every commit. The Jenkins jobs are configured
to run on pushes to master, but (at least some jobs) are serialized to run
a single Jenkins job instance at a time, and the next run will be at the
current HEAD, skipping pushes it didn't get to. So it may be
hard/impossible to find a commit which had all Post-Commit jobs run against
it.

2) I don't see an easy way in Jenkins or GitHub to find overall test status
for a commit across all test jobs which ran. The GitHub history [1] seems
to only show badges from PR test runs. Perhaps we're missing something in
our Jenkins job config to publish the status back to GitHub. Or, we could
import the necessary data into our Community Metrics DB [2] and build our
own dashboard [3].

So assuming I'm not missing something, +1 to Sam's proposal to
cut-then-validate since that seems much easier to get started with today.

[1] https://github.com/apache/beam/commits/master
[2]
https://github.com/apache/beam/blob/6c2fe17cfdea1be1fdcfb02267894f0d37a671b3/.test-infra/metrics/sync/jenkins/syncjenkins.py#L38
[3] https://s.apache.org/beam-community-metrics
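
For what it's worth, the raw data for (2) is reachable. A rough sketch of
pulling each build's result plus the SHA it built from the Jenkins JSON API;
the job URL is illustrative and anonymous read access is assumed:

    import requests

    def results_by_commit(job_url):
        # Each build's result plus the git SHA it built (Git plugin data).
        tree = "builds[number,result,actions[lastBuiltRevision[SHA1]]]"
        data = requests.get(job_url + "/api/json", params={"tree": tree}).json()
        for build in data["builds"]:
            sha = next((a["lastBuiltRevision"]["SHA1"]
                        for a in build.get("actions", [])
                        if a and "lastBuiltRevision" in a), None)
            yield sha, build["result"]  # e.g. ("1a2b3c...", "SUCCESS")

    # results_by_commit("https://builds.apache.org/job/beam_PostCommit_Java")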

On Tue, Jan 15, 2019 at 2:47 PM Kenneth Knowles <ke...@apache.org> wrote:

> Since you brought up the entirety of the process, I would suggest to move
> the release branch cut up like so:
>
>  - Decide to release
>  - Create a new version in JIRA
>  - Find a recent green commit (according to postcommit)
>  - Create a release branch from that commit
>  - Bump the version on master (green PR w/ parent at the green commit)
>  - Triage release-blocking JIRAs
>  - ...
>
> Notes:
>
>  - Choosing postcommit signal to cut means we already have the signal and
> we aren't tempted to wait on master
>  - Cutting before triage starts stabilization process ASAP and gives clear
> signal on the burndown
>
> Kenn
>
>
> On Tue, Jan 15, 2019 at 1:25 PM Sam Rohde <sr...@google.com> wrote:
>
>> +Boyuan Zhang <bo...@google.com> who is modifying the rc validation
>> script
>>
>> I'm thinking of a small change to the proposed process brought to my
>> attention by Boyuan.
>>
>> Instead of running the additional validation tests during the rc
>> validation, run the tests and the proposed process after the release branch
>> has been cut. A couple of reasons why:
>>
>>    - The additional validation tests (PostCommit and PreCommit) don't
>>    run against the RC and are instead run against the branch. This is
>>    confusing considering the other tests in the RC validation step are per RC.
>>    - The additional validation tests are expensive.
>>
>> The final release process would look like:
>>
>>    - Decide to release
>>    - Create a new version in JIRA
>>    - Triage release-blocking issues in JIRA
>>    - Review release notes in JIRA
>>    - Create a release branch
>>    - Verify that a release builds
>>    - >>> Verify that a release passes its tests <<< (this is where the
>>    new process would be added)
>>    - Build/test/fix RCs
>>    - >>> Fix any issues <<< (all JIRAs created during the new process
>>    will have to be closed by here)
>>    - Finalize the release
>>    - Promote the release
>>
>>
>>
>>
>> On Thu, Jan 10, 2019 at 4:32 PM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>> What do you think about crowd-sourcing?
>>>
>>> 1. Fix Version = 2.10.0
>>> 2. If assigned, ping ticket and maybe assignee, unassign if unresponsive
>>> 3. If unassigned, assign it to yourself while thinking about it
>>> 4. If you can route it a bit closer to someone who might know, great
>>> 5. If it doesn't look like a blocker (after routing best you can), Fix
>>> Version = 2.11.0
>>>
>>> I think this has enough mutexes that there should be no duplicated work
>>> if it is followed. And every step is a standard use of Fix Version and
>>> Assignee, so there's not really any special policy needed.
>>>
>>> Kenn
>>>
>>> On Thu, Jan 10, 2019 at 4:25 PM Mikhail Gryzykhin <mi...@google.com>
>>> wrote:
>>>
>>>> +1
>>>>
>>>> We should be cautious when enabling this policy, though. We have a
>>>> decent backlog of bugs that we need to plumb through.
>>>>
>>>> --Mikhail
>>>>
>>>> Have feedback <http://go/migryz-feedback>?
>>>>
>>>>
>>>> On Thu, Jan 10, 2019 at 11:44 AM Scott Wegner <sc...@apache.org> wrote:
>>>>
>>>>> +1, this sounds good to me.
>>>>>
>>>>> I believe the next step would be to open a PR to add this to the
>>>>> release guide:
>>>>> https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md
>>>>>
>>>>> On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <sr...@google.com> wrote:
>>>>>
>>>>>> Cool, thanks for all of the replies. Does this summary sound
>>>>>> reasonable?
>>>>>>
>>>>>> *Problem:* there are a number of failing tests (including flaky ones)
>>>>>> that don't get looked at, and the test suites aren't necessarily green
>>>>>> upon cutting a new Beam release.
>>>>>>
>>>>>> *Proposed Solution:*
>>>>>>
>>>>>>    - Add all tests to the release validation
>>>>>>    - For all failing tests (including flaky) create a JIRA attached
>>>>>>    to the Beam release and add to the "test-failures" component*
>>>>>>    - If a test is continuously failing
>>>>>>          - fix it
>>>>>>          - add fix to release
>>>>>>          - close out JIRA
>>>>>>       - If a test is flaky
>>>>>>          - try and fix it
>>>>>>          - If fixed
>>>>>>             - add fix to release
>>>>>>             - close out JIRA
>>>>>>          - else
>>>>>>             - manually test it
>>>>>>             - modify "Fix Version" to next release
>>>>>>          - The release validation can continue when all JIRAs are
>>>>>>    closed out.
>>>>>>
>>>>>> *Why this is an improvement:*
>>>>>>
>>>>>>    - Ensures that every test is a valid signal (as opposed to
>>>>>>    disabling failing tests)
>>>>>>    - Creates an incentive to automate tests (no longer on the hook
>>>>>>    to manually test)
>>>>>>    - Creates a forcing-function to fix flaky tests (once fixed, no
>>>>>>    longer needs to be manually tested)
>>>>>>    - Ensures that every failing test gets looked at
>>>>>>
>>>>>> *Why this may not be an improvement:*
>>>>>>
>>>>>>    - More effort for release validation
>>>>>>    - May slow down release velocity
>>>>>>
>>>>>> * for brevity, it might be better to create a JIRA per component
>>>>>> containing a summary of failing tests
>>>>>>
>>>>>>
>>>>>> -Sam
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <ke...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <sc...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> For reference, there are currently 34 unresolved JIRA issues under
>>>>>>>>> the test-failures component [1].
>>>>>>>>>
>>>>>>>>> [1]
>>>>>>>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>
>>>>>>>>
>>>>>>>> And there are 19 labeled with flake or sickbay:
>>>>>>>> https://issues.apache.org/jira/issues/?filter=12343195
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This is a good idea. Some suggestions:
>>>>>>>>>> - It would be nice if we could figure out a process to act on flaky
>>>>>>>>>> tests more frequently than releases.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> Any ideas? We could just have some cadence and try to establish the
>>>>>>>> practice of having a deflake thread every couple of weeks? How about we add
>>>>>>>> it to release verification as a first step and then continue to discuss?
>>>>>>>>
>>>>>>>
>>>>>>> Sounds great. I do not know JIRA well enough, but I am hoping that a
>>>>>>> solution can come in the form of tooling. If we could configure JIRA with
>>>>>>> SLOs per issue type, we could have customized reports on which issues are
>>>>>>> not getting enough attention and then balance the load among us.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> - Another improvement in the process would be having actual owners
>>>>>>>>>> of issues rather than auto assigned component owners. A few folks have 100+
>>>>>>>>>> assigned issues. Unassigning those issues, and finding owners who would
>>>>>>>>>> have time to work on identified flaky tests would be helpful.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>> Yikes. Two issues here:
>>>>>>>>
>>>>>>>>  - sounds like Jira component owners aren't really working for us
>>>>>>>> as a first point of contact for triage
>>>>>>>>  - a person shouldn't really have more than 5 Jira assigned, or if
>>>>>>>> you get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>>>>>>>
>>>>>>>> Maybe this is one or two separate threads?
>>>>>>>>
>>>>>>>
>>>>>>> I can fork this to another thread. I think both issues are related
>>>>>>> because component owners are more likely to be in this situation. I agree
>>>>>>> with the assessment of the two issues.
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>>>>>>>>>> flakes/failures just get lost if there is no process for looking them over
>>>>>>>>>>> regularly.
>>>>>>>>>>>
>>>>>>>>>>> I would suggest that test failures / flakes all get filed with
>>>>>>>>>>> Fix Version = whatever release is next. Then at release time we can triage
>>>>>>>>>>> the list, making sure none might be a symptom of something that should
>>>>>>>>>>> block the release. One modification to your proposal is that after manual
>>>>>>>>>>> verification that it is safe to release I would move Fix Version to the
>>>>>>>>>>> next release instead of closing, unless the issue really is fixed or
>>>>>>>>>>> otherwise not reproducible.
>>>>>>>>>>>
>>>>>>>>>>> For automation, I wonder if there's something automatic already
>>>>>>>>>>> available somewhere that would:
>>>>>>>>>>>
>>>>>>>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>>>>>>>  - be *very* careful to try to find an existing bug, else it
>>>>>>>>>>> will be spam
>>>>>>>>>>>  - file bugs to "test-failures" component
>>>>>>>>>>>  - set Fix Version to the "next" - right now we have 2.7.1
>>>>>>>>>>> (LTS), 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) so need
>>>>>>>>>>> the smarts to choose 2.11.0
>>>>>>>>>>>
>>>>>>>>>>> If not, I think doing this stuff manually is not that bad,
>>>>>>>>>>> assuming we can stay fairly green.
>>>>>>>>>>>
>>>>>>>>>>> Kenn
>>>>>>>>>>>
>>>>>>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi All,
>>>>>>>>>>>>
>>>>>>>>>>>> There are a number of tests in our system that are either flaky
>>>>>>>>>>>> or permanently red. I am suggesting to add, if not all, then most of the
>>>>>>>>>>>> tests (style, unit, integration, etc) to the release validation step. In
>>>>>>>>>>>> this way, we will add a regular cadence to ensuring greenness and no flaky
>>>>>>>>>>>> tests in Beam.
>>>>>>>>>>>>
>>>>>>>>>>>> There are a number of ways of implementing this, but what I
>>>>>>>>>>>> think might work the best is to set up a process that either manually or
>>>>>>>>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>>>>>>>>> component tagged with the release number. The release can then continue
>>>>>>>>>>>> when all JIRAs are closed by either fixing the failure or manually testing
>>>>>>>>>>>> to ensure no adverse side effects (this is in case there are environmental
>>>>>>>>>>>> issues in the testing infrastructure or otherwise).
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks for reading, what do you think?
>>>>>>>>>>>> - Is there another, easier way to ensure that no test failures
>>>>>>>>>>>> go unfixed?
>>>>>>>>>>>> - Can the process be automated?
>>>>>>>>>>>> - What am I missing?
>>>>>>>>>>>>
>>>>>>>>>>>> Regards,
>>>>>>>>>>>> Sam
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>>
>>>>

-- 




Got feedback? tinyurl.com/swegner-feedback

Re: Add all tests to release validation

Posted by Kenneth Knowles <ke...@apache.org>.
Since you brought up the entirety of the process, I would suggest to move
the release branch cut up like so:

 - Decide to release
 - Create a new version in JIRA
 - Find a recent green commit (according to postcommit)
 - Create a release branch from that commit
 - Bump the version on master (green PR w/ parent at the green commit)
 - Triage release-blocking JIRAs
 - ...

Notes:

 - Choosing postcommit signal to cut means we already have the signal and
we aren't tempted to wait on master
 - Cutting before triage starts stabilization process ASAP and gives clear
signal on the burndown
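
To make "find a recent green commit" concrete, a sketch of the selection step.
It assumes a per-job results_by_commit() helper like the Jenkins API sketch in
Scott's reply above, and that every job actually ran against the candidate
commits, which is not guaranteed today:

    def newest_green_commit(job_urls, recent_shas):
        # recent_shas: newest first, e.g. from `git rev-list -n 50 origin/master`.
        passed = {
            job: {sha for sha, result in results_by_commit(job)
                  if result == "SUCCESS"}
            for job in job_urls
        }
        for sha in recent_shas:
            if all(sha in passed[job] for job in job_urls):
                return sha  # newest commit green across every post-commit job
        return None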

Kenn


On Tue, Jan 15, 2019 at 1:25 PM Sam Rohde <sr...@google.com> wrote:

> +Boyuan Zhang <bo...@google.com> who is modifying the rc validation
> script
>
> I'm thinking of a small change to the proposed process brought to my
> attention by Boyuan.
>
> Instead of running the additional validation tests during the rc
> validation, run the tests and the proposed process after the release branch
> has been cut. A couple of reasons why:
>
>    - The additional validation tests (PostCommit and PreCommit) don't run
>    against the RC and are instead run against the branch. This is confusing
>    considering the other tests in the RC validation step are per RC.
>    - The additional validation tests are expensive.
>
> The final release process would look like:
>
>    - Decide to release
>    - Create a new version in JIRA
>    - Triage release-blocking issues in JIRA
>    - Review release notes in JIRA
>    - Create a release branch
>    - Verify that a release builds
>    - >>> Verify that a release passes its tests <<< (this is where the
>    new process would be added)
>    - Build/test/fix RCs
>    - >>> Fix any issues <<< (all JIRAs created during the new process
>    will have to be closed by here)
>    - Finalize the release
>    - Promote the release
>
>
>
>
> On Thu, Jan 10, 2019 at 4:32 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>> What do you think about crowd-sourcing?
>>
>> 1. Fix Version = 2.10.0
>> 2. If assigned, ping ticket and maybe assignee, unassign if unresponsive
>> 3. If unassigned, assign it to yourself while thinking about it
>> 4. If you can route it a bit closer to someone who might know, great
>> 5. If it doesn't look like a blocker (after routing best you can), Fix
>> Version = 2.11.0
>>
>> I think this has enough mutexes that there should be no duplicated work
>> if it is followed. And every step is a standard use of Fix Version and
>> Assignee, so there's not really any special policy needed.
>>
>> Kenn
>>
>> On Thu, Jan 10, 2019 at 4:25 PM Mikhail Gryzykhin <mi...@google.com>
>> wrote:
>>
>>> +1
>>>
>>> We should be cautious when enabling this policy, though. We have a decent
>>> backlog of bugs that we need to plumb through.
>>>
>>> --Mikhail
>>>
>>> Have feedback <http://go/migryz-feedback>?
>>>
>>>
>>> On Thu, Jan 10, 2019 at 11:44 AM Scott Wegner <sc...@apache.org> wrote:
>>>
>>>> +1, this sounds good to me.
>>>>
>>>> I believe the next step would be to open a PR to add this to the
>>>> release guide:
>>>> https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md
>>>>
>>>> On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <sr...@google.com> wrote:
>>>>
>>>>> Cool, thanks for all of the replies. Does this summary sound
>>>>> reasonable?
>>>>>
>>>>> *Problem:* there are a number of failing tests (including flaky ones) that
>>>>> don't get looked at, and the test suites aren't necessarily green upon
>>>>> cutting a new Beam release.
>>>>>
>>>>> *Proposed Solution:*
>>>>>
>>>>>    - Add all tests to the release validation
>>>>>    - For all failing tests (including flaky) create a JIRA attached
>>>>>    to the Beam release and add to the "test-failures" component*
>>>>>    - If a test is continuously failing
>>>>>          - fix it
>>>>>          - add fix to release
>>>>>          - close out JIRA
>>>>>       - If a test is flaky
>>>>>          - try and fix it
>>>>>          - If fixed
>>>>>             - add fix to release
>>>>>             - close out JIRA
>>>>>          - else
>>>>>             - manually test it
>>>>>             - modify "Fix Version" to next release
>>>>>          - The release validation can continue when all JIRAs are
>>>>>    closed out.
>>>>>
>>>>> *Why this is an improvement:*
>>>>>
>>>>>    - Ensures that every test is a valid signal (as opposed to
>>>>>    disabling failing tests)
>>>>>    - Creates an incentive to automate tests (no longer on the hook to
>>>>>    manually test)
>>>>>    - Creates a forcing-function to fix flaky tests (once fixed, no
>>>>>    longer needs to be manually tested)
>>>>>    - Ensures that every failing test gets looked at
>>>>>
>>>>> *Why this may not be an improvement:*
>>>>>
>>>>>    - More effort for release validation
>>>>>    - May slow down release velocity
>>>>>
>>>>> * for brevity, it might be better to create a JIRA per component
>>>>> containing a summary of failing tests
>>>>>
>>>>>
>>>>> -Sam
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <al...@google.com> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <ke...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <sc...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> For reference, there are currently 34 unresolved JIRA issues under
>>>>>>>> the test-failures component [1].
>>>>>>>>
>>>>>>>> [1]
>>>>>>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>
>>>>>>>
>>>>>>> And there are 19 labeled with flake or sickbay:
>>>>>>> https://issues.apache.org/jira/issues/?filter=12343195
>>>>>>>
>>>>>>>
>>>>>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This is a good idea. Some suggestions:
>>>>>>>>> - It would be nice if we could figure out a process to act on flaky
>>>>>>>>> tests more frequently than releases.
>>>>>>>>>
>>>>>>>>
>>>>>>> Any ideas? We could just have some cadence and try to establish the
>>>>>>> practice of having a deflake thread every couple of weeks? How about we add
>>>>>>> it to release verification as a first step and then continue to discuss?
>>>>>>>
>>>>>>
>>>>>> Sounds great. I do not know JIRA well enough, but I am hoping that a
>>>>>> solution can come in the form of tooling. If we could configure JIRA with
>>>>>> SLOs per issue type, we could have customized reports on which issues are
>>>>>> not getting enough attention and then balance the load among us.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> - Another improvement in the process would be having actual owners
>>>>>>>>> of issues rather than auto assigned component owners. A few folks have 100+
>>>>>>>>> assigned issues. Unassigning those issues, and finding owners who would
>>>>>>>>> have time to work on identified flaky tests would be helpful.
>>>>>>>>>
>>>>>>>>
>>>>>>> Yikes. Two issues here:
>>>>>>>
>>>>>>>  - sounds like Jira component owners aren't really working for us as
>>>>>>> a first point of contact for triage
>>>>>>>  - a person shouldn't really have more than 5 Jira assigned, or if
>>>>>>> you get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>>>>>>
>>>>>>> Maybe this is one or two separate threads?
>>>>>>>
>>>>>>
>>>>>> I can fork this to another thread. I think both issues are related
>>>>>> because component owners are more likely to be in this situation. I agree
>>>>>> with the assessment of the two issues.
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>>>>>>>>> flakes/failures just get lost if there is no process for looking them over
>>>>>>>>>> regularly.
>>>>>>>>>>
>>>>>>>>>> I would suggest that test failures / flakes all get filed with
>>>>>>>>>> Fix Version = whatever release is next. Then at release time we can triage
>>>>>>>>>> the list, making sure none might be a symptom of something that should
>>>>>>>>>> block the release. One modification to your proposal is that after manual
>>>>>>>>>> verification that it is safe to release I would move Fix Version to the
>>>>>>>>>> next release instead of closing, unless the issue really is fixed or
>>>>>>>>>> otherwise not reproducible.
>>>>>>>>>>
>>>>>>>>>> For automation, I wonder if there's something automatic already
>>>>>>>>>> available somewhere that would:
>>>>>>>>>>
>>>>>>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>>>>>>  - be *very* careful to try to find an existing bug, else it will
>>>>>>>>>> be spam
>>>>>>>>>>  - file bugs to "test-failures" component
>>>>>>>>>>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS),
>>>>>>>>>> 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) so need the
>>>>>>>>>> smarts to choose 2.11.0
>>>>>>>>>>
>>>>>>>>>> If not, I think doing this stuff manually is not that bad,
>>>>>>>>>> assuming we can stay fairly green.
>>>>>>>>>>
>>>>>>>>>> Kenn
>>>>>>>>>>
>>>>>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi All,
>>>>>>>>>>>
>>>>>>>>>>> There are a number of tests in our system that are either flaky
>>>>>>>>>>> or permanently red. I am suggesting to add, if not all, then most of the
>>>>>>>>>>> tests (style, unit, integration, etc) to the release validation step. In
>>>>>>>>>>> this way, we will add a regular cadence to ensuring greenness and no flaky
>>>>>>>>>>> tests in Beam.
>>>>>>>>>>>
>>>>>>>>>>> There are a number of ways of implementing this, but what I
>>>>>>>>>>> think might work the best is to set up a process that either manually or
>>>>>>>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>>>>>>>> component tagged with the release number. The release can then continue
>>>>>>>>>>> when all JIRAs are closed by either fixing the failure or manually testing
>>>>>>>>>>> to ensure no adverse side effects (this is in case there are environmental
>>>>>>>>>>> issues in the testing infrastructure or otherwise).
>>>>>>>>>>>
>>>>>>>>>>> Thanks for reading, what do you think?
>>>>>>>>>>> - Is there another, easier way to ensure that no test failures
>>>>>>>>>>> go unfixed?
>>>>>>>>>>> - Can the process be automated?
>>>>>>>>>>> - What am I missing?
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Sam
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>>
>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>
>>>

Re: Add all tests to release validation

Posted by Sam Rohde <sr...@google.com>.
+Boyuan Zhang <bo...@google.com> who is modifying the rc validation script

I'm thinking of a small change to the proposed process brought to my
attention by Boyuan.

Instead of running the additional validation tests during the rc
validation, run the tests and the proposed process after the release branch
has been cut. A couple of reasons why:

   - The additional validation tests (PostCommit and PreCommit) don't run
   against the RC and are instead run against the branch. This is confusing
   considering the other tests in the RC validation step are per RC.
   - The additional validation tests are expensive.

The final release process would look like:

   - Decide to release
   - Create a new version in JIRA
   - Triage release-blocking issues in JIRA
   - Review release notes in JIRA
   - Create a release branch
   - Verify that a release builds
   - >>> Verify that a release passes its tests <<< (this is where the new
   process would be added)
   - Build/test/fix RCs
   - >>> Fix any issues <<< (all JIRAs created during the new process will
   have to be closed by here)
   - Finalize the release
   - Promote the release
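
One nice property of gating on JIRA: the "all JIRAs ... closed by here" step is
mechanically checkable. A sketch, with an illustrative query:

    from jira import JIRA

    def release_gate_clear(version):
        # True once nothing unresolved is still tagged for this release.
        jira = JIRA("https://issues.apache.org/jira")  # anonymous read access
        open_issues = jira.search_issues(
            'project = BEAM AND fixVersion = "%s" AND resolution = Unresolved'
            % version
        )
        return len(open_issues) == 0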




On Thu, Jan 10, 2019 at 4:32 PM Kenneth Knowles <ke...@apache.org> wrote:

> What do you think about crowd-sourcing?
>
> 1. Fix Version = 2.10.0
> 2. If assigned, ping ticket and maybe assignee, unassign if unresponsive
> 3. If unassigned, assign it to yourself while thinking about it
> 4. If you can route it a bit closer to someone who might know, great
> 5. If it doesn't look like a blocker (after routing best you can), Fix
> Version = 2.11.0
>
> I think this has enough mutexes that there should be no duplicated work if
> it is followed. And every step is a standard use of Fix Version and
> Assignee, so there's not really any special policy needed.
>
> Kenn
>
> On Thu, Jan 10, 2019 at 4:25 PM Mikhail Gryzykhin <mi...@google.com>
> wrote:
>
>> +1
>>
>> We should be cautious when enabling this policy, though. We have a decent
>> backlog of bugs that we need to plumb through.
>>
>> --Mikhail
>>
>> Have feedback <http://go/migryz-feedback>?
>>
>>
>> On Thu, Jan 10, 2019 at 11:44 AM Scott Wegner <sc...@apache.org> wrote:
>>
>>> +1, this sounds good to me.
>>>
>>> I believe the next step would be to open a PR to add this to the release
>>> guide:
>>> https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md
>>>
>>> On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <sr...@google.com> wrote:
>>>
>>>> Cool, thanks for all of the replies. Does this summary sound reasonable?
>>>>
>>>> *Problem:* there are a number of failing tests (including flaky ones) that
>>>> don't get looked at, and the test suites aren't necessarily green upon
>>>> cutting a new Beam release.
>>>>
>>>> *Proposed Solution:*
>>>>
>>>>    - Add all tests to the release validation
>>>>    - For all failing tests (including flaky) create a JIRA attached to
>>>>    the Beam release and add to the "test-failures" component*
>>>>    - If a test is continuously failing
>>>>          - fix it
>>>>          - add fix to release
>>>>          - close out JIRA
>>>>       - If a test is flaky
>>>>          - try and fix it
>>>>          - If fixed
>>>>             - add fix to release
>>>>             - close out JIRA
>>>>          - else
>>>>             - manually test it
>>>>             - modify "Fix Version" to next release
>>>>          - The release validation can continue when all JIRAs are
>>>>    closed out.
>>>>
>>>> *Why this is an improvement:*
>>>>
>>>>    - Ensures that every test is a valid signal (as opposed to
>>>>    disabling failing tests)
>>>>    - Creates an incentive to automate tests (no longer on the hook to
>>>>    manually test)
>>>>    - Creates a forcing-function to fix flaky tests (once fixed, no
>>>>    longer needs to be manually tested)
>>>>    - Ensures that every failing test gets looked at
>>>>
>>>> *Why this may not be an improvement:*
>>>>
>>>>    - More effort for release validation
>>>>    - May slow down release velocity
>>>>
>>>> * for brevity, it might be better to create a JIRA per component
>>>> containing a summary of failing tests
>>>>
>>>>
>>>> -Sam
>>>>
>>>>
>>>>
>>>>
>>>> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <sc...@apache.org> wrote:
>>>>>>
>>>>>>> For reference, there are currently 34 unresolved JIRA issues under
>>>>>>> the test-failures component [1].
>>>>>>>
>>>>>>> [1]
>>>>>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>
>>>>>>
>>>>>> And there are 19 labeled with flake or sickbay:
>>>>>> https://issues.apache.org/jira/issues/?filter=12343195
>>>>>>
>>>>>>
>>>>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>>
>>>>>>>> This is a good idea. Some suggestions:
>>>>>>>> - It would be nice if we could figure out a process to act on flaky
>>>>>>>> tests more frequently than releases.
>>>>>>>>
>>>>>>>
>>>>>> Any ideas? We could just have some cadence and try to establish the
>>>>>> practice of having a deflake thread every couple of weeks? How about we add
>>>>>> it to release verification as a first step and then continue to discuss?
>>>>>>
>>>>>
>>>>> Sounds great. I do not know JIRA well enough, but I am hoping that a
>>>>> solution can come in the form of tooling. If we could configure JIRA with
>>>>> SLOs per issue type, we could have customized reports on which issues are
>>>>> not getting enough attention and then balance the load among us.
>>>>>
>>>>>
>>>>>>
>>>>>> - Another improvement in the process would be having actual owners of
>>>>>>>> issues rather than auto assigned component owners. A few folks have 100+
>>>>>>>> assigned issues. Unassigning those issues, and finding owners who would
>>>>>>>> have time to work on identified flaky tests would be helpful.
>>>>>>>>
>>>>>>>
>>>>>> Yikes. Two issues here:
>>>>>>
>>>>>>  - sounds like Jira component owners aren't really working for us as
>>>>>> a first point of contact for triage
>>>>>>  - a person shouldn't really have more than 5 Jira assigned, or if
>>>>>> you get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>>>>>
>>>>>> Maybe this is one or two separate threads?
>>>>>>
>>>>>
>>>>> I can fork this to another thread. I think both issues are related
>>>>> because component owners are more likely to be in this situation. I agree
>>>>> with the assessment of the two issues.
>>>>>
>>>>>
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>>>>>>>> flakes/failures just get lost if there is no process for looking them over
>>>>>>>>> regularly.
>>>>>>>>>
>>>>>>>>> I would suggest that test failures / flakes all get filed with Fix
>>>>>>>>> Version = whatever release is next. Then at release time we can triage the
>>>>>>>>> list, making sure none might be a symptom of something that should block
>>>>>>>>> the release. One modification to your proposal is that after manual
>>>>>>>>> verification that it is safe to release I would move Fix Version to the
>>>>>>>>> next release instead of closing, unless the issue really is fixed or
>>>>>>>>> otherwise not reproducible.
>>>>>>>>>
>>>>>>>>> For automation, I wonder if there's something automatic already
>>>>>>>>> available somewhere that would:
>>>>>>>>>
>>>>>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>>>>>  - be *very* careful to try to find an existing bug, else it will
>>>>>>>>> be spam
>>>>>>>>>  - file bugs to "test-failures" component
>>>>>>>>>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS),
>>>>>>>>> 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) so need the
>>>>>>>>> smarts to choose 2.11.0
>>>>>>>>>
>>>>>>>>> If not, I think doing this stuff manually is not that bad,
>>>>>>>>> assuming we can stay fairly green.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Hi All,
>>>>>>>>>>
>>>>>>>>>> There are a number of tests in our system that are either flaky
>>>>>>>>>> or permanently red. I am suggesting to add, if not all, then most of the
>>>>>>>>>> tests (style, unit, integration, etc) to the release validation step. In
>>>>>>>>>> this way, we will add a regular cadence to ensuring greenness and no flaky
>>>>>>>>>> tests in Beam.
>>>>>>>>>>
>>>>>>>>>> There are a number of ways of implementing this, but what I think
>>>>>>>>>> might work the best is to set up a process that either manually or
>>>>>>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>>>>>>> component tagged with the release number. The release can then continue
>>>>>>>>>> when all JIRAs are closed by either fixing the failure or manually testing
>>>>>>>>>> to ensure no adverse side effects (this is in case there are environmental
>>>>>>>>>> issues in the testing infrastructure or otherwise).
>>>>>>>>>>
>>>>>>>>>> Thanks for reading, what do you think?
>>>>>>>>>> - Is there another, easier way to ensure that no test failures go
>>>>>>>>>> unfixed?
>>>>>>>>>> - Can the process be automated?
>>>>>>>>>> - What am I missing?
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>> Sam
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>>>>
>>>>>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>> Got feedback? tinyurl.com/swegner-feedback
>>>
>>

Re: Add all tests to release validation

Posted by Kenneth Knowles <ke...@apache.org>.
What do you think about crowd-sourcing?

1. Fix Version = 2.10.0
2. If assigned, ping ticket and maybe assignee, unassign if unresponsive
3. If unassigned, assign it to yourself while thinking about it
4. If you can route it a bit closer to someone who might know, great
5. If it doesn't look like a blocker (after routing best you can), Fix
Version = 2.11.0

I think this has enough mutexes that there should be no duplicated work if
it is followed. And every step is a standard use of Fix Version and
Assignee, so there's not really any special policy needed.
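
For anyone grabbing a slice of the queue, a sketch of steps 1-3 with the Python
"jira" client; credentials and wording are illustrative, and steps 4-5 still
need human judgment:

    from jira import JIRA

    jira = JIRA("https://issues.apache.org/jira", basic_auth=("me", "secret"))

    queue = jira.search_issues(
        'project = BEAM AND fixVersion = "2.10.0" AND resolution = Unresolved'
    )
    for issue in queue:
        if issue.fields.assignee is None:
            # Step 3: take it while you think about it.
            jira.assign_issue(issue, "my.username")
        else:
            # Step 2: ping the ticket.
            jira.add_comment(issue, "Release triage: is this still a 2.10.0 blocker?")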

Kenn

On Thu, Jan 10, 2019 at 4:25 PM Mikhail Gryzykhin <mi...@google.com> wrote:

> +1
>
> We should be cautious when enabling this policy, though. We have a decent
> backlog of bugs that we need to plumb through.
>
> --Mikhail
>
> Have feedback <http://go/migryz-feedback>?
>
>
> On Thu, Jan 10, 2019 at 11:44 AM Scott Wegner <sc...@apache.org> wrote:
>
>> +1, this sounds good to me.
>>
>> I believe the next step would be to open a PR to add this to the release
>> guide:
>> https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md
>>
>> On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <sr...@google.com> wrote:
>>
>>> Cool, thanks for all of the replies. Does this summary sound reasonable?
>>>
>>> *Problem:* there are a number of failing tests (including flaky ones) that
>>> don't get looked at, and the test suites aren't necessarily green upon
>>> cutting a new Beam release.
>>>
>>> *Proposed Solution:*
>>>
>>>    - Add all tests to the release validation
>>>    - For all failing tests (including flaky) create a JIRA attached to
>>>    the Beam release and add to the "test-failures" component*
>>>    - If a test is continuously failing
>>>          - fix it
>>>          - add fix to release
>>>          - close out JIRA
>>>       - If a test is flaky
>>>          - try and fix it
>>>          - If fixed
>>>             - add fix to release
>>>             - close out JIRA
>>>          - else
>>>             - manually test it
>>>             - modify "Fix Version" to next release
>>>          - The release validation can continue when all JIRAs are
>>>    closed out.
>>>
>>> *Why this is an improvement:*
>>>
>>>    - Ensures that every test is a valid signal (as opposed to disabling
>>>    failing tests)
>>>    - Creates an incentive to automate tests (no longer on the hook to
>>>    manually test)
>>>    - Creates a forcing-function to fix flaky tests (once fixed, no
>>>    longer needs to be manually tested)
>>>    - Ensures that every failing test gets looked at
>>>
>>> *Why this may not be an improvement:*
>>>
>>>    - More effort for release validation
>>>    - May slow down release velocity
>>>
>>> * for brevity, it might be better to create a JIRA per component
>>> containing a summary of failing tests
>>>
>>>
>>> -Sam
>>>
>>>
>>>
>>>
>>> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <al...@google.com> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>>
>>>>>
>>>>>
>>>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <sc...@apache.org> wrote:
>>>>>
>>>>>> For reference, there are currently 34 unresolved JIRA issues under
>>>>>> the test-failures component [1].
>>>>>>
>>>>>> [1]
>>>>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>
>>>>>
>>>>> And there are 19 labeled with flake or sickbay:
>>>>> https://issues.apache.org/jira/issues/?filter=12343195
>>>>>
>>>>>
>>>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com> wrote:
>>>>>>
>>>>>>> This is a good idea. Some suggestions:
>>>>>>> - It would be nice if we could figure out a process to act on flaky
>>>>>>> tests more frequently than releases.
>>>>>>>
>>>>>>
>>>>> Any ideas? We could just have some cadence and try to establish the
>>>>> practice of having a deflake thread every couple of weeks? How about we add
>>>>> it to release verification as a first step and then continue to discuss?
>>>>>
>>>>
>>>> Sounds great. I do not know JIRA well enough, but I am hoping that a
>>>> solution can come in the form of tooling. If we could configure JIRA with
>>>> SLOs per issue type, we could have customized reports on which issues are
>>>> not getting enough attention and then balance the load among us.
>>>>
>>>>
>>>>>
>>>>> - Another improvement in the process would be having actual owners of
>>>>>>> issues rather than auto assigned component owners. A few folks have 100+
>>>>>>> assigned issues. Unassigning those issues, and finding owners who would
>>>>>>> have time to work on identified flaky tests would be helpful.
>>>>>>>
>>>>>>
>>>>> Yikes. Two issues here:
>>>>>
>>>>>  - sounds like Jira component owners aren't really working for us as a
>>>>> first point of contact for triage
>>>>>  - a person shouldn't really have more than 5 Jira assigned, or if you
>>>>> get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>>>>
>>>>> Maybe this is one or two separate threads?
>>>>>
>>>>
>>>> I can fork this to another thread. I think both issues are related
>>>> because component owners are more likely to be in this situation. I agree
>>>> with the assessment of the two issues.
>>>>
>>>>
>>>>>
>>>>> Kenn
>>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>>>>>>> flakes/failures just get lost if there is no process for looking them over
>>>>>>>> regularly.
>>>>>>>>
>>>>>>>> I would suggest that test failures / flakes all get filed with Fix
>>>>>>>> Version = whatever release is next. Then at release time we can triage the
>>>>>>>> list, making sure none might be a symptom of something that should block
>>>>>>>> the release. One modification to your proposal is that after manual
>>>>>>>> verification that it is safe to release I would move Fix Version to the
>>>>>>>> next release instead of closing, unless the issue really is fixed or
>>>>>>>> otherwise not reproducible.
>>>>>>>>
>>>>>>>> For automation, I wonder if there's something automatic already
>>>>>>>> available somewhere that would:
>>>>>>>>
>>>>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>>>>  - be *very* careful to try to find an existing bug, else it will
>>>>>>>> be spam
>>>>>>>>  - file bugs to "test-failures" component
>>>>>>>>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS),
>>>>>>>> 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) so need the
>>>>>>>> smarts to choose 2.11.0
>>>>>>>>
>>>>>>>> If not, I think doing this stuff manually is not that bad, assuming
>>>>>>>> we can stay fairly green.
>>>>>>>>
>>>>>>>> Kenn
>>>>>>>>
>>>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com> wrote:
>>>>>>>>
>>>>>>>>> Hi All,
>>>>>>>>>
>>>>>>>>> There are a number of tests in our system that are either flaky or
>>>>>>>>> permanently red. I am suggesting to add, if not all, then most of the tests
>>>>>>>>> (style, unit, integration, etc) to the release validation step. In this
>>>>>>>>> way, we will add a regular cadence to ensuring greenness and no flaky tests
>>>>>>>>> in Beam.
>>>>>>>>>
>>>>>>>>> There are a number of ways of implementing this, but what I think
>>>>>>>>> might work the best is to set up a process that either manually or
>>>>>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>>>>>> component tagged with the release number. The release can then continue
>>>>>>>>> when all JIRAs are closed by either fixing the failure or manually testing
>>>>>>>>> to ensure no adverse side effects (this is in case there are environmental
>>>>>>>>> issues in the testing infrastructure or otherwise).
>>>>>>>>>
>>>>>>>>> Thanks for reading, what do you think?
>>>>>>>>> - Is there another, easier way to ensure that no test failures go
>>>>>>>>> unfixed?
>>>>>>>>> - Can the process be automated?
>>>>>>>>> - What am I missing?
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>> Sam
>>>>>>>>>
>>>>>>>>>
>>>>>>
>>>>>> --
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>>>
>>>>>
>>
>> --
>>
>>
>>
>>
>> Got feedback? tinyurl.com/swegner-feedback
>>
>

Re: Add all tests to release validation

Posted by Mikhail Gryzykhin <mi...@google.com>.
+1

We should be cautious when enabling this policy, though. We have a decent
backlog of bugs that we need to plumb through.

--Mikhail

Have feedback <http://go/migryz-feedback>?


On Thu, Jan 10, 2019 at 11:44 AM Scott Wegner <sc...@apache.org> wrote:

> +1, this sounds good to me.
>
> I believe the next step would be to open a PR to add this to the release
> guide:
> https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md
>
> On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <sr...@google.com> wrote:
>
>> Cool, thanks for all of the replies. Does this summary sound reasonable?
>>
>> *Problem:* there are a number of failing tests (including flaky ones) that
>> don't get looked at, and tests aren't necessarily green when a new Beam
>> release is cut.
>>
>> *Proposed Solution:*
>>
>>    - Add all tests to the release validation
>>    - For all failing tests (including flaky ones), create a JIRA attached
>>    to the Beam release and add it to the "test-failures" component*
>>    - If a test is continuously failing:
>>       - fix it
>>       - add the fix to the release
>>       - close out the JIRA
>>    - If a test is flaky:
>>       - try to fix it
>>       - if fixed:
>>          - add the fix to the release
>>          - close out the JIRA
>>       - else:
>>          - manually test it
>>          - modify "Fix Version" to the next release
>>    - The release validation can continue when all JIRAs are closed out.
>>
>> *Why this is an improvement:*
>>
>>    - Ensures that every test is a valid signal (as opposed to disabling
>>    failing tests)
>>    - Creates an incentive to automate tests (no longer on the hook to
>>    manually test)
>>    - Creates a forcing-function to fix flaky tests (once fixed, no
>>    longer needs to be manually tested)
>>    - Ensures that every failing test gets looked at
>>
>> *Why this may not be an improvement:*
>>
>>    - More effort for release validation
>>    - May slow down release velocity
>>
>> * for brevity, it might be better to create a JIRA per component
>> containing a summary of failing tests
>>
>>
>> -Sam
>>
>>
>>
>>
>> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <al...@google.com> wrote:
>>
>>>
>>>
>>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>
>>>>
>>>>
>>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <sc...@apache.org> wrote:
>>>>
>>>>> For reference, there are currently 34 unresolved JIRA issues under the
>>>>> test-failures component [1].
>>>>>
>>>>> [1]
>>>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>
>>>>
>>>> And there are 19 labeled with flake or sickbay:
>>>> https://issues.apache.org/jira/issues/?filter=12343195
>>>>
>>>>
>>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com> wrote:
>>>>>
>>>>>> This is a good idea. Some suggestions:
>>>>>> - It would be nice if we could figure out a process to act on flaky
>>>>>> tests more frequently than releases.
>>>>>>
>>>>>
>>>> Any ideas? We could just have some cadence and try to establish the
>>>> practice of having a deflake thread every couple of weeks? How about we add
>>>> it to release verification as a first step and then continue to discuss?
>>>>
>>>
>>> Sounds great. I do not know JIRA well enough, but I am hoping that a
>>> solution can come in the form of tooling. If we could configure JIRA
>>> with SLOs per issue type, we could have customized reports on which
>>> issues are not getting enough attention and then load-balance the work
>>> among us.
>>>
>>>
>>>>
>>>> - Another improvement in the process would be having actual owners of
>>>>>> issues rather than auto assigned component owners. A few folks have 100+
>>>>>> assigned issues. Unassigning those issues, and finding owners who would
>>>>>> have time to work on identified flaky tests would be helpful.
>>>>>>
>>>>>
>>>> Yikes. Two issues here:
>>>>
>>>>  - sounds like Jira component owners aren't really working for us as a
>>>> first point of contact for triage
>>>>  - a person shouldn't really have more than 5 Jira assigned, or if you
>>>> get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>>>
>>>> Maybe this is one or two separate threads?
>>>>
>>>
>>> I can fork this to another thread. I think both issues are related
>>> because component owners are more likely to be in this situation. I agree
>>> with the assessment of the two issues.
>>>
>>>
>>>>
>>>> Kenn
>>>>
>>>>
>>>>>
>>>>>>
>>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>>>>>> flakes/failures just get lost if there is no process for looking them over
>>>>>>> regularly.
>>>>>>>
>>>>>>> I would suggest that test failures / flakes all get filed with Fix
>>>>>>> Version = whatever release is next. Then at release time we can triage the
>>>>>>> list, making sure none might be a symptom of something that should block
>>>>>>> the release. One modification to your proposal is that after manual
>>>>>>> verification that it is safe to release I would move Fix Version to the
>>>>>>> next release instead of closing, unless the issue really is fixed or
>>>>>>> otherwise not reproducible.
>>>>>>>
>>>>>>> For automation, I wonder if there's something automatic already
>>>>>>> available somewhere that would:
>>>>>>>
>>>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>>>  - be *very* careful to try to find an existing bug, else it will be
>>>>>>> spam
>>>>>>>  - file bugs to "test-failures" component
>>>>>>>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS),
>>>>>>> 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) so need the
>>>>>>> smarts to choose 2.11.0
>>>>>>>
>>>>>>> If not, I think doing this stuff manually is not that bad, assuming
>>>>>>> we can stay fairly green.
>>>>>>>
>>>>>>> Kenn
>>>>>>>
>>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> There are a number of tests in our system that are either flaky or
>>>>>>>> permanently red. I am suggesting to add, if not all, then most of the tests
>>>>>>>> (style, unit, integration, etc) to the release validation step. In this
>>>>>>>> way, we will add a regular cadence to ensuring greenness and no flaky tests
>>>>>>>> in Beam.
>>>>>>>>
>>>>>>>> There are a number of ways of implementing this, but what I think
>>>>>>>> might work the best is to set up a process that either manually or
>>>>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>>>>> component tagged with the release number. The release can then continue
>>>>>>>> when all JIRAs are closed by either fixing the failure or manually testing
>>>>>>>> to ensure no adverse side effects (this is in case there are environmental
>>>>>>>> issues in the testing infrastructure or otherwise).
>>>>>>>>
>>>>>>>> Thanks for reading, what do you think?
>>>>>>>> - Is there another, easier way to ensure that no test failures go
>>>>>>>> unfixed?
>>>>>>>> - Can the process be automated?
>>>>>>>> - What am I missing?
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>> Sam
>>>>>>>>
>>>>>>>>
>>>>>
>>>>> --
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>>
>>>>
>
> --
>
>
>
>
> Got feedback? tinyurl.com/swegner-feedback
>

Re: Add all tests to release validation

Posted by Scott Wegner <sc...@apache.org>.
+1, this sounds good to me.

I believe the next step would be to open a PR to add this to the release
guide:
https://github.com/apache/beam/blob/master/website/src/contribute/release-guide.md

On Wed, Jan 9, 2019 at 12:04 PM Sam Rohde <sr...@google.com> wrote:

> Cool, thanks for all of the replies. Does this summary sound reasonable?
>
> *Problem:* there are a number of failing tests (including flaky ones) that
> don't get looked at, and tests aren't necessarily green when a new Beam
> release is cut.
>
> *Proposed Solution:*
>
>    - Add all tests to the release validation
>    - For all failing tests (including flaky ones), create a JIRA attached
>    to the Beam release and add it to the "test-failures" component*
>    - If a test is continuously failing:
>       - fix it
>       - add the fix to the release
>       - close out the JIRA
>    - If a test is flaky:
>       - try to fix it
>       - if fixed:
>          - add the fix to the release
>          - close out the JIRA
>       - else:
>          - manually test it
>          - modify "Fix Version" to the next release
>    - The release validation can continue when all JIRAs are closed out.
>
> *Why this is an improvement:*
>
>    - Ensures that every test is a valid signal (as opposed to disabling
>    failing tests)
>    - Creates an incentive to automate tests (no longer on the hook to
>    manually test)
>    - Creates a forcing-function to fix flaky tests (once fixed, no longer
>    needs to be manually tested)
>    - Ensures that every failing test gets looked at
>
> *Why this may not be an improvement:*
>
>    - More effort for release validation
>    - May slow down release velocity
>
> * for brevity, it might be better to create a JIRA per component
> containing a summary of failing tests
>
>
> -Sam
>
>
>
>
> On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <al...@google.com> wrote:
>
>>
>>
>> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>>
>>>
>>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <sc...@apache.org> wrote:
>>>
>>>> For reference, there are currently 34 unresolved JIRA issues under the
>>>> test-failures component [1].
>>>>
>>>> [1]
>>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>
>>>
>>> And there are 19 labeled with flake or sickbay:
>>> https://issues.apache.org/jira/issues/?filter=12343195
>>>
>>>
>>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com> wrote:
>>>>
>>>>> This is a good idea. Some suggestions:
>>>>> - It would be nice if we could figure out a process to act on flaky
>>>>> tests more frequently than releases.
>>>>>
>>>>
>>> Any ideas? We could just have some cadence and try to establish the
>>> practice of having a deflake thread every couple of weeks? How about we add
>>> it to release verification as a first step and then continue to discuss?
>>>
>>
>> Sounds great. I do not know JIRA well enough, but I am hoping that a
>> solution can come in the form of tooling. If we could configure JIRA
>> with SLOs per issue type, we could have customized reports on which
>> issues are not getting enough attention and then load-balance the work
>> among us.
>>
>>
>>>
>>> - Another improvement in the process would be having actual owners of
>>>>> issues rather than auto assigned component owners. A few folks have 100+
>>>>> assigned issues. Unassigning those issues, and finding owners who would
>>>>> have time to work on identified flaky tests would be helpful.
>>>>>
>>>>
>>> Yikes. Two issues here:
>>>
>>>  - sounds like Jira component owners aren't really working for us as a
>>> first point of contact for triage
>>>  - a person shouldn't really have more than 5 Jira assigned, or if you
>>> get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>>
>>> Maybe this is one or two separate threads?
>>>
>>
>> I can fork this to another thread. I think both issues are related
>> because component owners are more likely to be in this situation. I agree
>> with the assessment of the two issues.
>>
>>
>>>
>>> Kenn
>>>
>>>
>>>>
>>>>>
>>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>>>>> flakes/failures just get lost if there is no process for looking them over
>>>>>> regularly.
>>>>>>
>>>>>> I would suggest that test failures / flakes all get filed with Fix
>>>>>> Version = whatever release is next. Then at release time we can triage the
>>>>>> list, making sure none might be a symptom of something that should block
>>>>>> the release. One modification to your proposal is that after manual
>>>>>> verification that it is safe to release I would move Fix Version to the
>>>>>> next release instead of closing, unless the issue really is fixed or
>>>>>> otherwise not reproducible.
>>>>>>
>>>>>> For automation, I wonder if there's something automatic already
>>>>>> available somewhere that would:
>>>>>>
>>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>>  - be *very* careful to try to find an existing bug, else it will be
>>>>>> spam
>>>>>>  - file bugs to "test-failures" component
>>>>>>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS),
>>>>>> 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) so need the
>>>>>> smarts to choose 2.11.0
>>>>>>
>>>>>> If not, I think doing this stuff manually is not that bad, assuming
>>>>>> we can stay fairly green.
>>>>>>
>>>>>> Kenn
>>>>>>
>>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com> wrote:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> There are a number of tests in our system that are either flaky or
>>>>>>> permanently red. I am suggesting to add, if not all, then most of the tests
>>>>>>> (style, unit, integration, etc) to the release validation step. In this
>>>>>>> way, we will add a regular cadence to ensuring greenness and no flaky tests
>>>>>>> in Beam.
>>>>>>>
>>>>>>> There are a number of ways of implementing this, but what I think
>>>>>>> might work the best is to set up a process that either manually or
>>>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>>>> component tagged with the release number. The release can then continue
>>>>>>> when all JIRAs are closed by either fixing the failure or manually testing
>>>>>>> to ensure no adverse side effects (this is in case there are environmental
>>>>>>> issues in the testing infrastructure or otherwise).
>>>>>>>
>>>>>>> Thanks for reading, what do you think?
>>>>>>> - Is there another, easier way to ensure that no test failures go
>>>>>>> unfixed?
>>>>>>> - Can the process be automated?
>>>>>>> - What am I missing?
>>>>>>>
>>>>>>> Regards,
>>>>>>> Sam
>>>>>>>
>>>>>>>
>>>>
>>>> --
>>>>
>>>>
>>>>
>>>>
>>>> Got feedback? tinyurl.com/swegner-feedback
>>>>
>>>

-- 




Got feedback? tinyurl.com/swegner-feedback

Re: Add all tests to release validation

Posted by Sam Rohde <sr...@google.com>.
Cool, thanks for all of the replies. Does this summary sound reasonable?

*Problem:* there are a number of failing tests (including flaky ones) that
don't get looked at, and tests aren't necessarily green when a new Beam
release is cut.

*Proposed Solution:*

   - Add all tests to the release validation
   - For all failing tests (including flaky ones), create a JIRA attached to
   the Beam release and add it to the "test-failures" component*
   - If a test is continuously failing:
      - fix it
      - add the fix to the release
      - close out the JIRA
   - If a test is flaky:
      - try to fix it
      - if fixed:
         - add the fix to the release
         - close out the JIRA
      - else:
         - manually test it
         - modify "Fix Version" to the next release
   - The release validation can continue when all JIRAs are closed out.

*Why this is an improvement:*

   - Ensures that every test is a valid signal (as opposed to disabling
   failing tests)
   - Creates an incentive to automate tests (no longer on the hook to
   manually test)
   - Creates a forcing-function to fix flaky tests (once fixed, no longer
   needs to be manually tested)
   - Ensures that every failing test gets looked at

*Why this may not be an improvement:*

   - More effort for release validation
   - May slow down release velocity

* for brevity, it might be better to create a JIRA per component
containing a summary of failing tests
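
The gating step itself could be scripted. A minimal sketch, assuming the
stock JIRA REST search endpoint (the "2.11.0" version string and function
name are illustrative):

import requests

def release_can_proceed(version="2.11.0"):
    # Validation is blocked while any test-failure JIRA filed against
    # this release is still unresolved.
    jql = ('project = BEAM AND fixVersion = "%s" AND '
           'component = test-failures AND resolution = Unresolved' % version)
    resp = requests.get(
        "https://issues.apache.org/jira/rest/api/2/search",
        params={"jql": jql, "maxResults": 0})  # maxResults=0: count only
    return resp.json()["total"] == 0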


-Sam




On Tue, Jan 8, 2019 at 10:25 AM Ahmet Altay <al...@google.com> wrote:

>
>
> On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <ke...@apache.org> wrote:
>
>>
>>
>> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <sc...@apache.org> wrote:
>>
>>> For reference, there are currently 34 unresolved JIRA issues under the
>>> test-failures component [1].
>>>
>>> [1]
>>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>
>>
>> And there are 19 labeled with flake or sickbay:
>> https://issues.apache.org/jira/issues/?filter=12343195
>>
>>
>>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com> wrote:
>>>
>>>> This is a good idea. Some suggestions:
>>>> - It would be nice if we could figure out a process to act on flaky
>>>> tests more frequently than releases.
>>>>
>>>
>> Any ideas? We could just have some cadence and try to establish the
>> practice of having a deflake thread every couple of weeks? How about we add
>> it to release verification as a first step and then continue to discuss?
>>
>
> Sounds great. I do not know JIRA well enough, but I am hoping that a
> solution can come in the form of tooling. If we could configure JIRA with
> SLOs per issue type, we could have customized reports on which issues are
> not getting enough attention and then load-balance the work among us.
>
>
>>
>> - Another improvement in the process would be having actual owners of
>>>> issues rather than auto assigned component owners. A few folks have 100+
>>>> assigned issues. Unassigning those issues, and finding owners who would
>>>> have time to work on identified flaky tests would be helpful.
>>>>
>>>
>> Yikes. Two issues here:
>>
>>  - sounds like Jira component owners aren't really working for us as a
>> first point of contact for triage
>>  - a person shouldn't really have more than 5 Jira assigned, or if you
>> get really loose maybe 20 (I am guilty of having 30 at this moment...)
>>
>> Maybe this is one or two separate threads?
>>
>
> I can fork this to another thread. I think both issues are related because
> component owners are more likely to be in this situation. I agree with the
> assessment of the two issues.
>
>
>>
>> Kenn
>>
>>
>>>
>>>>
>>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>>
>>>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>>>> flakes/failures just get lost if there is no process for looking them over
>>>>> regularly.
>>>>>
>>>>> I would suggest that test failures / flakes all get filed with Fix
>>>>> Version = whatever release is next. Then at release time we can triage the
>>>>> list, making sure none might be a symptom of something that should block
>>>>> the release. One modification to your proposal is that after manual
>>>>> verification that it is safe to release I would move Fix Version to the
>>>>> next release instead of closing, unless the issue really is fixed or
>>>>> otherwise not reproducible.
>>>>>
>>>>> For automation, I wonder if there's something automatic already
>>>>> available somewhere that would:
>>>>>
>>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>>  - be *very* careful to try to find an existing bug, else it will be
>>>>> spam
>>>>>  - file bugs to "test-failures" component
>>>>>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS),
>>>>> 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) so need the
>>>>> smarts to choose 2.11.0
>>>>>
>>>>> If not, I think doing this stuff manually is not that bad, assuming we
>>>>> can stay fairly green.
>>>>>
>>>>> Kenn
>>>>>
>>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com> wrote:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> There are a number of tests in our system that are either flaky or
>>>>>> permanently red. I am suggesting to add, if not all, then most of the tests
>>>>>> (style, unit, integration, etc) to the release validation step. In this
>>>>>> way, we will add a regular cadence to ensuring greenness and no flaky tests
>>>>>> in Beam.
>>>>>>
>>>>>> There are a number of ways of implementing this, but what I think
>>>>>> might work the best is to set up a process that either manually or
>>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>>> component tagged with the release number. The release can then continue
>>>>>> when all JIRAs are closed by either fixing the failure or manually testing
>>>>>> to ensure no adverse side effects (this is in case there are environmental
>>>>>> issues in the testing infrastructure or otherwise).
>>>>>>
>>>>>> Thanks for reading, what do you think?
>>>>>> - Is there another, easier way to ensure that no test failures go
>>>>>> unfixed?
>>>>>> - Can the process be automated?
>>>>>> - What am I missing?
>>>>>>
>>>>>> Regards,
>>>>>> Sam
>>>>>>
>>>>>>
>>>
>>> --
>>>
>>>
>>>
>>>
>>> Got feedback? tinyurl.com/swegner-feedback
>>>
>>

Re: Add all tests to release validation

Posted by Ahmet Altay <al...@google.com>.
On Tue, Jan 8, 2019 at 8:25 AM Kenneth Knowles <ke...@apache.org> wrote:

>
>
> On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <sc...@apache.org> wrote:
>
>> For reference, there are currently 34 unresolved JIRA issues under the
>> test-failures component [1].
>>
>> [1]
>> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>
>
> And there are 19 labeled with flake or sickbay:
> https://issues.apache.org/jira/issues/?filter=12343195
>
>
>> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com> wrote:
>>
>>> This is a good idea. Some suggestions:
>>> - It would be nice if we could figure out a process to act on flaky
>>> tests more frequently than releases.
>>>
>>
> Any ideas? We could just have some cadence and try to establish the
> practice of having a deflake thread every couple of weeks? How about we add
> it to release verification as a first step and then continue to discuss?
>

Sounds great. I do not know JIRA well enough, but I am hoping that a
solution can come in the form of tooling. If we could configure JIRA with
SLOs per issue type, we could have customized reports on which issues are
not getting enough attention and then load-balance the work among us.
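
Even a saved JQL filter would approximate this, e.g. something like
"project = BEAM AND component = test-failures AND resolution = Unresolved
AND updated <= -14d" (the 14-day window is an arbitrary assumption, not an
agreed SLO).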


>
> - Another improvement in the process would be having actual owners of
>>> issues rather than auto assigned component owners. A few folks have 100+
>>> assigned issues. Unassigning those issues, and finding owners who would
>>> have time to work on identified flaky tests would be helpful.
>>>
>>
> Yikes. Two issues here:
>
>  - sounds like Jira component owners aren't really working for us as a
> first point of contact for triage
>  - a person shouldn't really have more than 5 Jira assigned, or if you get
> really loose maybe 20 (I am guilty of having 30 at this moment...)
>
> Maybe this is one or two separate threads?
>

I can fork this to another thread. I think both issues are related because
component owners are more likely to be in this situation. I agree with the
assessment of the two issues.


>
> Kenn
>
>
>>
>>>
>>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>
>>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>>> flakes/failures just get lost if there is no process for looking them over
>>>> regularly.
>>>>
>>>> I would suggest that test failures / flakes all get filed with Fix
>>>> Version = whatever release is next. Then at release time we can triage the
>>>> list, making sure none might be a symptom of something that should block
>>>> the release. One modification to your proposal is that after manual
>>>> verification that it is safe to release I would move Fix Version to the
>>>> next release instead of closing, unless the issue really is fixed or
>>>> otherwise not reproducible.
>>>>
>>>> For automation, I wonder if there's something automatic already
>>>> available somewhere that would:
>>>>
>>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>>  - be *very* careful to try to find an existing bug, else it will be
>>>> spam
>>>>  - file bugs to "test-failures" component
>>>>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS),
>>>> 2.11.0 (next mainline), 3.0.0 (dreamy incompatible ideas) so need the
>>>> smarts to choose 2.11.0
>>>>
>>>> If not, I think doing this stuff manually is not that bad, assuming we
>>>> can stay fairly green.
>>>>
>>>> Kenn
>>>>
>>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> There are a number of tests in our system that are either flaky or
>>>>> permanently red. I am suggesting to add, if not all, then most of the tests
>>>>> (style, unit, integration, etc) to the release validation step. In this
>>>>> way, we will add a regular cadence to ensuring greenness and no flaky tests
>>>>> in Beam.
>>>>>
>>>>> There are a number of ways of implementing this, but what I think
>>>>> might work the best is to set up a process that either manually or
>>>>> automatically creates a JIRA for the failing test and assigns it to a
>>>>> component tagged with the release number. The release can then continue
>>>>> when all JIRAs are closed by either fixing the failure or manually testing
>>>>> to ensure no adverse side effects (this is in case there are environmental
>>>>> issues in the testing infrastructure or otherwise).
>>>>>
>>>>> Thanks for reading, what do you think?
>>>>> - Is there another, easier way to ensure that no test failures go
>>>>> unfixed?
>>>>> - Can the process be automated?
>>>>> - What am I missing?
>>>>>
>>>>> Regards,
>>>>> Sam
>>>>>
>>>>>
>>
>> --
>>
>>
>>
>>
>> Got feedback? tinyurl.com/swegner-feedback
>>
>

Re: Add all tests to release validation

Posted by Kenneth Knowles <ke...@apache.org>.
On Tue, Jan 8, 2019 at 7:52 AM Scott Wegner <sc...@apache.org> wrote:

> For reference, there are currently 34 unresolved JIRA issues under the
> test-failures component [1].
>
> [1]
> https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>

And there are 19 labeled with flake or sickbay:
https://issues.apache.org/jira/issues/?filter=12343195


> On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com> wrote:
>
>> This is a good idea. Some suggestions:
>> - It would be nice if we could figure out a process to act on flaky
>> tests more frequently than releases.
>>
>
Any ideas? We could just have some cadence and try to establish the
practice of having a deflake thread every couple of weeks? How about we add
it to release verification as a first step and then continue to discuss?

- Another improvement in the process would be having actual owners of
>> issues rather than auto assigned component owners. A few folks have 100+
>> assigned issues. Unassigning those issues, and finding owners who would
>> have time to work on identified flaky tests would be helpful.
>>
>
Yikes. Two issues here:

 - sounds like Jira component owners aren't really working for us as a
first point of contact for triage
 - a person shouldn't really have more than 5 Jira assigned, or if you get
really loose maybe 20 (I am guilty of having 30 at this moment...)

Maybe this is one or two separate threads?

Kenn


>
>>
>> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>> I love this idea. It can easily feel like bugs filed for Jenkins
>>> flakes/failures just get lost if there is no process for looking them over
>>> regularly.
>>>
>>> I would suggest that test failures / flakes all get filed with Fix
>>> Version = whatever release is next. Then at release time we can triage the
>>> list, making sure none might be a symptom of something that should block
>>> the release. One modification to your proposal is that after manual
>>> verification that it is safe to release I would move Fix Version to the
>>> next release instead of closing, unless the issue really is fixed or
>>> otherwise not reproducible.
>>>
>>> For automation, I wonder if there's something automatic already
>>> available somewhere that would:
>>>
>>>  - mark the Jenkins build to "Keep This Build Forever"
>>>  - be *very* careful to try to find an existing bug, else it will be spam
>>>  - file bugs to "test-failures" component
>>>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS), 2.11.0
>>> (next mainline), 3.0.0 (dreamy incompatible ideas) so need the smarts to
>>> choose 2.11.0
>>>
>>> If not, I think doing this stuff manually is not that bad, assuming we
>>> can stay fairly green.
>>>
>>> Kenn
>>>
>>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com> wrote:
>>>
>>>> Hi All,
>>>>
>>>> There are a number of tests in our system that are either flaky or
>>>> permanently red. I am suggesting to add, if not all, then most of the tests
>>>> (style, unit, integration, etc) to the release validation step. In this
>>>> way, we will add a regular cadence to ensuring greenness and no flaky tests
>>>> in Beam.
>>>>
>>>> There are a number of ways of implementing this, but what I think might
>>>> work the best is to set up a process that either manually or automatically
>>>> creates a JIRA for the failing test and assigns it to a component tagged
>>>> with the release number. The release can then continue when all JIRAs are
>>>> closed by either fixing the failure or manually testing to ensure no
>>>> adverse side effects (this is in case there are environmental issues in the
>>>> testing infrastructure or otherwise).
>>>>
>>>> Thanks for reading, what do you think?
>>>> - Is there another, easier way to ensure that no test failures go
>>>> unfixed?
>>>> - Can the process be automated?
>>>> - What am I missing?
>>>>
>>>> Regards,
>>>> Sam
>>>>
>>>>
>
> --
>
>
>
>
> Got feedback? tinyurl.com/swegner-feedback
>

Re: Add all tests to release validation

Posted by Scott Wegner <sc...@apache.org>.
+1; this essentially converts flaky automated tests into manual release
tests until the automation gets fixed. It's an improvement over the current
behavior of simply disabling tests, because when tests are disabled the
quality signal is lost. This also creates a stronger incentive to fix
tests: fixing the automation means you're no longer on the hook to run
manual tests.

This could potentially be a significant increase in the validation work for
a release, since each flaky test will need to be manually verified. I think
it's worth it, and will push us to fix flaky tests. For reference, there
are currently 34 unresolved JIRA issues under the test-failures component
[1].

[1]
https://issues.apache.org/jira/browse/BEAM-6280?jql=project%20%3D%20BEAM%20AND%20resolution%20%3D%20Unresolved%20AND%20component%20%3D%20test-failures%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
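
(For reference, [1] decodes to the JQL query: project = BEAM AND
resolution = Unresolved AND component = test-failures ORDER BY priority
DESC, updated DESC.)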

On Mon, Jan 7, 2019 at 4:03 PM Ahmet Altay <al...@google.com> wrote:

> This is a good idea. Some suggestions:
> - It would be nice if we could figure out a process to act on flaky tests
> more frequently than releases.
> - Another improvement in the process would be having actual owners of
> issues rather than auto assigned component owners. A few folks have 100+
> assigned issues. Unassigning those issues, and finding owners who would
> have time to work on identified flaky tests would be helpful.
>
>
> On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>> I love this idea. It can easily feel like bugs filed for Jenkins
>> flakes/failures just get lost if there is no process for looking them over
>> regularly.
>>
>> I would suggest that test failures / flakes all get filed with Fix
>> Version = whatever release is next. Then at release time we can triage the
>> list, making sure none might be a symptom of something that should block
>> the release. One modification to your proposal is that after manual
>> verification that it is safe to release I would move Fix Version to the
>> next release instead of closing, unless the issue really is fixed or
>> otherwise not reproducible.
>>
>> For automation, I wonder if there's something automatic already available
>> somewhere that would:
>>
>>  - mark the Jenkins build to "Keep This Build Forever"
>>  - be *very* careful to try to find an existing bug, else it will be spam
>>  - file bugs to "test-failures" component
>>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS), 2.11.0
>> (next mainline), 3.0.0 (dreamy incompatible ideas) so need the smarts to
>> choose 2.11.0
>>
>> If not, I think doing this stuff manually is not that bad, assuming we
>> can stay fairly green.
>>
>> Kenn
>>
>> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com> wrote:
>>
>>> Hi All,
>>>
>>> There are a number of tests in our system that are either flaky or
>>> permanently red. I am suggesting to add, if not all, then most of the tests
>>> (style, unit, integration, etc) to the release validation step. In this
>>> way, we will add a regular cadence to ensuring greenness and no flaky tests
>>> in Beam.
>>>
>>> There are a number of ways of implementing this, but what I think might
>>> work the best is to set up a process that either manually or automatically
>>> creates a JIRA for the failing test and assigns it to a component tagged
>>> with the release number. The release can then continue when all JIRAs are
>>> closed by either fixing the failure or manually testing to ensure no
>>> adverse side effects (this is in case there are environmental issues in the
>>> testing infrastructure or otherwise).
>>>
>>> Thanks for reading, what do you think?
>>> - Is there another, easier way to ensure that no test failures go
>>> unfixed?
>>> - Can the process be automated?
>>> - What am I missing?
>>>
>>> Regards,
>>> Sam
>>>
>>>

-- 




Got feedback? tinyurl.com/swegner-feedback

Re: Add all tests to release validation

Posted by Ahmet Altay <al...@google.com>.
This is a good idea. Some suggestions:
- It would be nice if we could figure out a process to act on flaky tests
more frequently than releases.
- Another improvement in the process would be having actual owners of
issues rather than auto assigned component owners. A few folks have 100+
assigned issues. Unassigning those issues, and finding owners who would
have time to work on identified flaky tests would be helpful.


On Mon, Jan 7, 2019 at 3:45 PM Kenneth Knowles <ke...@apache.org> wrote:

> I love this idea. It can easily feel like bugs filed for Jenkins
> flakes/failures just get lost if there is no process for looking them over
> regularly.
>
> I would suggest that test failures / flakes all get filed with Fix Version
> = whatever release is next. Then at release time we can triage the list,
> making sure none might be a symptom of something that should block the
> release. One modification to your proposal is that after manual
> verification that it is safe to release I would move Fix Version to the
> next release instead of closing, unless the issue really is fixed or
> otherwise not reproducible.
>
> For automation, I wonder if there's something automatic already available
> somewhere that would:
>
>  - mark the Jenkins build to "Keep This Build Forever"
>  - be *very* careful to try to find an existing bug, else it will be spam
>  - file bugs to "test-failures" component
>  - set Fix Version to the "next" - right now we have 2.7.1 (LTS), 2.11.0
> (next mainline), 3.0.0 (dreamy incompatible ideas) so need the smarts to
> choose 2.11.0
>
> If not, I think doing this stuff manually is not that bad, assuming we can
> stay fairly green.
>
> Kenn
>
> On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com> wrote:
>
>> Hi All,
>>
>> There are a number of tests in our system that are either flaky or
>> permanently red. I am suggesting to add, if not all, then most of the tests
>> (style, unit, integration, etc) to the release validation step. In this
>> way, we will add a regular cadence to ensuring greenness and no flaky tests
>> in Beam.
>>
>> There are a number of ways of implementing this, but what I think might
>> work the best is to set up a process that either manually or automatically
>> creates a JIRA for the failing test and assigns it to a component tagged
>> with the release number. The release can then continue when all JIRAs are
>> closed by either fixing the failure or manually testing to ensure no
>> adverse side effects (this is in case there are environmental issues in the
>> testing infrastructure or otherwise).
>>
>> Thanks for reading, what do you think?
>> - Is there another, easier way to ensure that no test failures go unfixed?
>> - Can the process be automated?
>> - What am I missing?
>>
>> Regards,
>> Sam
>>
>>

Re: Add all tests to release validation

Posted by Kenneth Knowles <ke...@apache.org>.
I love this idea. It can easily feel like bugs filed for Jenkins
flakes/failures just get lost if there is no process for looking them over
regularly.

I would suggest that test failures / flakes all get filed with Fix Version
= whatever release is next. Then at release time we can triage the list,
making sure none might be a symptom of something that should block the
release. One modification to your proposal is that after manual
verification that it is safe to release I would move Fix Version to the
next release instead of closing, unless the issue really is fixed or
otherwise not reproducible.

For automation, I wonder if there's something automatic already available
somewhere that would:

 - mark the Jenkins build to "Keep This Build Forever"
 - be *very* careful to try to find an existing bug, else it will be spam
 - file bugs to "test-failures" component
 - set Fix Version to the "next" - right now we have 2.7.1 (LTS), 2.11.0
(next mainline), 3.0.0 (dreamy incompatible ideas) so need the smarts to
choose 2.11.0
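
Assuming nothing exists off the shelf, the stock JIRA REST API plus
Jenkins' toggleLogKeep endpoint would cover most of this. A rough sketch,
not a working tool; the service-account credentials, helper name, and
match-on-summary dedup heuristic are illustrative assumptions:

import requests

JIRA_API = "https://issues.apache.org/jira/rest/api/2"
AUTH = ("bot-user", "bot-password")  # hypothetical service account

def file_or_find(test_name, build_url, fix_version="2.11.0"):
    # Search first; filing without a dedup check will produce spam.
    jql = ('project = BEAM AND component = test-failures AND '
           'resolution = Unresolved AND summary ~ "%s"' % test_name)
    found = requests.get(JIRA_API + "/search",
                         params={"jql": jql}, auth=AUTH).json()
    if found["total"] > 0:
        return found["issues"][0]["key"]
    # Jenkins: POSTing to <build>/toggleLogKeep flips "Keep This Build
    # Forever" so the failing build's logs survive until triage.
    requests.post(build_url + "/toggleLogKeep", auth=AUTH)
    fields = {
        "project": {"key": "BEAM"},
        "issuetype": {"name": "Bug"},
        "summary": "Flaky/failing test: %s" % test_name,
        "description": "Failing build kept at %s" % build_url,
        "components": [{"name": "test-failures"}],
        "fixVersions": [{"name": fix_version}],  # the "next" mainline
    }
    created = requests.post(JIRA_API + "/issue",
                            json={"fields": fields}, auth=AUTH)
    return created.json()["key"]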

If not, I think doing this stuff manually is not that bad, assuming we can
stay fairly green.

Kenn

On Mon, Jan 7, 2019 at 3:20 PM Sam Rohde <sr...@google.com> wrote:

> Hi All,
>
> There are a number of tests in our system that are either flaky or
> permanently red. I am suggesting to add, if not all, then most of the tests
> (style, unit, integration, etc) to the release validation step. In this
> way, we will add a regular cadence to ensuring greenness and no flaky tests
> in Beam.
>
> There are a number of ways of implementing this, but what I think might
> work the best is to set up a process that either manually or automatically
> creates a JIRA for the failing test and assigns it to a component tagged
> with the release number. The release can then continue when all JIRAs are
> closed by either fixing the failure or manually testing to ensure no
> adverse side effects (this is in case there are environmental issues in the
> testing infrastructure or otherwise).
>
> Thanks for reading, what do you think?
> - Is there another, easier way to ensure that no test failures go unfixed?
> - Can the process be automated?
> - What am I missing?
>
> Regards,
> Sam
>
>