Posted to dev@beam.apache.org by Etienne Chauchot <ec...@apache.org> on 2020/08/04 14:25:28 UTC

Re: Chronically flaky tests

Hi all,

+1 on pinging the assigned person.

For the flakes I know of (ESIO and CassandraIO), they are due to the 
load of the CI server. These IOs are tested using real embedded backends 
because those backends are complex and we need relevant tests.

Countermeasures have been taken (retries inside the tests that are 
sensitive to load, ranges of acceptable values, calls to internal 
backend mechanisms to force a refresh when load prevented the backend 
from doing so ...).
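
For illustration, a minimal Python-flavored sketch of this "retry inside 
the test" idea (the helper and assertion names are hypothetical, not 
actual Beam or IO test code):

import time

def eventually(assertion, attempts=5, delay_secs=2.0):
    """Re-run an assertion a few times so transient CI load does not fail the test."""
    for attempt in range(attempts):
        try:
            assertion()
            return
        except AssertionError:
            if attempt == attempts - 1:
                raise
            time.sleep(delay_secs * (attempt + 1))  # simple linear backoff

# Hypothetical usage: accept a range rather than an exact count, since the
# embedded backend's statistics may lag behind writes under load.
# eventually(lambda: check_size_estimate_in_range(low=900, high=1100))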

I recently got pinged by Ahmet (thanks to him!) about a flakiness that I 
did not see. This seems to me the correct way to go. Systematically 
retrying tests with a CI mechanism or disabling tests seems to me a risky 
workaround that just gets the problem off our minds.

Etienne

On 20/07/2020 20:58, Brian Hulette wrote:
> > I think we are missing a way for checking that we are making 
> progress on P1 issues. For example, P0 issues block releases and this 
> obviously results in fixing/triaging/addressing P0 issues at least 
> every 6 weeks. We do not have a similar process for flaky tests. I do 
> not know what would be a good policy. One suggestion is to ping 
> (email/slack) assignees of issues. I recently missed a flaky issue 
> that was assigned to me. A ping like that would have reminded me. And 
> if an assignee cannot help/does not have the time, we can try to find 
> a new assignee.
>
> Yeah I think this is something we should address. With the new jira 
> automation at least assignees should get an email notification after 
> 30 days because of a jira comment like [1], but that's too long to let 
> a test continue to flake. Could Beam Jira Bot ping every N days for 
> P1s that aren't making progress?
>
> That wouldn't help us with P1s that have no assignee, or are assigned 
> to overloaded people. It seems we'd need some kind of dashboard or 
> report to capture those.
>
> [1] 
> https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
>
> On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <altay@google.com> wrote:
>>
>> Another idea, could we change our "Retest X" phrases with "Retest X (Reason)" phrases? With this change a PR author will have to look at failed test logs. They could catch new flakiness introduced by their PR, file a JIRA for a flakiness that was not noted before, or ping an existing JIRA issue/raise its severity. On the downside this will require PR authors to do more.
>>
>> On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <tysonjh@google.com> wrote:
>>>
>>> Adding retries can be beneficial in two ways, unblocking a PR, and collecting metrics about the flakes.
>>
>> Makes sense. I think we will still need to have a plan to remove retries similar to re-enabling disabled tests.
>>
>>> If we also had a flaky test leaderboard that showed which tests are the most flaky, then we could take action on them. Encouraging someone from the community to fix the flaky test is another issue.
>>>
>>> The test status matrix of tests that is on the GitHub landing page could show flake level to communicate to users which modules are losing a trustable test signal. Maybe this shows up as a flake % or a code coverage % that decreases due to disabled flaky tests.
>>
>> +1 to a dashboard that will show a "leaderboard" of flaky tests.
>>
>>> I didn't look for plugins, just dreaming up some options.
>>>
>>> On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lcwik@google.com> wrote:
>>>>
>>>> What do other Apache projects do to address this issue?
>>>>
>>>> On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <altay@google.com> wrote:
>>>>>
>>>>> I agree with the comments in this thread.
>>>>> - If we are not re-enabling tests back again or we do not have a plan to re-enable them again, disabling tests only provides us temporary relief until eventually users find issues instead of disabled tests.
>>>>> - I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries to avoid flakes is similar to disabling tests. They might hide real issues.
>>>>>
>>>>> I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>>>>>
>>>>> Ahmet
>>>>>
>>>>> On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <valentyn@google.com> wrote:
>>>>>>
>>>>>> I think the original discussion[1] on introducing tenacity might answer that question.
>>>>>>
>>>>>> [1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ruwang@google.com> wrote:
>>>>>>>
>>>>>>> Is there an observation that enabling tenacity improves the development experience on Python SDK? E.g. less wait time to get PR pass and merged? Or it might be a matter of a right number of retry to align with the "flakiness" of a test?
>>>>>>>
>>>>>>> -Rui
>>>>>>>
>>>>>>> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <valentyn@google.com> wrote:
>>>>>>>>
>>>>>>>> We used tenacity[1] to retry some unit tests for which we understood the nature of flakiness.
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <kenn@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Didn't we use something like that flaky retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test. We need a process to remove the retries ASAP though. As Luke says that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lcwik@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I don't think I have seen tests that were previously disabled become re-enabled.
>>>>>>>>>>
>>>>>>>>>> It seems as though we have about ~60 disabled tests in Java and ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features so unrelated to being a flake.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <gleb@spotify.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> There is something called test-retry-gradle-plugin [1]. It retries tests if they fail, and have different modes to handle flaky tests. Did we ever try or consider using it?
>>>>>>>>>>>
>>>>>>>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <gleb@spotify.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I agree with what Ahmet is saying. I can share my perspective, recently I had to retrigger build 6 times due to flaky tests, and each retrigger took one hour of waiting time.
>>>>>>>>>>>>
>>>>>>>>>>>> I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if both fails and succeeds for the same git SHA. Not sure if there is anything we can enable to get this automatically.
>>>>>>>>>>>>
>>>>>>>>>>>> /Gleb
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <altay@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it will be reasonable to disable/sickbay any flaky test that is actively blocking people. Collective cost of flaky tests for such a large group of contributors is very significant.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most of these issues are unassigned. IMO, it makes sense to assign these issues to the most relevant person (who added the test/who generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide valuable signals.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <kenn@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The situation is much worse than that IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Summarized on PRs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The tickets:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are our P1 test flake bugs: https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems quite a few of them are actively hindering people right now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <apilloud@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have two test suites that are responsible for a large percentage of our flaky tests and both have bugs open for about a year without being fixed. These suites are ParDoLifecycleTest (BEAM-8101 <https://issues.apache.org/jira/browse/BEAM-8101>) in Java and BigQueryWriteIntegrationTests in python (py3 BEAM-9484 <https://issues.apache.org/jira/browse/BEAM-9484>, py2 BEAM-9232 <https://issues.apache.org/jira/browse/BEAM-9232>, old duplicate BEAM-8197 <https://issues.apache.org/jira/browse/BEAM-8197>).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrew

Re: Chronically flaky tests

Posted by Robert Bradshaw <ro...@google.com>.
I'm in favor of a quarantine job whose tests are called out
prominently as "possibly broken" in the release notes. As a follow-up,
+1 to exploring better tooling to track at a fine-grained level
exactly how flaky these tests are (and hopefully detect if/when they
go from flaky to just plain broken).

On Tue, Aug 4, 2020 at 7:25 AM Etienne Chauchot <ec...@apache.org> wrote:
>
> Hi all,
>
> +1 on pinging the assigned person.
>
> For the flakes I know of (ESIO and CassandraIO), they are due to the load of the CI server. These IOs are tested using real embedded backends because those backends are complex and we need relevant tests.
>
> Countermeasures have been taken (retries inside the tests that are sensitive to load, ranges of acceptable values, calls to internal backend mechanisms to force a refresh when load prevented the backend from doing so ...).

Yes, certain tests with external dependencies should do their own
internal retries. If that is not sufficient, they should probably be
quarantined.

> I recently got pinged by Ahmet (thanks to him!) about a flakiness that I did not see. This seems to me the correct way to go. Systematically retrying tests with a CI mechanism or disabling tests seems to me a risky workaround that just gets the problem off our minds.
>
> Etienne
>
> On 20/07/2020 20:58, Brian Hulette wrote:
>
> > I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>
> Yeah I think this is something we should address. With the new jira automation at least assignees should get an email notification after 30 days because of a jira comment like [1], but that's too long to let a test continue to flake. Could Beam Jira Bot ping every N days for P1s that aren't making progress?
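
A rough sketch of the kind of query such a bot could run, using the public
Jira REST search endpoint; the JQL mirrors the flake filter linked later in
this thread, while the "priority = P1" value and the 7-day staleness window
are assumptions:

import requests

SEARCH_URL = "https://issues.apache.org/jira/rest/api/2/search"
# Assumed JQL: open BEAM flake issues at P1 that have not been updated for 7 days.
JQL = ("project = BEAM AND labels = flake AND resolution = Unresolved "
       "AND priority = P1 AND updated <= -7d")

resp = requests.get(SEARCH_URL, params={"jql": JQL, "fields": "summary,assignee"})
resp.raise_for_status()
for issue in resp.json().get("issues", []):
    fields = issue["fields"]
    assignee = (fields.get("assignee") or {}).get("displayName", "unassigned")
    print(issue["key"], "-", fields["summary"], "->", assignee)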
>
> That wouldn't help us with P1s that have no assignee, or are assigned to overloaded people. It seems we'd need some kind of dashboard or report to capture those.
>
> [1] https://issues.apache.org/jira/browse/BEAM-8101?focusedCommentId=17121918&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17121918
>
> On Fri, Jul 17, 2020 at 1:09 PM Ahmet Altay <al...@google.com> wrote:
>>
>> Another idea, could we change our "Retest X" phrases with "Retest X (Reason)" phrases? With this change a PR author will have to look at failed test logs. They could catch new flakiness introduced by their PR, file a JIRA for a flakiness that was not noted before, or ping an existing JIRA issue/raise its severity. On the downside this will require PR authors to do more.
>>
>> On Fri, Jul 17, 2020 at 6:46 AM Tyson Hamilton <ty...@google.com> wrote:
>>>
>>> Adding retries can be beneficial in two ways, unblocking a PR, and collecting metrics about the flakes.
>>
>>
>> Makes sense. I think we will still need to have a plan to remove retries similar to re-enabling disabled tests.
>>
>>>
>>>
>>> If we also had a flaky test leaderboard that showed which tests are the most flaky, then we could take action on them. Encouraging someone from the community to fix the flaky test is another issue.
>>>
>>> The test status matrix of tests that is on the GitHub landing page could show flake level to communicate to users which modules are losing a trustable test signal. Maybe this shows up as a flake % or a code coverage % that decreases due to disabled flaky tests.
>>
>>
>> +1 to a dashboard that will show a "leaderboard" of flaky tests.
>>
>>>
>>>
>>> I didn't look for plugins, just dreaming up some options.
>>>
>>>
>>>
>>>
>>> On Thu, Jul 16, 2020, 5:58 PM Luke Cwik <lc...@google.com> wrote:
>>>>
>>>> What do other Apache projects do to address this issue?
>>>>
>>>> On Thu, Jul 16, 2020 at 5:51 PM Ahmet Altay <al...@google.com> wrote:
>>>>>
>>>>> I agree with the comments in this thread.
>>>>> - If we are not re-enabling tests back again or we do not have a plan to re-enable them again, disabling tests only provides us temporary relief until eventually users find issues instead of disabled tests.
>>>>> - I feel similarly about retries. It is reasonable to add retries for reasons we understand. Adding retries to avoid flakes is similar to disabling tests. They might hide real issues.
>>>>>
>>>>> I think we are missing a way for checking that we are making progress on P1 issues. For example, P0 issues block releases and this obviously results in fixing/triaging/addressing P0 issues at least every 6 weeks. We do not have a similar process for flaky tests. I do not know what would be a good policy. One suggestion is to ping (email/slack) assignees of issues. I recently missed a flaky issue that was assigned to me. A ping like that would have reminded me. And if an assignee cannot help/does not have the time, we can try to find a new assignee.
>>>>>
>>>>> Ahmet
>>>>>
>>>>>
>>>>> On Thu, Jul 16, 2020 at 11:52 AM Valentyn Tymofieiev <va...@google.com> wrote:
>>>>>>
>>>>>> I think the original discussion[1] on introducing tenacity might answer that question.
>>>>>>
>>>>>> [1] https://lists.apache.org/thread.html/16060fd7f4d408857a5e4a2598cc96ebac0f744b65bf4699001350af%40%3Cdev.beam.apache.org%3E
>>>>>>
>>>>>> On Thu, Jul 16, 2020 at 10:48 AM Rui Wang <ru...@google.com> wrote:
>>>>>>>
>>>>>>> Is there an observation that enabling tenacity improves the development experience on Python SDK? E.g. less wait time to get PR pass and merged? Or it might be a matter of a right number of retry to align with the "flakiness" of a test?
>>>>>>>
>>>>>>>
>>>>>>> -Rui
>>>>>>>
>>>>>>> On Thu, Jul 16, 2020 at 10:38 AM Valentyn Tymofieiev <va...@google.com> wrote:
>>>>>>>>
>>>>>>>> We used tenacity[1] to retry some unit tests for which we understood the nature of flakiness.
>>>>>>>>
>>>>>>>> [1] https://github.com/apache/beam/blob/3b9aae2bcaeb48ab43a77368ae496edc73634c91/sdks/python/apache_beam/runners/portability/fn_api_runner/fn_runner_test.py#L1156
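
For readers who have not used tenacity, a minimal sketch of the retry
pattern: the decorator re-runs the test a few times and re-raises the last
failure so a genuine regression still fails the build. The test body below
is a stand-in, not actual Beam code.

import random
from tenacity import retry, stop_after_attempt

@retry(reraise=True, stop=stop_after_attempt(3))
def test_understood_flaky_operation():
    # Stand-in for a test whose transient failure mode is understood.
    assert random.random() > 0.3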
>>>>>>>>
>>>>>>>> On Thu, Jul 16, 2020 at 10:25 AM Kenneth Knowles <ke...@apache.org> wrote:
>>>>>>>>>
>>>>>>>>> Didn't we use something like that flaky retry plugin for Python tests at some point? Adding retries may be preferable to disabling the test. We need a process to remove the retries ASAP though. As Luke says that is not so easy to make happen. Having a way to make P1 bugs more visible in an ongoing way may help.
>>>>>>>>>
>>>>>>>>> Kenn
>>>>>>>>>
>>>>>>>>> On Thu, Jul 16, 2020 at 8:57 AM Luke Cwik <lc...@google.com> wrote:
>>>>>>>>>>
>>>>>>>>>> I don't think I have seen tests that were previously disabled become re-enabled.
>>>>>>>>>>
>>>>>>>>>> It seems as though we have about ~60 disabled tests in Java and ~15 in Python. Half of the Java ones seem to be in ZetaSQL/SQL due to missing features so unrelated to being a flake.
>>>>>>>>>>
>>>>>>>>>> On Thu, Jul 16, 2020 at 8:49 AM Gleb Kanterov <gl...@spotify.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> There is something called test-retry-gradle-plugin [1]. It retries tests if they fail, and have different modes to handle flaky tests. Did we ever try or consider using it?
>>>>>>>>>>>
>>>>>>>>>>> [1]: https://github.com/gradle/test-retry-gradle-plugin
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Jul 16, 2020 at 1:15 PM Gleb Kanterov <gl...@spotify.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> I agree with what Ahmet is saying. I can share my perspective, recently I had to retrigger build 6 times due to flaky tests, and each retrigger took one hour of waiting time.
>>>>>>>>>>>>
>>>>>>>>>>>> I've seen examples of automatic tracking of flaky tests, where a test is considered flaky if both fails and succeeds for the same git SHA. Not sure if there is anything we can enable to get this automatically.
>>>>>>>>>>>>
>>>>>>>>>>>> /Gleb
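
A minimal sketch of that detection rule: classify a test as flaky when the
same test name has both passing and failing runs at the same commit. The
sample data below is made up for illustration.

from collections import defaultdict

# (test name, git SHA, passed) tuples, e.g. scraped from CI history.
runs = [
    ("CassandraIOTest.testEstimatedSizeBytes", "3b9aae2", False),
    ("CassandraIOTest.testEstimatedSizeBytes", "3b9aae2", True),
    ("FnHarnessTest.testLaunch", "3b9aae2", True),
]

outcomes = defaultdict(set)
for test, sha, passed in runs:
    outcomes[(test, sha)].add(passed)

flaky = sorted({test for (test, _sha), seen in outcomes.items() if seen == {True, False}})
print(flaky)  # -> ['CassandraIOTest.testEstimatedSizeBytes']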
>>>>>>>>>>>>
>>>>>>>>>>>> On Thu, Jul 16, 2020 at 2:33 AM Ahmet Altay <al...@google.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I think it will be reasonable to disable/sickbay any flaky test that is actively blocking people. Collective cost of flaky tests for such a large group of contributors is very significant.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Most of these issues are unassigned. IMO, it makes sense to assign these issues to the most relevant person (who added the test/who generally maintains those components). Those people can either fix and re-enable the tests, or remove them if they no longer provide valuable signals.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ahmet
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:55 PM Kenneth Knowles <ke...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The situation is much worse than that IMO. My experience of the last few days is that a large portion of time went to *just connecting failing runs with the corresponding Jira tickets or filing new ones*.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Summarized on PRs:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12272#issuecomment-659050891
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12273#issuecomment-659070317
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12225#issuecomment-656973073
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12225#issuecomment-657743373
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12224#issuecomment-657744481
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12216#issuecomment-657735289
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12216#issuecomment-657780781
>>>>>>>>>>>>>>  - https://github.com/apache/beam/pull/12216#issuecomment-657799415
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The tickets:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10460 SparkPortableExecutionTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10471 CassandraIOTest > testEstimatedSizeBytes
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10504 ElasticSearchIOTest > testWriteFullAddressing and testWriteWithIndexFn
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10470 JdbcDriverTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-8025 CassandraIOTest > @BeforeClass (classmethod)
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-8454 FnHarnessTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10506 SplunkEventWriterTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-10472 direct runner ParDoLifecycleTest
>>>>>>>>>>>>>>  - https://issues.apache.org/jira/browse/BEAM-9187 DefaultJobBundleFactoryTest
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are our P1 test flake bugs: https://issues.apache.org/jira/issues/?jql=project%20%3D%20BEAM%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22)%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20flake%20ORDER%20BY%20priority%20DESC%2C%20updated%20DESC
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It seems quite a few of them are actively hindering people right now.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Kenn
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jul 15, 2020 at 4:23 PM Andrew Pilloud <ap...@google.com> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> We have two test suites that are responsible for a large percentage of our flaky tests and  both have bugs open for about a year without being fixed. These suites are ParDoLifecycleTest (BEAM-8101) in Java and BigQueryWriteIntegrationTests in python (py3 BEAM-9484, py2 BEAM-9232, old duplicate BEAM-8197).
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Are there any volunteers to look into these issues? What can we do to mitigate the flakiness until someone has time to investigate?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Andrew