Posted to dev@beam.apache.org by be...@gmail.com on 2022/09/14 10:02:46 UTC

Beam High Priority Issue Report (54)

This is your daily summary of Beam's current high priority issues that may need attention.

    See https://beam.apache.org/contribute/issue-priorities for the meaning and expectations around issue priorities.

Unassigned P1 Issues:

https://github.com/apache/beam/issues/23227 [Bug]: Python SDK installation cannot generate proto with protobuf 3.20.2
https://github.com/apache/beam/issues/23179 [Bug]: Parquet size exploded for no apparent reason
https://github.com/apache/beam/issues/22913 [Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakey
https://github.com/apache/beam/issues/22303 [Task]: Add tests to Kafka SDF and fix known and discovered issues
https://github.com/apache/beam/issues/22299 [Bug]: JDBCIO Write freeze at getConnection() in WriteFn
https://github.com/apache/beam/issues/21794 Dataflow runner creates a new timer whenever the output timestamp is change
https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get output to Failed Inserts PCollection
https://github.com/apache/beam/issues/21704 beam_PostCommit_Java_DataflowV2 failures parent bug
https://github.com/apache/beam/issues/21701 beam_PostCommit_Java_DataflowV1 failing with a variety of flakes and errors
https://github.com/apache/beam/issues/21700 --dataflowServiceOptions=use_runner_v2 is broken
https://github.com/apache/beam/issues/21696 Flink Tests failure :  java.lang.NoClassDefFoundError: Could not initialize class org.apache.beam.runners.core.construction.SerializablePipelineOptions 
https://github.com/apache/beam/issues/21695 DataflowPipelineResult does not raise exception for unsuccessful states.
https://github.com/apache/beam/issues/21694 BigQuery Storage API insert with writeResult retry and write to error table
https://github.com/apache/beam/issues/21480 flake: FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
https://github.com/apache/beam/issues/21472 Dataflow streaming tests failing new AfterSynchronizedProcessingTime test
https://github.com/apache/beam/issues/21471 Flakes: Failed to load cache entry
https://github.com/apache/beam/issues/21470 Test flake: test_split_half_sdf
https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink flaky: Connection refused
https://github.com/apache/beam/issues/21468 beam_PostCommit_Python_Examples_Dataflow failing
https://github.com/apache/beam/issues/21467 GBK and CoGBK streaming Java load tests failing
https://github.com/apache/beam/issues/21465 Kafka commit offset drop data on failure for runners that have non-checkpointing shuffle
https://github.com/apache/beam/issues/21463 NPE in Flink Portable ValidatesRunner streaming suite
https://github.com/apache/beam/issues/21462 Flake in org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in use
https://github.com/apache/beam/issues/21271 pubsublite.ReadWriteIT flaky in beam_PostCommit_Java_DataflowV2  
https://github.com/apache/beam/issues/21270 org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView flaky on Dataflow Runner V2
https://github.com/apache/beam/issues/21267 WriteToBigQuery submits a duplicate BQ load job if a 503 error code is returned from googleapi
https://github.com/apache/beam/issues/21266 org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful is flaky in Java ValidatesRunner Flink suite.
https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do not follow spec
https://github.com/apache/beam/issues/21261 org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer is flaky
https://github.com/apache/beam/issues/21260 Python DirectRunner does not emit data at GC time
https://github.com/apache/beam/issues/21257 Either Create or DirectRunner fails to produce all elements to the following transform
https://github.com/apache/beam/issues/21123 Multiple jobs running on Flink session cluster reuse the persistent Python environment.
https://github.com/apache/beam/issues/21121 apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it flakey
https://github.com/apache/beam/issues/21118 PortableRunnerTestWithExternalEnv.test_pardo_timers flaky
https://github.com/apache/beam/issues/21114 Already Exists: Dataset apache-beam-testing:python_bq_file_loads_NNN
https://github.com/apache/beam/issues/21113 testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky
https://github.com/apache/beam/issues/21111 Java creates an incorrect pipeline proto when core-construction-java jar is not in the CLASSPATH
https://github.com/apache/beam/issues/21109 SDF BoundedSource seems to execute significantly slower than 'normal' BoundedSource
https://github.com/apache/beam/issues/20981 Python precommit flaky: Failed to read inputs in the data plane
https://github.com/apache/beam/issues/20977 SamzaStoreStateInternalsTest is flaky
https://github.com/apache/beam/issues/20976 apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics is flaky
https://github.com/apache/beam/issues/20975 org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming: false] is flaky
https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake with grpc.FutureTimeoutError on SDK harness startup
https://github.com/apache/beam/issues/20817 Bigquery Read tests are flaky on Flink runner in Python PostCommit suites
https://github.com/apache/beam/issues/20815 testTeardownCalledAfterExceptionInProcessElement flakes on direct runner.
https://github.com/apache/beam/issues/20692 Timer with dataflow runner can be set multiple times (dataflow runner)
https://github.com/apache/beam/issues/20689 Kafka commitOffsetsInFinalize OOM on Flink
https://github.com/apache/beam/issues/20528 python CombineGlobally().with_fanout() cause duplicate combine results for sliding windows
https://github.com/apache/beam/issues/20332 FileIO writeDynamic with AvroIO.sink not writing all data
https://github.com/apache/beam/issues/20331 org.apache.beam.sdk.io.mongodb.MongoDbIOTest.testReadWithAggregate is flaky
https://github.com/apache/beam/issues/20109 SortValues should fail if SecondaryKey coder is not deterministic
https://github.com/apache/beam/issues/20108 Python direct runner doesn't emit empty pane when it should
https://github.com/apache/beam/issues/19816 MetricsTest$AttemptedMetricTests.testAllAttemptedMetrics is flaky on DirectRunner
https://github.com/apache/beam/issues/19814 Flakes in ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful for Direct, Spark, Flink
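
Many of the titles above mention flaky tests. As a rough illustration only, the list could be split into flake and non-flake entries by title keyword. This is a minimal sketch, not Beam's actual report tooling: the function name `split_flakes` and the keyword heuristic (matching "flake", "flaky", or "flakey" in the title) are assumptions made for the example, and a dedicated issue label would be more reliable if one exists.

```python
import re

# Hypothetical heuristic: a title counts as a flake report if it contains
# "flake", "flaky", or "flakey" (case-insensitive). "Flink" does not match.
FLAKE_PATTERN = re.compile(r"flak[ey]", re.IGNORECASE)


def split_flakes(lines):
    """Split 'URL title' report lines into (flakes, others) lists of (url, title)."""
    flakes, others = [], []
    for line in lines:
        # Each report line is the issue URL, one space, then the issue title.
        url, _, title = line.partition(" ")
        target = flakes if FLAKE_PATTERN.search(title) else others
        target.append((url, title))
    return flakes, others


# Three lines copied from the report above.
sample = [
    "https://github.com/apache/beam/issues/23179 [Bug]: Parquet size exploded for no apparent reason",
    "https://github.com/apache/beam/issues/22913 [Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakey",
    "https://github.com/apache/beam/issues/21470 Test flake: test_split_half_sdf",
]
flakes, others = split_flakes(sample)
print(len(flakes), "flake issues;", len(others), "other issues")
```

A title-based match will miss flakes whose titles only say "failing" (for example issue 21468), so the result is a starting point for manual triage rather than a classification.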



Re: What to do about issues that track flaky tests?

Posted by Brian Hulette via dev <de...@beam.apache.org>.
I agree with Austin on this one. It makes sense to be realistic, but I'm
concerned about blanket-reducing the priority of all flakes. Two
classes of issues that could certainly be dropped to P2:
- Issues tracking flakes that have not been sickbayed yet (e.g.
https://github.com/apache/beam/issues/21266). These tests are still
providing signal (we should notice if one goes perma-red), and clearly the
flakes aren't so painful that someone felt the need to sickbay the test.
- A sickbayed test, iff a breakage in the functionality it's testing would
be P2. This is admittedly difficult to identify.

It looks like we don't have a way to label sickbayed tests (or the inverse,
currently-failing tests). Maybe we should have one?

Another thing to note: this email reports _unassigned_ P1 issues, so
another way to remove issues from the search results would be to ensure
each flake has an owner (somehow). Maybe that's just shifting the problem,
but it could avoid the tragedy of the commons. To Manu's point, maybe those
new owners will happily discover their flake is no longer a problem.

Brian

On Wed, Sep 14, 2022 at 5:58 PM Manu Zhang <ow...@gmail.com> wrote:

> Agreed. I also mentioned in a previous email that some issues have been
> open for a long time (since before being migrated to GitHub) and it's
> possible that those tests pass consistently now.
> We could double-check and close them, since reopening is just one click.
>
> Manu
>
> On Thu, Sep 15, 2022 at 6:58 AM Austin Bennett <whatwouldaustindo@gmail.com> wrote:
>
>> +1 to being realistic -- proper labels are worthwhile.  Though, some
>> flaky tests probably should be P1, and just because one isn't addressed in
>> a timely manner doesn't mean it isn't a P1 - though, it does mean it wasn't
>> addressed.
>>
>>
>>
>> On Wed, Sep 14, 2022 at 1:19 PM Kenneth Knowles <ke...@apache.org> wrote:
>>
>>> I would like to make this alert email actionable.
>>>
>>> I went through most of these issues. About half are P1 "flake" issues. I
>>> don't think magically expecting them to be deflaked is helpful. So I have a
>>> couple ideas:
>>>
>>> 1. Exclude "flake" P1s from this email. This is what we used to do. But
>>> then... are they really P1s?
>>> 2. Make "flake" bugs P2 if they are not currently impacting our test
>>> signal. But then... we may have a gap in test coverage that could cause
>>> severe problems. But anyhow something that is P1 for a long time is not
>>> *really* P1, so it is just being realistic.
>>>
>>> What do you all think?
>>>
>>> Kenn

Re: What to do about issues that track flaky tests?

Posted by Manu Zhang <ow...@gmail.com>.
Agreed. I also mentioned in a previous email that some issues have been
open for a long time (since before being migrated to GitHub) and it's
possible that those tests pass consistently now.
We could double-check and close them, since reopening is just one click.

Manu

On Thu, Sep 15, 2022 at 6:58 AM Austin Bennett <wh...@gmail.com>
wrote:

> +1 to being realistic -- proper labels are worthwhile.  Though, some flaky
> tests probably should be P1, and just because one isn't addressed in a
> timely manner doesn't mean it isn't a P1 - though, it does mean it wasn't
> addressed.
>
>
>
> On Wed, Sep 14, 2022 at 1:19 PM Kenneth Knowles <ke...@apache.org> wrote:
>
>> I would like to make this alert email actionable.
>>
>> I went through most of these issues. About half are P1 "flake" issues. I
>> don't think magically expecting them to be deflaked is helpful. So I have a
>> couple ideas:
>>
>> 1. Exclude "flake" P1s from this email. This is what we used to do. But
>> then... are they really P1s?
>> 2. Make "flake" bugs P2 if they are not currently impacting our test
>> signal. But then... we may have a gap in test coverage that could cause
>> severe problems. But anyhow something that is P1 for a long time is not
>> *really* P1, so it is just being realistic.
>>
>> What do you all think?
>>
>> Kenn

Re: What to do about issues that track flaky tests?

Posted by Austin Bennett <wh...@gmail.com>.
+1 to being realistic -- proper labels are worthwhile.  Though, some flaky
tests probably should be P1, and just because one isn't addressed in a
timely manner doesn't mean it isn't a P1 - though, it does mean it wasn't
addressed.



On Wed, Sep 14, 2022 at 1:19 PM Kenneth Knowles <ke...@apache.org> wrote:

> I would like to make this alert email actionable.
>
> I went through most of these issues. About half are P1 "flake" issues. I
> don't think magically expecting them to be deflaked is helpful. So I have a
> couple ideas:
>
> 1. Exclude "flake" P1s from this email. This is what we used to do. But
> then... are they really P1s?
> 2. Make "flake" bugs P2 if they are not currently impacting our test
> signal. But then... we may have a gap in test coverage that could cause
> severe problems. But anyhow something that is P1 for a long time is not
> *really* P1, so it is just being realistic.
>
> What do you all think?
>
> Kenn
>
> On Wed, Sep 14, 2022 at 3:03 AM <be...@gmail.com> wrote:
>
>> This is your daily summary of Beam's current high priority issues that
>> may need attention.
>>
>>     See https://beam.apache.org/contribute/issue-priorities for the
>> meaning and expectations around issue priorities.
>>
>> Unassigned P1 Issues:
>>
>> https://github.com/apache/beam/issues/23227 [Bug]: Python SDK
>> installation cannot generate proto with protobuf 3.20.2
>> https://github.com/apache/beam/issues/23179 [Bug]: Parquet size exploded
>> for no apparent reason
>> https://github.com/apache/beam/issues/22913 [Bug]:
>> beam_PostCommit_Java_ValidatesRunner_Flink is flakey
>> https://github.com/apache/beam/issues/22303 [Task]: Add tests to Kafka
>> SDF and fix known and discovered issues
>> https://github.com/apache/beam/issues/22299 [Bug]: JDBCIO Write freeze
>> at getConnection() in WriteFn
>> https://github.com/apache/beam/issues/21794 Dataflow runner creates a
>> new timer whenever the output timestamp is change
>> https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get
>> output to Failed Inserts PCollection
>> https://github.com/apache/beam/issues/21704
>> beam_PostCommit_Java_DataflowV2 failures parent bug
>> https://github.com/apache/beam/issues/21701
>> beam_PostCommit_Java_DataflowV1 failing with a variety of flakes and errors
>> https://github.com/apache/beam/issues/21700
>> --dataflowServiceOptions=use_runner_v2 is broken
>> https://github.com/apache/beam/issues/21696 Flink Tests failure :
>> java.lang.NoClassDefFoundError: Could not initialize class
>> org.apache.beam.runners.core.construction.SerializablePipelineOptions
>> https://github.com/apache/beam/issues/21695 DataflowPipelineResult does
>> not raise exception for unsuccessful states.
>> https://github.com/apache/beam/issues/21694 BigQuery Storage API insert
>> with writeResult retry and write to error table
>> https://github.com/apache/beam/issues/21480 flake:
>> FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
>> https://github.com/apache/beam/issues/21472 Dataflow streaming tests
>> failing new AfterSynchronizedProcessingTime test
>> https://github.com/apache/beam/issues/21471 Flakes: Failed to load cache
>> entry
>> https://github.com/apache/beam/issues/21470 Test flake:
>> test_split_half_sdf
>> https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink
>> flaky: Connection refused
>> https://github.com/apache/beam/issues/21468
>> beam_PostCommit_Python_Examples_Dataflow failing
>> https://github.com/apache/beam/issues/21467 GBK and CoGBK streaming Java
>> load tests failing
>> https://github.com/apache/beam/issues/21465 Kafka commit offset drop
>> data on failure for runners that have non-checkpointing shuffle
>> https://github.com/apache/beam/issues/21463 NPE in Flink Portable
>> ValidatesRunner streaming suite
>> https://github.com/apache/beam/issues/21462 Flake in
>> org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in
>> use
>> https://github.com/apache/beam/issues/21271 pubsublite.ReadWriteIT flaky
>> in beam_PostCommit_Java_DataflowV2
>> https://github.com/apache/beam/issues/21270
>> org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView
>> flaky on Dataflow Runner V2
>> https://github.com/apache/beam/issues/21267 WriteToBigQuery submits a
>> duplicate BQ load job if a 503 error code is returned from googleapi
>> https://github.com/apache/beam/issues/21266
>> org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
>> is flaky in Java ValidatesRunner Flink suite.
>> https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do
>> not follow spec
>> https://github.com/apache/beam/issues/21261
>> org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
>> is flaky
>> https://github.com/apache/beam/issues/21260 Python DirectRunner does not
>> emit data at GC time
>> https://github.com/apache/beam/issues/21257 Either Create or
>> DirectRunner fails to produce all elements to the following transform
>> https://github.com/apache/beam/issues/21123 Multiple jobs running on
>> Flink session cluster reuse the persistent Python environment.
>> https://github.com/apache/beam/issues/21121
>> apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
>> flakey
>> https://github.com/apache/beam/issues/21118
>> PortableRunnerTestWithExternalEnv.test_pardo_timers flaky
>> https://github.com/apache/beam/issues/21114 Already Exists: Dataset
>> apache-beam-testing:python_bq_file_loads_NNN
>> https://github.com/apache/beam/issues/21113
>> testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky
>> https://github.com/apache/beam/issues/21111 Java creates an incorrect
>> pipeline proto when core-construction-java jar is not in the CLASSPATH
>> https://github.com/apache/beam/issues/21109 SDF BoundedSource seems to
>> execute significantly slower than 'normal' BoundedSource
>> https://github.com/apache/beam/issues/20981 Python precommit flaky:
>> Failed to read inputs in the data plane
>> https://github.com/apache/beam/issues/20977 SamzaStoreStateInternalsTest
>> is flaky
>> https://github.com/apache/beam/issues/20976
>> apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
>> is flaky
>> https://github.com/apache/beam/issues/20975
>> org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming:
>> false] is flaky
>> https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake
>> with grpc.FutureTimeoutError on SDK harness startup
>> https://github.com/apache/beam/issues/20817 Bigquery Read tests are
>> flaky on Flink runner in Python PostCommit suites
>> https://github.com/apache/beam/issues/20815
>> testTeardownCalledAfterExceptionInProcessElement flakes on direct runner.
>> https://github.com/apache/beam/issues/20692 Timer with dataflow runner
>> can be set multiple times (dataflow runner)
>> https://github.com/apache/beam/issues/20689 Kafka
>> commitOffsetsInFinalize OOM on Flink
>> https://github.com/apache/beam/issues/20528 python
>> CombineGlobally().with_fanout() cause duplicate combine results for sliding
>> windows
>> https://github.com/apache/beam/issues/20332 FileIO writeDynamic with
>> AvroIO.sink not writing all data
>> https://github.com/apache/beam/issues/20331 org.apache.beam.sdk.io.mongodb.MongoDbIOTest.testReadWithAggregate
>> is flaky
>> https://github.com/apache/beam/issues/20109 SortValues should fail if
>> SecondaryKey coder is not deterministic
>> https://github.com/apache/beam/issues/20108 Python direct runner doesn't
>> emit empty pane when it should
>> https://github.com/apache/beam/issues/19816
>> MetricsTest$AttemptedMetricTests.testAllAttemptedMetrics is flaky on
>> DirectRunner
>> https://github.com/apache/beam/issues/19814 Flakes in
>> ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful
>> for Direct, Spark, Flink

What to do about issues that track flaky tests?

Posted by Kenneth Knowles <ke...@apache.org>.
I would like to make this alert email actionable.

I went through most of these issues. About half are P1 "flake" issues. I
don't think it is helpful to expect them to magically get deflaked. So I
have a couple of ideas:

1. Exclude "flake" P1s from this email. This is what we used to do. But
then... are they really P1s?
2. Make "flake" bugs P2 if they are not currently impacting our test
signal. But then... we may have a gap in test coverage that could cause
severe problems. Anyhow, something that stays P1 for a long time is not
*really* P1, so this is just being realistic.
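
To make option 1 concrete, here is a rough sketch of how the report could
split "flake" issues from the rest. It assumes a flake issue can be
recognized purely from the wording of its title (flake, flakes, flaky,
flakey), which is only a guess based on the titles in this email. The names
FLAKE_PATTERN and split_flake_issues are made up for illustration; this is
not how the report is actually generated.

```python
import re

# Assumption: a "flake" issue is one whose title contains flake, flakes,
# flaky or flakey as a standalone word. A label would be more reliable.
FLAKE_PATTERN = re.compile(r"\bflak(?:e|es|ey|y)\b", re.IGNORECASE)


def split_flake_issues(issues):
    """Split (number, title) pairs into (flakes, others), keeping order."""
    flakes, others = [], []
    for number, title in issues:
        if FLAKE_PATTERN.search(title):
            flakes.append((number, title))
        else:
            others.append((number, title))
    return flakes, others


if __name__ == "__main__":
    # A few titles copied from the report above.
    sample = [
        (23179, "[Bug]: Parquet size exploded for no apparent reason"),
        (22913, "[Bug]: beam_PostCommit_Java_ValidatesRunner_Flink is flakey"),
        (21480, "flake: FlinkRunnerTest.testEnsureStdoutStdErrIsRestored"),
        (21471, "Flakes: Failed to load cache entry"),
    ]
    flakes, others = split_flake_issues(sample)
    print("flake issues:", [number for number, _ in flakes])
    print("other issues:", [number for number, _ in others])
```

The same split would also give the candidate list for option 2, since the
"flakes" half is what would be reviewed for a move to P2.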

What do you all think?

Kenn

On Wed, Sep 14, 2022 at 3:03 AM <be...@gmail.com> wrote:

> This is your daily summary of Beam's current high priority issues that may
> need attention.
>
>     See https://beam.apache.org/contribute/issue-priorities for the
> meaning and expectations around issue priorities.
>
> Unassigned P1 Issues:
>
> https://github.com/apache/beam/issues/23227 [Bug]: Python SDK
> installation cannot generate proto with protobuf 3.20.2
> https://github.com/apache/beam/issues/23179 [Bug]: Parquet size exploded
> for no apparent reason
> https://github.com/apache/beam/issues/22913 [Bug]:
> beam_PostCommit_Java_ValidatesRunner_Flink is flakey
> https://github.com/apache/beam/issues/22303 [Task]: Add tests to Kafka
> SDF and fix known and discovered issues
> https://github.com/apache/beam/issues/22299 [Bug]: JDBCIO Write freeze at
> getConnection() in WriteFn
> https://github.com/apache/beam/issues/21794 Dataflow runner creates a new
> timer whenever the output timestamp is change
> https://github.com/apache/beam/issues/21713 404s in BigQueryIO don't get
> output to Failed Inserts PCollection
> https://github.com/apache/beam/issues/21704
> beam_PostCommit_Java_DataflowV2 failures parent bug
> https://github.com/apache/beam/issues/21701
> beam_PostCommit_Java_DataflowV1 failing with a variety of flakes and errors
> https://github.com/apache/beam/issues/21700
> --dataflowServiceOptions=use_runner_v2 is broken
> https://github.com/apache/beam/issues/21696 Flink Tests failure :
> java.lang.NoClassDefFoundError: Could not initialize class
> org.apache.beam.runners.core.construction.SerializablePipelineOptions
> https://github.com/apache/beam/issues/21695 DataflowPipelineResult does
> not raise exception for unsuccessful states.
> https://github.com/apache/beam/issues/21694 BigQuery Storage API insert
> with writeResult retry and write to error table
> https://github.com/apache/beam/issues/21480 flake:
> FlinkRunnerTest.testEnsureStdoutStdErrIsRestored
> https://github.com/apache/beam/issues/21472 Dataflow streaming tests
> failing new AfterSynchronizedProcessingTime test
> https://github.com/apache/beam/issues/21471 Flakes: Failed to load cache
> entry
> https://github.com/apache/beam/issues/21470 Test flake:
> test_split_half_sdf
> https://github.com/apache/beam/issues/21469 beam_PostCommit_XVR_Flink
> flaky: Connection refused
> https://github.com/apache/beam/issues/21468
> beam_PostCommit_Python_Examples_Dataflow failing
> https://github.com/apache/beam/issues/21467 GBK and CoGBK streaming Java
> load tests failing
> https://github.com/apache/beam/issues/21465 Kafka commit offset drop data
> on failure for runners that have non-checkpointing shuffle
> https://github.com/apache/beam/issues/21463 NPE in Flink Portable
> ValidatesRunner streaming suite
> https://github.com/apache/beam/issues/21462 Flake in
> org.apache.beam.sdk.io.mqtt.MqttIOTest.testReadObject: Address already in
> use
> https://github.com/apache/beam/issues/21271 pubsublite.ReadWriteIT flaky
> in beam_PostCommit_Java_DataflowV2
> https://github.com/apache/beam/issues/21270
> org.apache.beam.sdk.transforms.CombineTest$WindowingTests.testWindowedCombineGloballyAsSingletonView
> flaky on Dataflow Runner V2
> https://github.com/apache/beam/issues/21267 WriteToBigQuery submits a
> duplicate BQ load job if a 503 error code is returned from googleapi
> https://github.com/apache/beam/issues/21266
> org.apache.beam.sdk.transforms.ParDoLifecycleTest.testTeardownCalledAfterExceptionInProcessElementStateful
> is flaky in Java ValidatesRunner Flink suite.
> https://github.com/apache/beam/issues/21262 Python AfterAny, AfterAll do
> not follow spec
> https://github.com/apache/beam/issues/21261
> org.apache.beam.runners.dataflow.worker.fn.logging.BeamFnLoggingServiceTest.testMultipleClientsFailingIsHandledGracefullyByServer
> is flaky
> https://github.com/apache/beam/issues/21260 Python DirectRunner does not
> emit data at GC time
> https://github.com/apache/beam/issues/21257 Either Create or DirectRunner
> fails to produce all elements to the following transform
> https://github.com/apache/beam/issues/21123 Multiple jobs running on
> Flink session cluster reuse the persistent Python environment.
> https://github.com/apache/beam/issues/21121
> apache_beam.examples.streaming_wordcount_it_test.StreamingWordCountIT.test_streaming_wordcount_it
> flakey
> https://github.com/apache/beam/issues/21118
> PortableRunnerTestWithExternalEnv.test_pardo_timers flaky
> https://github.com/apache/beam/issues/21114 Already Exists: Dataset
> apache-beam-testing:python_bq_file_loads_NNN
> https://github.com/apache/beam/issues/21113
> testTwoTimersSettingEachOtherWithCreateAsInputBounded flaky
> https://github.com/apache/beam/issues/21111 Java creates an incorrect
> pipeline proto when core-construction-java jar is not in the CLASSPATH
> https://github.com/apache/beam/issues/21109 SDF BoundedSource seems to
> execute significantly slower than 'normal' BoundedSource
> https://github.com/apache/beam/issues/20981 Python precommit flaky:
> Failed to read inputs in the data plane
> https://github.com/apache/beam/issues/20977 SamzaStoreStateInternalsTest
> is flaky
> https://github.com/apache/beam/issues/20976
> apache_beam.runners.portability.flink_runner_test.FlinkRunnerTestOptimized.test_flink_metrics
> is flaky
> https://github.com/apache/beam/issues/20975
> org.apache.beam.runners.flink.ReadSourcePortableTest.testExecution[streaming:
> false] is flaky
> https://github.com/apache/beam/issues/20974 Python GHA PreCommits flake
> with grpc.FutureTimeoutError on SDK harness startup
> https://github.com/apache/beam/issues/20817 Bigquery Read tests are flaky
> on Flink runner in Python PostCommit suites
> https://github.com/apache/beam/issues/20815
> testTeardownCalledAfterExceptionInProcessElement flakes on direct runner.
> https://github.com/apache/beam/issues/20692 Timer with dataflow runner
> can be set multiple times (dataflow runner)
> https://github.com/apache/beam/issues/20689 Kafka commitOffsetsInFinalize
> OOM on Flink
> https://github.com/apache/beam/issues/20528 python
> CombineGlobally().with_fanout() cause duplicate combine results for sliding
> windows
> https://github.com/apache/beam/issues/20332 FileIO writeDynamic with
> AvroIO.sink not writing all data
> https://github.com/apache/beam/issues/20331 org.apache.beam.sdk.io.mongodb.MongoDbIOTest.testReadWithAggregate
> is flaky
> https://github.com/apache/beam/issues/20109 SortValues should fail if
> SecondaryKey coder is not deterministic
> https://github.com/apache/beam/issues/20108 Python direct runner doesn't
> emit empty pane when it should
> https://github.com/apache/beam/issues/19816
> MetricsTest$AttemptedMetricTests.testAllAttemptedMetrics is flaky on
> DirectRunner
> https://github.com/apache/beam/issues/19814 Flakes in
> ParDoLifecycleTest.testTeardownCalledAfterExceptionInStartBundleStateful
> for Direct, Spark, Flink