Posted to dev@flink.apache.org by Robert Metzger <rm...@apache.org> on 2020/10/12 19:02:51 UTC

Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Hi all!

According to the plan
<https://cwiki.apache.org/confluence/display/FLINK/1.12+Release> discussed
earlier in the release cycle, the feature freeze is expected to happen in
the week of October 26th. That's in 2.5 weeks from now.

I believe now is the time to discuss whether we want to postpone the feature
freeze.
I would prefer to stick to the original schedule and rather delay features
to the 1.13 release if they are not ready yet.

From a stability perspective, we currently have the following situation:
- 6 blockers:
https://issues.apache.org/jira/browse/FLINK-19154?filter=12349334. Most of
them are making progress; I notified people on those where the status is
unclear.
- 80 test instabilities:
https://issues.apache.org/jira/browse/FLINK-18117?filter=12348580&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC
- The CI system is a bit unstable these days: The e2e tests are often
timing out. I will look into options to mitigate this.



Drilling deeper into the test instabilities, these are some notable
clusters of test instabilities (with recent failures, usually more than
once). Tests marked with >> have nobody assigned.

E2E tests, probably all test infrastructure
>> "Kerberized YARN per-job on Docker test" fails with "Could not start
hadoop cluster." https://issues.apache.org/jira/browse/FLINK-18117
>> SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1) failed
due to download error https://issues.apache.org/jira/browse/FLINK-17424
- "ES6 ElasticsearchSinkITCase unstable"
https://issues.apache.org/jira/browse/FLINK-17159
- "Avro Confluent Schema Registry nightly end-to-end test failed with
"Register operation timed out; error code: 50002""
https://issues.apache.org/jira/browse/FLINK-19422
- "SQLClientHBaseITCase.testHBase fails on azure"
https://issues.apache.org/jira/browse/FLINK-18570

New Source API
- "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable"
https://issues.apache.org/jira/browse/FLINK-19427
>> "CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs"
https://issues.apache.org/jira/browse/FLINK-19448
- "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent gets stuck"
https://issues.apache.org/jira/browse/FLINK-19489


Distributed Coordination
- "LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with
"NoResourceAvailableException: Could not allocate the required slot within
slot request timeout"" https://issues.apache.org/jira/browse/FLINK-19237
- "TaskExecutorSubmissionTest#testFailingScheduleOrUpdateConsumers"
https://issues.apache.org/jira/browse/FLINK-17458
- "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange
times out" https://issues.apache.org/jira/browse/FLINK-19514
- "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange:
ZooKeeper unexpectedly modified"
https://issues.apache.org/jira/browse/FLINK-19458

Kafka
>> "KafkaITCase failing with "Failed to send data to Kafka: This server
does not host this topic-partition""
https://issues.apache.org/jira/browse/FLINK-18444
>> "KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
expected:<310> but was:<0>"
https://issues.apache.org/jira/browse/FLINK-17949
- "KafkaITCase.testKeyValueSupport failure due to assertion error."
https://issues.apache.org/jira/browse/FLINK-15745
- "KafkaITCase.testStartFromGroupOffsets times out on azure"
https://issues.apache.org/jira/browse/FLINK-18648
- "FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis"
https://issues.apache.org/jira/browse/FLINK-13733



On Tue, Sep 29, 2020 at 11:49 AM Dian Fu <di...@gmail.com> wrote:

> Hi all,
>
> I'd like to give an update on the status of the blocker issues and build
> instabilities, as there is only one month left and the number of blocker
> issues has increased a lot compared to last week.
>
> == Blockers:
> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334
>
> Currently there are 10 blocker issues:
> - 3 performance regressions (https://issues.apache.org/jira/browse/FLINK-19439,
> https://issues.apache.org/jira/browse/FLINK-19440,
> https://issues.apache.org/jira/browse/FLINK-19441)
> - 3 Runtime (https://issues.apache.org/jira/browse/FLINK-19264,
> https://issues.apache.org/jira/browse/FLINK-19388,
> https://issues.apache.org/jira/browse/FLINK-19249)
> - 1 HBase connector (https://issues.apache.org/jira/browse/FLINK-19445)
> - 1 Application mode (https://issues.apache.org/jira/browse/FLINK-19154)
> - 1 New source API (https://issues.apache.org/jira/browse/FLINK-19384)
> - 1 Kinesis (https://issues.apache.org/jira/browse/FLINK-19332)
>
> == Recent notable build instabilities which still have no owners:
> - New source API
>    https://issues.apache.org/jira/browse/FLINK-19253
>    SourceReaderTestBase.testAddSplitToExistingFetcher hangs
>    https://issues.apache.org/jira/browse/FLINK-19370
>    FileSourceTextLinesITCase.testContinuousTextFileSource failed as results
>    mismatch
>    https://issues.apache.org/jira/browse/FLINK-19427
>    SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable
>    https://issues.apache.org/jira/browse/FLINK-19437
>    FileSourceTextLinesITCase.testContinuousTextFileSource failed with
>    "SimpleStreamFormat is not splittable, but found split end (0) different
>    from file length (198)"
>    https://issues.apache.org/jira/browse/FLINK-19448
>    CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs
> - Runtime/Network
>    https://issues.apache.org/jira/browse/FLINK-19426
>    End-to-end test sometimes fails with PartitionConnectionException
> - Unaligned Checkpoint
>    https://issues.apache.org/jira/browse/FLINK-19027
>    UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnParallelRemoteChannel
>    failed because of test timeout
> - Table
>    https://issues.apache.org/jira/browse/FLINK-19340
>    AggregateITCase.testListAggWithDistinct failed with "expected:<List(1,A,
>    2,B, 3,C#A, 4,EF)> but was:<List(1,A, 2,B, 3,C#A, 4,EF#EF)>"
> - HBase connector
>    https://issues.apache.org/jira/browse/FLINK-18570
>    SQLClientHBaseITCase.testHBase fails on azure
>    https://issues.apache.org/jira/browse/FLINK-19447
>    HBaseConnectorITCase.HBaseTestingClusterAutoStarter failed with "Master not
>    initialized after 200000ms"
> - Avro
>    https://issues.apache.org/jira/browse/FLINK-19422
>    Avro Confluent Schema Registry nightly end-to-end test failed with
>    "Register operation timed out; error code: 50002"
>
> Regards,
> Dian
>
> > On Sep 21, 2020, at 2:32 PM, Robert Metzger <rm...@apache.org> wrote:
> >
> > Hi all,
> >
> > An update on the release status:
> > 1. We have 35 days = *5 weeks left until feature freeze*
> > 2. There are currently 2 blockers for Flink
> > <https://issues.apache.org/jira/browse/FLINK-19264?filter=12349334>, all
> > making progress
> > 3. We have 72 test instabilities
> > <https://issues.apache.org/jira/browse/FLINK-19237> (down 7 from 2 weeks
> > ago). I have pinged people to help address frequent or critical issues.
> >
> > Best,
> > Robert
> >
> >
> > On Mon, Sep 7, 2020 at 10:37 AM Robert Metzger <rm...@apache.org> wrote:
> >
> >> Hi all,
> >>
> >> Another two weeks have passed. We now have 5 blockers
> >> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334> (up
> >> 3 from 2 weeks ago), but they are all making progress.
> >>
> >> We currently have 79 test instabilities
> >> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>.
> >> Since the last report, a few have been resolved and some others have
> >> been added.
> >> I have checked the tickets, closed some old ones, and pinged people to
> >> help resolve new or frequent ones.
> >> Except for Kafka, there are no major clusters of test instabilities.
> >> Most failures come from rarely failing tests spread across the entire
> >> system.
> >>
> >>
> >> On Tue, Aug 25, 2020 at 9:05 AM Rui Li <li...@gmail.com> wrote:
> >>
> >>> Thanks Dian for the pointer. I'll take a look.
> >>>
> >>> On Tue, Aug 25, 2020 at 3:02 PM Dian Fu <di...@gmail.com> wrote:
> >>>
> >>>> Thanks Rui for the info. This Hive-related issue,
> >>>> https://issues.apache.org/jira/browse/FLINK-19025, is marked as a
> >>>> blocker.
> >>>>
> >>>> Regards,
> >>>> Dian
> >>>>
> >>>>> On Aug 25, 2020, at 2:58 PM, Rui Li <li...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Dian,
> >>>>>
> >>>>> FLINK-18682 has been fixed. Is there any other blocker in the hive
> >>>>> connector?
> >>>>>
> >>>>> On Tue, Aug 25, 2020 at 2:41 PM Dian Fu <dian0511.fu@gmail.com> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Two weeks have passed and it seems that none of the test instability
> >>>>>> issues have been addressed since then.
> >>>>>>
> >>>>>> Here is an updated status report of Blockers and Test instabilities:
> >>>>>>
> >>>>>> Blockers
> >>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> >>>>>> Currently 2 blockers (1x Hive, 1x CI Infra)
> >>>>>>
> >>>>>> Test-Instabilities
> >>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>:
> >>>>>> (total 80)
> >>>>>>
> >>>>>> Besides the issues already posted in the previous mail, here are the
> >>>>>> new instability issues which should be taken care of:
> >>>>>>
> >>>>>> - FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
> >>>>>> E2E test fails with "Cannot register Closeable, this
> >>>>>> subtaskCheckpointCoordinator is already closed. Closing argument."
> >>>>>>
> >>>>>> -> This is a new issue that occurred recently. It has occurred several
> >>>>>> times, may indicate a bug somewhere, and should be taken care of.
> >>>>>>
> >>>>>> - FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
> >>>>>> FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
> >>>>>>
> >>>>>> -> There is already a PR for it, which needs review.
> >>>>>>
> >>>>>> - FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
> >>>>>> e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount
> >>>>>> on Docker test"
> >>>>>>
> >>>>>>
> >>>>>>> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rm...@apache.org> wrote:
> >>>>>>>
> >>>>>>> Hi team,
> >>>>>>>
> >>>>>>> 2 weeks have passed since the last update. None of the test
> >>>>>>> instabilities I've mentioned have been addressed since then.
> >>>>>>>
> >>>>>>> Here's an updated status report of Blockers and Test instabilities:
> >>>>>>>
> >>>>>>> Blockers
> >>>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> >>>>>>> Currently 3 blockers (2x Hive, 1x CI Infra)
> >>>>>>>
> >>>>>>> Test-Instabilities
> >>>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>
> >>>>>>> (total 79) which failed recently or frequently:
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
> >>>>>>> FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
> >>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
> >>>>>>> EndTxn(COMMIT)"
> >>>>>>>
> >>>>>>> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
> >>>>>>> FlinkKafkaProducerITCase.testRecoverCommittedTransaction
> >>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
> >>>>>>> InitProducerId"
> >>>>>>>
> >>>>>>> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
> >>>>>>> FlinkKafkaProducerITCase
> >>>>>>> testScaleUpAfterScalingDown Timeout expired while initializing
> >>>>>>> transactional state in 60000ms.
> >>>>>>>
> >>>>>>> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
> >>>>>>> FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
> >>>>>>>
> >>>>>>> --> The first three tickets seem related.
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
> >>>>>>> StreamingKafkaITCase failure on Azure
> >>>>>>>
> >>>>>>> --> This one seems really hard to reproduce
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
> >>>>>>> HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
> >>>>>>> hangs
> >>>>>>>
> >>>>>>> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
> >>>>>>> HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
> >>>>>>> produced no output for 900 seconds
> >>>>>>>
> >>>>>>> --> Nobody seems to feel responsible for these tickets. My guess is
> >>>>>>> that the S3 connector should have shorter timeouts / faster retries to
> >>>>>>> finish within the 15-minute test timeout. Or there is really something
> >>>>>>> wrong with the code.
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
> >>>>>>> UnsignedTypeConversionITCase failed caused by MariaDB4j
> >>>>>>> "Asked to waitFor Program"
> >>>>>>>
> >>>>>>> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159>
> >>>>>>> ES6 ElasticsearchSinkITCase unstable
> >>>>>>>
> >>>>>>> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
> >>>>>>> KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
> >>>>>>> expected:<310> but was:<0>
> >>>>>>>
> >>>>>>> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222>
> >>>>>>> "Avro Confluent Schema Registry nightly end-to-end test" unstable with
> >>>>>>> "Kafka cluster did not start after 120 seconds"
> >>>>>>>
> >>>>>>> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
> >>>>>>> "RocksDB Memory Management end-to-end test" fails with "Current block
> >>>>>>> cache usage 202123272 larger than expected memory limit 200000000"
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rmetzger@apache.org> wrote:
> >>>>>>>
> >>>>>>>> Hi team,
> >>>>>>>>
> >>>>>>>> We would like to use this thread as a permanent thread for
> >>>>>>>> regularly syncing on stale blockers (which need to have somebody
> >>>>>>>> assigned within a week and show progress, or a good plan) and build
> >>>>>>>> instabilities (where we need to check if it's a blocker).
> >>>>>>>>
> >>>>>>>> Recent test-instabilities:
> >>>>>>>>
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test unstable)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test unstable)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17949 (KafkaShuffleITCase)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka transactions)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It would be nice if the committers taking care of these components
> >>>>>>>> could look into the test failures.
> >>>>>>>> If nothing happens, we'll personally reach out to people we believe
> >>>>>>>> could look into the tickets.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Dian & Robert
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Best regards!
> >>>>> Rui Li
> >>>>
> >>>>
> >>>
> >>> --
> >>> Best regards!
> >>> Rui Li
> >>>
> >>
>
>

Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Dian Fu <di...@gmail.com>.
Hi all,

Here is an update on the status of the blocker issues and build instabilities.

Currently there are 10 blocker issues and 78 test instabilities:

== Blockers (https://issues.apache.org/jira/browse/FLINK-19805?filter=12349334)
- 1 performance regression (https://issues.apache.org/jira/browse/FLINK-19441)
- 2 New source API (https://issues.apache.org/jira/browse/FLINK-19384, https://issues.apache.org/jira/browse/FLINK-19717 (PR available, needs review))
- 2 Runtime (https://issues.apache.org/jira/browse/FLINK-19645, https://issues.apache.org/jira/browse/FLINK-19805)
- 1 State Backend (https://issues.apache.org/jira/browse/FLINK-19741 (under review))
- 1 Job submission (https://issues.apache.org/jira/browse/FLINK-19909)
- 1 Format/Parquet (https://issues.apache.org/jira/browse/FLINK-19843)
- 1 YARN (https://issues.apache.org/jira/browse/FLINK-19865)
- 1 license (https://issues.apache.org/jira/browse/FLINK-19849 (under review))

== Test instabilities (https://issues.apache.org/jira/browse/FLINK-18117?filter=12348580&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC)

Recent notable build instabilities which still have no owners:
- https://issues.apache.org/jira/browse/FLINK-19838 SQLClientKafkaITCase hangs
- https://issues.apache.org/jira/browse/FLINK-19863 SQLClientHBaseITCase.testHBase failed with "java.io.IOException: Process failed due to timeout"

Regards,
Dian


> On Oct 14, 2020, at 12:34 PM, Yu Li <ca...@gmail.com> wrote:
> 
> Thanks for monitoring the release progress and kindly reminding us, Robert!
> 
> Minor: the below link shows the complete list of existing blockers:
> https://issues.apache.org/jira/issues/?filter=12349334
> 
> Best Regards,
> Yu
> 
> 
> On Tue, 13 Oct 2020 at 03:03, Robert Metzger <rm...@apache.org> wrote:
> 
>> Hi all!
>> 
>> According to the plan
>> <https://cwiki.apache.org/confluence/display/FLINK/1.12+Release> discussed
>> earlier in the release cycle, the feature freeze is expected to happen in
>> the week of October 26th. That's in 2.5 weeks from now.
>> 
>> I believe now is the time to discuss if we want to postpone the feature
>> freeze.
>> In my opinion, I would prefer to stick to the original schedule and rather
>> delay features to the 1.13 release if they are not ready yet.
>> 
>> From a stability perspective, we currently have the following situation:
>> - 6 blockers:
>> https://issues.apache.org/jira/browse/FLINK-19154?filter=12349334, most of
>> them are making progress, I notified people on those where the status is
>> unclear.
>> - 80 test instabilities:
>> 
>> https://issues.apache.org/jira/browse/FLINK-18117?filter=12348580&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC
>> - The CI system is a bit unstable these days: The e2e tests are often
>> timing out. I will look into options to mitigate this.
>> 
>> 
>> 
>> Drilling deeper into the test instabilities, these are some notable
>> clusters of test instabilities  (with recent failures, usually more than
>> once) [tests marked with >> have nobody assigned]
>> 
>> E2E tests, probably all test infrastructure
>>>> "Kerberized YARN per-job on Docker test" fails with "Could not start
>> hadoop cluster." https://issues.apache.org/jira/browse/FLINK-18117
>>>> SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1) failed
>> due to download error https://issues.apache.org/jira/browse/FLINK-17424
>> - "ES6 ElasticsearchSinkITCase unstable"
>> https://issues.apache.org/jira/browse/FLINK-17159
>> - "Avro Confluent Schema Registry nightly end-to-end test failed with
>> "Register operation timed out; error code: 50002""
>> https://issues.apache.org/jira/browse/FLINK-19422
>> - "SQLClientHBaseITCase.testHBase fails on azure"
>> https://issues.apache.org/jira/browse/FLINK-18570
>> 
>> New Source API
>> - "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable"
>> https://issues.apache.org/jira/browse/FLINK-19427
>>>> "CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs"
>> https://issues.apache.org/jira/browse/FLINK-19448
>> - "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent gets stuck"
>> https://issues.apache.org/jira/browse/FLINK-19489
>> 
>> 
>> Distributed Coordination
>> - "LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with
>> "NoResourceAvailableException: Could not allocate the required slot within
>> slot request timeout" https://issues.apache.org/jira/browse/FLINK-19237
>> - "TaskExecutorSubmissionTest#testFailingScheduleOrUpdateConsumers"
>> https://issues.apache.org/jira/browse/FLINK-17458
>> - "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange
>> times out" https://issues.apache.org/jira/browse/FLINK-19514
>> - "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange:
>> ZooKeeper unexpectedly modified"
>> https://issues.apache.org/jira/browse/FLINK-19458
>> 
>> Kafka
>>>> "KafkaITCase failing with "Failed to send data to Kafka: This server
>> does not host this topic-partition""
>> https://issues.apache.org/jira/browse/FLINK-18444
>>>> "KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
>> expected:<310> but was:<0>"
>> https://issues.apache.org/jira/browse/FLINK-17949
>> - "KafkaITCase.testKeyValueSupport failure due to assertion error.""
>> https://issues.apache.org/jira/browse/FLINK-15745
>> - "KafkaITCase.testStartFromGroupOffsets times out on azure"
>> https://issues.apache.org/jira/browse/FLINK-18648
>> - "FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis"
>> https://issues.apache.org/jira/browse/FLINK-13733
>> 
>> 
>> 
>> On Tue, Sep 29, 2020 at 11:49 AM Dian Fu <di...@gmail.com> wrote:
>> 
>>> Hi all,
>>> 
>>> I'd like to update the status about the blocker issues and build
>>> instabilities as there is only one month left and the number of blocker
>>> issues increases a lot compared to last week.
>>> 
>>> == Blockers:
>>> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334 <
>>> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>
>>> 
>>> Currently there are 10 blocker issues
>>> - 3 performance regression (
>>> https://issues.apache.org/jira/browse/FLINK-19439 <
>>> https://issues.apache.org/jira/browse/FLINK-19439>,
>>> https://issues.apache.org/jira/browse/FLINK-19440 <
>>> https://issues.apache.org/jira/browse/FLINK-19440>,
>>> https://issues.apache.org/jira/browse/FLINK-19441 <
>>> https://issues.apache.org/jira/browse/FLINK-19441>)
>>> - 3 Runtime (https://issues.apache.org/jira/browse/FLINK-19264 <
>>> https://issues.apache.org/jira/browse/FLINK-19264>,
>>> https://issues.apache.org/jira/browse/FLINK-19388 <
>>> https://issues.apache.org/jira/browse/FLINK-19388>,
>>> https://issues.apache.org/jira/browse/FLINK-19249 <
>>> https://issues.apache.org/jira/browse/FLINK-19249>)
>>> - 1 HBase connector (https://issues.apache.org/jira/browse/FLINK-19445 <
>>> https://issues.apache.org/jira/browse/FLINK-19445>)
>>> - 1 Application mode (https://issues.apache.org/jira/browse/FLINK-19154
>> <
>>> https://issues.apache.org/jira/browse/FLINK-19154>)
>>> - 1 New source API (https://issues.apache.org/jira/browse/FLINK-19384 <
>>> https://issues.apache.org/jira/browse/FLINK-19384>)
>>> - 1 Kinesis (https://issues.apache.org/jira/browse/FLINK-19332 <
>>> https://issues.apache.org/jira/browse/FLINK-19332>)
>>> 
>>> == Recent notable build instabilities which still have no owners:
>>> - New source API
>>>   https://issues.apache.org/jira/browse/FLINK-19253 <
>>> https://issues.apache.org/jira/browse/FLINK-19253>
>>> SourceReaderTestBase.testAddSplitToExistingFetcher hangs
>>>   https://issues.apache.org/jira/browse/FLINK-19370 <
>>> https://issues.apache.org/jira/browse/FLINK-19370>
>>> FileSourceTextLinesITCase.testContinuousTextFileSource failed as results
>>> mismatch
>>>   https://issues.apache.org/jira/browse/FLINK-19427 <
>>> https://issues.apache.org/jira/browse/FLINK-19427>
>>> SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable,
>>>   https://issues.apache.org/jira/browse/FLINK-19437 <
>>> https://issues.apache.org/jira/browse/FLINK-19437>
>>> FileSourceTextLinesITCase.testContinuousTextFileSource failed with
>>> "SimpleStreamFormat is not splittable, but found split end (0) different
>>> from file length (198)"
>>>   https://issues.apache.org/jira/browse/FLINK-19448 <
>>> https://issues.apache.org/jira/browse/FLINK-19448>
>>> CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs
>>> - Runtime/Network
>>>   https://issues.apache.org/jira/browse/FLINK-19426 <
>>> https://issues.apache.org/jira/browse/FLINK-19426>  End-to-end test
>>> sometimes fails with PartitionConnectionException
>>> - Unaligned Checkpoint
>>>   https://issues.apache.org/jira/browse/FLINK-19027 <
>>> https://issues.apache.org/jira/browse/FLINK-19027>
>>> 
>> UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnParallelRemoteChannel
>>> failed because of test timeout
>>> - Table
>>>   https://issues.apache.org/jira/browse/FLINK-19340 <
>>> https://issues.apache.org/jira/browse/FLINK-19340>
>>> AggregateITCase.testListAggWithDistinct failed with "expected:<List(1,A,
>>> 2,B, 3,C#A, 4,EF)> but was:<List(1,A, 2,B, 3,C#A, 4,EF#EF)>"
>>> - HBase connector
>>>   https://issues.apache.org/jira/browse/FLINK-18570 <
>>> https://issues.apache.org/jira/browse/FLINK-18570>
>>> SQLClientHBaseITCase.testHBase fails on azure
>>>    https://issues.apache.org/jira/browse/FLINK-19447 <
>>> https://issues.apache.org/jira/browse/FLINK-19447>
>>> HBaseConnectorITCase.HBaseTestingClusterAutoStarter failed with "Master
>> not
>>> initialized after 200000ms"
>>> - Avro
>>>   https://issues.apache.org/jira/browse/FLINK-19422 <
>>> https://issues.apache.org/jira/browse/FLINK-19422>  Avro Confluent
>> Schema
>>> Registry nightly end-to-end test failed with "Register operation timed
>> out;
>>> error code: 50002"
>>> 
>>> Regards,
>>> Dian
>>> 
>>>> 在 2020年9月21日,下午2:32,Robert Metzger <rm...@apache.org> 写道:
>>>> 
>>>> Hi all,
>>>> 
>>>> An update on the release status:
>>>> 1. We have 35 days = *5 weeks left until feature freeze*
>>>> 2. There are currently 2 blockers for Flink
>>>> <https://issues.apache.org/jira/browse/FLINK-19264?filter=12349334>,
>> all
>>>> making progress
>>>> 3. We have 72 test instabilities
>>>> <https://issues.apache.org/jira/browse/FLINK-19237> (down 7 from 2
>> weeks
>>>> ago). I have pinged people to help addressing frequent or critical
>>> issues.
>>>> 
>>>> Best,
>>>> Robert
>>>> 
>>>> 
>>>> On Mon, Sep 7, 2020 at 10:37 AM Robert Metzger <rm...@apache.org>
>>> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> another two weeks have passed. We now have 5 blockers
>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>
>> (Up
>>>>> 3 from 2 weeks ago), but they are all making progress.
>>>>> 
>>>>> We currently have 79 test-instabilities
>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>,
>>>>> since the last report, a few have been resolved, and some others have
>>> been
>>>>> added.
>>>>> I have checked the tickets, closed some old ones and pinged people to
>>> help
>>>>> resolve new or frequent ones.
>>>>> Except for Kafka, there are no major clusters of test instabilities.
>>> Most
>>>>> failures are rarely failing tests across the entire system.
>>>>> 
>>>>> 
>>>>> On Tue, Aug 25, 2020 at 9:05 AM Rui Li <li...@gmail.com> wrote:
>>>>> 
>>>>>> Thanks Dian for the pointer. I'll take a look.
>>>>>> 
>>>>>> On Tue, Aug 25, 2020 at 3:02 PM Dian Fu <di...@gmail.com>
>> wrote:
>>>>>> 
>>>>>>> Thanks Rui for the info. This issue(hive related)
>>>>>>> https://issues.apache.org/jira/browse/FLINK-19025 <
>>>>>>> https://issues.apache.org/jira/browse/FLINK-19025> is marked as a
>>>>>> blocker.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Dian
>>>>>>> 
>>>>>>>> 在 2020年8月25日,下午2:58,Rui Li <li...@gmail.com> 写道:
>>>>>>>> 
>>>>>>>> Hi Dian,
>>>>>>>> 
>>>>>>>> FLINK-18682 has been fixed. Is there any other blocker in the hive
>>>>>>>> connector?
>>>>>>>> 
>>>>>>>> On Tue, Aug 25, 2020 at 2:41 PM Dian Fu <dian0511.fu@gmail.com
>>>>>> <mailto:
>>>>>>> dian0511.fu@gmail.com>> wrote:
>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> Two weeks have passed and it seems that none of the test
>> stabilities
>>>>>>>>> issues have been addressed since then.
>>>>>>>>> 
>>>>>>>>> Here is an updated status report of Blockers and Test
>> instabilities:
>>>>>>>>> 
>>>>>>>>> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
>>>>>>>>> Currently 2 blockers (1x Hive, 1x CI Infra)
>>>>>>>>> 
>>>>>>>>> Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>:
>>>>>>>>> (total 80)
>>>>>>>>> 
>>>>>>>>> Besides the issues already posted in the previous mail, here are the
>>>>>>>>> new instability issues which should be taken care of:
>>>>>>>>> 
>>>>>>>>> - FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
>>>>>>>>> E2E test fails with "Cannot register Closeable, this
>>>>>>>>> subtaskCheckpointCoordinator is already closed. Closing argument."
>>>>>>>>> 
>>>>>>>>> -> This is a new issue that has occurred several times recently. It
>>>>>>>>> may indicate a bug somewhere and should be taken care of.
>>>>>>>>> 
>>>>>>>>> - FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
>>>>>>>>> FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
>>>>>>>>> 
>>>>>>>>> -> There is already a PR for it, which needs review.
>>>>>>>>> 
>>>>>>>>> - FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
>>>>>>>>> e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount
>>>>>>>>> on Docker test"
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rm...@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi team,
>>>>>>>>>> 
>>>>>>>>>> 2 weeks have passed since the last update. None of the test
>>>>>>>>>> instabilities I've mentioned have been addressed since then.
>>>>>>>>>> 
>>>>>>>>>> Here's an updated status report of Blockers and Test instabilities:
>>>>>>>>>> 
>>>>>>>>>> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
>>>>>>>>>> Currently 3 blockers (2x Hive, 1x CI Infra)
>>>>>>>>>> 
>>>>>>>>>> Test-Instabilities
>>>>>>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>
>>>>>>>>>> (total 79) which failed recently or frequently:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
>>>>>>>>>> FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
>>>>>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
>>>>>>>>>> EndTxn(COMMIT)"
>>>>>>>>>> 
>>>>>>>>>> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
>>>>>>>>>> FlinkKafkaProducerITCase.testRecoverCommittedTransaction
>>>>>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
>>>>>>>>>> InitProducerId"
>>>>>>>>>> 
>>>>>>>>>> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
>>>>>>>>>> FlinkKafkaProducerITCase
>>>>>>>>>> testScaleUpAfterScalingDown Timeout expired while initializing
>>>>>>>>>> transactional state in 60000ms.
>>>>>>>>>> 
>>>>>>>>>> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
>>>>>>>>>> FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
>>>>>>>>>> 
>>>>>>>>>> --> The first three tickets seem related.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
>>>>>>>>>> StreamingKafkaITCase failure on Azure
>>>>>>>>>> 
>>>>>>>>>> --> This one seems really hard to reproduce
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
>>>>>>>>>> HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
>>>>>>>>>> hangs
>>>>>>>>>> 
>>>>>>>>>> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
>>>>>>>>>> HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
>>>>>>>>>> produced no output for 900 seconds
>>>>>>>>>> 
>>>>>>>>>> --> nobody seems to feel responsible for these tickets. My guess is
>>>>>>>>>> that the S3 connector should have shorter timeouts / faster retries to
>>>>>>>>>> finish within the 15 minutes test timeout. OR there is really
>>>>>>>>>> something wrong with the code.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
>>>>>>>>>> UnsignedTypeConversionITCase failed caused by MariaDB4j
>>>>>>>>>> "Asked to waitFor Program"
>>>>>>>>>> 
>>>>>>>>>> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159>
>>>>>>>>>> ES6 ElasticsearchSinkITCase unstable
>>>>>>>>>> 
>>>>>>>>>> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
>>>>>>>>>> KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
>>>>>>>>>> expected:<310> but was:<0>
>>>>>>>>>> 
>>>>>>>>>> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222>
>>>>>>>>>> "Avro Confluent Schema Registry nightly end-to-end test" unstable with
>>>>>>>>>> "Kafka cluster did not start after 120 seconds"
>>>>>>>>>> 
>>>>>>>>>> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
>>>>>>>>>> "RocksDB Memory Management end-to-end test" fails with "Current block
>>>>>>>>>> cache usage 202123272 larger than expected memory limit 200000000"
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rmetzger@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi team,
>>>>>>>>>>> 
>>>>>>>>>>> We would like to use this thread as a permanent thread for
>>>>>>>>>>> regularly syncing on stale blockers (need to have somebody assigned
>>>>>>>>>>> within a week and progress, or a good plan) and build instabilities
>>>>>>>>>>> (need to check if it's a blocker).
>>>>>>>>>>> 
>>>>>>>>>>> Recent test-instabilities:
>>>>>>>>>>> 
>>>>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
>>>>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test unstable)
>>>>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test unstable)
>>>>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17949
>>>>>>>>>>> (KafkaShuffleITCase)
>>>>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka
>>>>>>>>>>> transactions)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> It would be nice if the committers taking care of these components
>>>>>>>>>>> could look into the test failures.
>>>>>>>>>>> If nothing happens, we'll personally reach out to people we believe
>>>>>>>>>>> could look into the ticket.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Dian & Robert
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Best regards!
>>>>>>>> Rui Li
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Best regards!
>>>>>> Rui Li
>>>>>> 
>>>>> 
>>> 
>>> 
>> 


Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Yu Li <ca...@gmail.com>.
Thanks for monitoring the release progress and kindly reminding us, Robert!

Minor: the below link shows the complete list of existing blockers:
https://issues.apache.org/jira/issues/?filter=12349334

Best Regards,
Yu

