Posted to dev@flink.apache.org by Robert Metzger <rm...@apache.org> on 2020/07/27 18:42:02 UTC

[DISCUSS][Release 1.12] Stale blockers and build instabilities

Hi team,

We would like to use this thread as a permanent thread for
regularly syncing on stale blockers (each needs somebody assigned within
a week and either progress or a good plan) and build instabilities (we need
to check whether each one is a blocker).

Recent test-instabilities:

   - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
   - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test unstable)
   - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test unstable)
   - https://issues.apache.org/jira/browse/FLINK-17949 (KafkaShuffleITCase)
   - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka transactions)


It would be nice if the committers taking care of these components could
look into the test failures.
If nothing happens, we'll personally reach out to the people we believe
could look into the tickets.

Best,
Dian & Robert

Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Dian Fu <di...@gmail.com>.
Hi all,

Here is an update on the current status of the blocker issues and build instabilities.

Currently there are 10 blocker issues and 78 test instabilities:

== Blockers (https://issues.apache.org/jira/browse/FLINK-19805?filter=12349334)
- 1 performance regression (https://issues.apache.org/jira/browse/FLINK-19441)
- 2 New source API (https://issues.apache.org/jira/browse/FLINK-19384, https://issues.apache.org/jira/browse/FLINK-19717 (PR available, needs review))
- 2 Runtime (https://issues.apache.org/jira/browse/FLINK-19645, https://issues.apache.org/jira/browse/FLINK-19805)
- 1 State Backend (https://issues.apache.org/jira/browse/FLINK-19741 (under review))
- 1 Job submission (https://issues.apache.org/jira/browse/FLINK-19909)
- 1 Format/Parquet (https://issues.apache.org/jira/browse/FLINK-19843)
- 1 YARN (https://issues.apache.org/jira/browse/FLINK-19865)
- 1 License (https://issues.apache.org/jira/browse/FLINK-19849 (under review))

== Test instabilities (https://issues.apache.org/jira/browse/FLINK-18117?filter=12348580&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC)
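
For anyone who wants to reuse the test-instability filter, the jql parameter in the link above is URL-encoded. The following is a minimal Python sketch (not part of the original mail; the helper name decode_jql is made up for illustration) that recovers the readable query using only the standard library:

```python
from urllib.parse import urlsplit, parse_qs

# The test-instability filter link quoted in this mail.
url = (
    "https://issues.apache.org/jira/browse/FLINK-18117"
    "?filter=12348580&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved"
    "%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC"
)


def decode_jql(link: str) -> str:
    """Return the percent-decoded value of the 'jql' query parameter (hypothetical helper)."""
    params = parse_qs(urlsplit(link).query)  # parse_qs percent-decodes the values
    return params["jql"][0]


print(decode_jql(url))
# project = FLINK AND resolution = Unresolved AND labels = test-stability ORDER BY updated DESC, created DESC
```

The decoded text can be pasted into the JIRA issue search as-is.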

Recent notable build instabilities that still have no owner:
- https://issues.apache.org/jira/browse/FLINK-19838 SQLClientKafkaITCase hangs
- https://issues.apache.org/jira/browse/FLINK-19863 SQLClientHBaseITCase.testHBase failed with "java.io.IOException: Process failed due to timeout"

Regards,
Dian


> On Oct 14, 2020, at 12:34 PM, Yu Li <ca...@gmail.com> wrote:
> 
> Thanks for monitoring the release progress and kindly reminding us Robert!
> 
> Minor: the below link shows the complete list of existing blockers:
> https://issues.apache.org/jira/issues/?filter=12349334
> 
> Best Regards,
> Yu
> 
> 
> On Tue, 13 Oct 2020 at 03:03, Robert Metzger <rm...@apache.org> wrote:
> 
>> Hi all!
>> 
>> According to the plan
>> <https://cwiki.apache.org/confluence/display/FLINK/1.12+Release> discussed
>> earlier in the release cycle, the feature freeze is expected to happen in
>> the week of October 26th. That's in 2.5 weeks from now.
>> 
>> I believe now is the time to discuss if we want to postpone the feature
>> freeze.
>> In my opinion, I would prefer to stick to the original schedule and rather
>> delay features to the 1.13 release if they are not ready yet.
>> 
>> From a stability perspective, we currently have the following situation:
>> - 6 blockers:
>> https://issues.apache.org/jira/browse/FLINK-19154?filter=12349334
>> Most of them are making progress; I have notified people on those where
>> the status is unclear.
>> - 80 test instabilities:
>> https://issues.apache.org/jira/browse/FLINK-18117?filter=12348580&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC
>> - The CI system is a bit unstable these days: the e2e tests are often
>> timing out. I will look into options to mitigate this.
>> 
>> 
>> 
>> Drilling deeper into the test instabilities, these are some notable
>> clusters of test instabilities  (with recent failures, usually more than
>> once) [tests marked with >> have nobody assigned]
>> 
>> E2E tests, probably all test infrastructure
>>>> "Kerberized YARN per-job on Docker test" fails with "Could not start
>> hadoop cluster." https://issues.apache.org/jira/browse/FLINK-18117
>>>> SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1) failed
>> due to download error https://issues.apache.org/jira/browse/FLINK-17424
>> - "ES6 ElasticsearchSinkITCase unstable"
>> https://issues.apache.org/jira/browse/FLINK-17159
>> - "Avro Confluent Schema Registry nightly end-to-end test failed with
>> "Register operation timed out; error code: 50002""
>> https://issues.apache.org/jira/browse/FLINK-19422
>> - "SQLClientHBaseITCase.testHBase fails on azure"
>> https://issues.apache.org/jira/browse/FLINK-18570
>> 
>> New Source API
>> - "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable"
>> https://issues.apache.org/jira/browse/FLINK-19427
>>>> "CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs"
>> https://issues.apache.org/jira/browse/FLINK-19448
>> - "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent gets stuck"
>> https://issues.apache.org/jira/browse/FLINK-19489
>> 
>> 
>> Distributed Coordination
>> - "LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with
>> "NoResourceAvailableException: Could not allocate the required slot within
>> slot request timeout"" https://issues.apache.org/jira/browse/FLINK-19237
>> - "TaskExecutorSubmissionTest#testFailingScheduleOrUpdateConsumers"
>> https://issues.apache.org/jira/browse/FLINK-17458
>> - "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange
>> times out" https://issues.apache.org/jira/browse/FLINK-19514
>> - "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange:
>> ZooKeeper unexpectedly modified"
>> https://issues.apache.org/jira/browse/FLINK-19458
>> 
>> Kafka
>>>> "KafkaITCase failing with "Failed to send data to Kafka: This server
>> does not host this topic-partition""
>> https://issues.apache.org/jira/browse/FLINK-18444
>>>> "KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
>> expected:<310> but was:<0>"
>> https://issues.apache.org/jira/browse/FLINK-17949
>> - "KafkaITCase.testKeyValueSupport failure due to assertion error."
>> https://issues.apache.org/jira/browse/FLINK-15745
>> - "KafkaITCase.testStartFromGroupOffsets times out on azure"
>> https://issues.apache.org/jira/browse/FLINK-18648
>> - "FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis"
>> https://issues.apache.org/jira/browse/FLINK-13733
>> 
>> 
>> 
>> On Tue, Sep 29, 2020 at 11:49 AM Dian Fu <di...@gmail.com> wrote:
>> 
>>> Hi all,
>>> 
>>> I'd like to give an update on the status of the blocker issues and build
>>> instabilities, as there is only one month left and the number of blocker
>>> issues has increased a lot compared to last week.
>>> 
>>> == Blockers:
>>> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334
>>> 
>>> Currently there are 10 blocker issues
>>> - 3 performance regression (
>>> https://issues.apache.org/jira/browse/FLINK-19439,
>>> https://issues.apache.org/jira/browse/FLINK-19440,
>>> https://issues.apache.org/jira/browse/FLINK-19441)
>>> - 3 Runtime (https://issues.apache.org/jira/browse/FLINK-19264,
>>> https://issues.apache.org/jira/browse/FLINK-19388,
>>> https://issues.apache.org/jira/browse/FLINK-19249)
>>> - 1 HBase connector (https://issues.apache.org/jira/browse/FLINK-19445)
>>> - 1 Application mode (https://issues.apache.org/jira/browse/FLINK-19154)
>>> - 1 New source API (https://issues.apache.org/jira/browse/FLINK-19384)
>>> - 1 Kinesis (https://issues.apache.org/jira/browse/FLINK-19332)
>>> 
>>> == Recent notable build instabilities which still have no owners:
>>> - New source API
>>>   https://issues.apache.org/jira/browse/FLINK-19253
>>>   SourceReaderTestBase.testAddSplitToExistingFetcher hangs
>>>   https://issues.apache.org/jira/browse/FLINK-19370
>>>   FileSourceTextLinesITCase.testContinuousTextFileSource failed as results
>>>   mismatch
>>>   https://issues.apache.org/jira/browse/FLINK-19427
>>>   SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable
>>>   https://issues.apache.org/jira/browse/FLINK-19437
>>>   FileSourceTextLinesITCase.testContinuousTextFileSource failed with
>>>   "SimpleStreamFormat is not splittable, but found split end (0) different
>>>   from file length (198)"
>>>   https://issues.apache.org/jira/browse/FLINK-19448
>>>   CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs
>>> - Runtime/Network
>>>   https://issues.apache.org/jira/browse/FLINK-19426
>>>   End-to-end test sometimes fails with PartitionConnectionException
>>> - Unaligned Checkpoint
>>>   https://issues.apache.org/jira/browse/FLINK-19027
>>>   UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnParallelRemoteChannel
>>>   failed because of test timeout
>>> - Table
>>>   https://issues.apache.org/jira/browse/FLINK-19340
>>>   AggregateITCase.testListAggWithDistinct failed with "expected:<List(1,A,
>>>   2,B, 3,C#A, 4,EF)> but was:<List(1,A, 2,B, 3,C#A, 4,EF#EF)>"
>>> - HBase connector
>>>   https://issues.apache.org/jira/browse/FLINK-18570
>>>   SQLClientHBaseITCase.testHBase fails on azure
>>>   https://issues.apache.org/jira/browse/FLINK-19447
>>>   HBaseConnectorITCase.HBaseTestingClusterAutoStarter failed with "Master not
>>>   initialized after 200000ms"
>>> - Avro
>>>   https://issues.apache.org/jira/browse/FLINK-19422
>>>   Avro Confluent Schema Registry nightly end-to-end test failed with
>>>   "Register operation timed out; error code: 50002"
>>> 
>>> Regards,
>>> Dian
>>> 
>>>> On Sep 21, 2020, at 2:32 PM, Robert Metzger <rm...@apache.org> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> An update on the release status:
>>>> 1. We have 35 days = *5 weeks left until feature freeze*
>>>> 2. There are currently 2 blockers for Flink
>>>> <https://issues.apache.org/jira/browse/FLINK-19264?filter=12349334>, all
>>>> making progress
>>>> 3. We have 72 test instabilities
>>>> <https://issues.apache.org/jira/browse/FLINK-19237> (down 7 from 2 weeks
>>>> ago). I have pinged people to help address frequent or critical issues.
>>>> 
>>>> Best,
>>>> Robert
>>>> 
>>>> 
>>>> On Mon, Sep 7, 2020 at 10:37 AM Robert Metzger <rm...@apache.org> wrote:
>>>> 
>>>>> Hi all,
>>>>> 
>>>>> another two weeks have passed. We now have 5 blockers
>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334> (up
>>>>> 3 from 2 weeks ago), but they are all making progress.
>>>>> 
>>>>> We currently have 79 test-instabilities
>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>.
>>>>> Since the last report, a few have been resolved, and some others have been
>>>>> added.
>>>>> I have checked the tickets, closed some old ones and pinged people to help
>>>>> resolve new or frequent ones.
>>>>> Except for Kafka, there are no major clusters of test instabilities. Most
>>>>> failures come from tests that fail only rarely, spread across the entire
>>>>> system.
>>>>> 
>>>>> 
>>>>> On Tue, Aug 25, 2020 at 9:05 AM Rui Li <li...@gmail.com> wrote:
>>>>> 
>>>>>> Thanks Dian for the pointer. I'll take a look.
>>>>>> 
>>>>>> On Tue, Aug 25, 2020 at 3:02 PM Dian Fu <di...@gmail.com> wrote:
>>>>>> 
>>>>>>> Thanks Rui for the info. This Hive-related issue,
>>>>>>> https://issues.apache.org/jira/browse/FLINK-19025, is marked as a
>>>>>>> blocker.
>>>>>>> 
>>>>>>> Regards,
>>>>>>> Dian
>>>>>>> 
>>>>>>>> On Aug 25, 2020, at 2:58 PM, Rui Li <li...@gmail.com> wrote:
>>>>>>>> 
>>>>>>>> Hi Dian,
>>>>>>>> 
>>>>>>>> FLINK-18682 has been fixed. Is there any other blocker in the hive
>>>>>>>> connector?
>>>>>>>> 
>>>>>>>> On Tue, Aug 25, 2020 at 2:41 PM Dian Fu <dian0511.fu@gmail.com> wrote:
>>>>>>>> 
>>>>>>>>> Hi all,
>>>>>>>>> 
>>>>>>>>> Two weeks have passed and it seems that none of the test stability
>>>>>>>>> issues have been addressed since then.
>>>>>>>>> 
>>>>>>>>> Here is an updated status report of Blockers and Test instabilities:
>>>>>>>>> 
>>>>>>>>> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
>>>>>>>>> Currently 2 blockers (1x Hive, 1x CI Infra)
>>>>>>>>> 
>>>>>>>>> Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>:
>>>>>>>>> (total 80)
>>>>>>>>> 
>>>>>>>>> Besides the issues already posted in the previous mail, here are the new
>>>>>>>>> instability issues which should be taken care of:
>>>>>>>>> 
>>>>>>>>> - FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
>>>>>>>>> E2E test fails with "Cannot register Closeable, this
>>>>>>>>> subtaskCheckpointCoordinator is already closed. Closing argument."
>>>>>>>>> 
>>>>>>>>> -> This is a new issue that has occurred several times recently. It
>>>>>>>>> may indicate a bug somewhere and should be taken care of.
>>>>>>>>> 
>>>>>>>>> - FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
>>>>>>>>> FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
>>>>>>>>> 
>>>>>>>>> -> There is already a PR for it, which needs review.
>>>>>>>>> 
>>>>>>>>> - FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
>>>>>>>>> e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount on
>>>>>>>>> Docker test"
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>>> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rm...@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>> Hi team,
>>>>>>>>>> 
>>>>>>>>>> 2 weeks have passed since the last update. None of the test
>>>>>>>>>> instabilities I've mentioned have been addressed since then.
>>>>>>>>>> 
>>>>>>>>>> Here's an updated status report of Blockers and Test instabilities:
>>>>>>>>>> 
>>>>>>>>>> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
>>>>>>>>>> Currently 3 blockers (2x Hive, 1x CI Infra)
>>>>>>>>>> 
>>>>>>>>>> Test-Instabilities
>>>>>>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580> (total
>>>>>>>>>> 79) which failed recently or frequently:
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
>>>>>>>>>> FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
>>>>>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
>>>>>>>>>> EndTxn(COMMIT)"
>>>>>>>>>> 
>>>>>>>>>> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
>>>>>>>>>> FlinkKafkaProducerITCase.testRecoverCommittedTransaction
>>>>>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
>>>>>>>>>> InitProducerId"
>>>>>>>>>> 
>>>>>>>>>> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
>>>>>>>>>> FlinkKafkaProducerITCase
>>>>>>>>>> testScaleUpAfterScalingDown Timeout expired while initializing
>>>>>>>>>> transactional state in 60000ms.
>>>>>>>>>> 
>>>>>>>>>> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
>>>>>>>>>> FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
>>>>>>>>>> 
>>>>>>>>>> --> The first three tickets seem related.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
>>>>>>>>>> StreamingKafkaITCase failure on Azure
>>>>>>>>>> 
>>>>>>>>>> --> This one seems really hard to reproduce
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
>>>>>>>>>> HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
>>>>>>>>>> hangs
>>>>>>>>>> 
>>>>>>>>>> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
>>>>>>>>>> HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
>>>>>>>>>> produced no output for 900 seconds
>>>>>>>>>> 
>>>>>>>>>> --> Nobody seems to feel responsible for these tickets. My guess is that
>>>>>>>>>> the S3 connector should have shorter timeouts / faster retries to finish
>>>>>>>>>> within the 15-minute test timeout, or there is really something wrong
>>>>>>>>>> with the code.
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
>>>>>>>>>> UnsignedTypeConversionITCase failed caused by MariaDB4j
>>>>>>>>>> "Asked to waitFor Program"
>>>>>>>>>> 
>>>>>>>>>> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159>
>>>>>>>>>> ES6 ElasticsearchSinkITCase unstable
>>>>>>>>>> 
>>>>>>>>>> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
>>>>>>>>>> KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
>>>>>>>>>> expected:<310> but was:<0>
>>>>>>>>>> 
>>>>>>>>>> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222>
>>>>>>>>>> "Avro Confluent Schema Registry nightly end-to-end test" unstable with
>>>>>>>>>> "Kafka cluster did not start after 120 seconds"
>>>>>>>>>> 
>>>>>>>>>> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
>>>>>>>>>> "RocksDB Memory Management end-to-end test" fails with "Current block
>>>>>>>>>> cache usage 202123272 larger than expected memory limit 200000000"
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> 
>>>>>>>>>> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rmetzger@apache.org> wrote:
>>>>>>>>>> 
>>>>>>>>>>> Hi team,
>>>>>>>>>>> 
>>>>>>>>>>> We would like to use this thread as a permanent thread for
>>>>>>>>>>> regularly syncing on stale blockers (each needs somebody assigned within
>>>>>>>>>>> a week and either progress or a good plan) and build instabilities (we
>>>>>>>>>>> need to check whether each one is a blocker).
>>>>>>>>>>> 
>>>>>>>>>>> Recent test-instabilities:
>>>>>>>>>>> 
>>>>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
>>>>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test unstable)
>>>>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test unstable)
>>>>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17949 (KafkaShuffleITCase)
>>>>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka transactions)
>>>>>>>>>>> 
>>>>>>>>>>> 
>>>>>>>>>>> It would be nice if the committers taking care of these components could
>>>>>>>>>>> look into the test failures.
>>>>>>>>>>> If nothing happens, we'll personally reach out to the people we believe
>>>>>>>>>>> could look into the tickets.
>>>>>>>>>>> 
>>>>>>>>>>> Best,
>>>>>>>>>>> Dian & Robert
>>>>>>>>>>> 
>>>>>>>>> 
>>>>>>>>> 
>>>>>>>> 
>>>>>>>> --
>>>>>>>> Best regards!
>>>>>>>> Rui Li
>>>>>>> 
>>>>>>> 
>>>>>> 
>>>>>> --
>>>>>> Best regards!
>>>>>> Rui Li
>>>>>> 
>>>>> 
>>> 
>>> 
>> 


Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Yu Li <ca...@gmail.com>.
Thanks for monitoring the release progress and kindly reminding us Robert!

Minor: the below link shows the complete list of existing blockers:
https://issues.apache.org/jira/issues/?filter=12349334

Best Regards,
Yu


On Tue, 13 Oct 2020 at 03:03, Robert Metzger <rm...@apache.org> wrote:

> Hi all!
>
> According to the plan
> <https://cwiki.apache.org/confluence/display/FLINK/1.12+Release> discussed
> earlier in the release cycle, the feature freeze is expected to happen in
> the week of October 26th. That's in 2.5 weeks from now.
>
> I believe now is the time to discuss if we want to postpone the feature
> freeze.
> In my opinion, I would prefer to stick to the original schedule and rather
> delay features to the 1.13 release if they are not ready yet.
>
> From a stability perspective, we currently have the following situation:
> - 6 blockers:
> https://issues.apache.org/jira/browse/FLINK-19154?filter=12349334
> Most of them are making progress; I have notified people on those where
> the status is unclear.
> - 80 test instabilities:
> https://issues.apache.org/jira/browse/FLINK-18117?filter=12348580&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC
> - The CI system is a bit unstable these days: the e2e tests are often
> timing out. I will look into options to mitigate this.
>
>
>
> Drilling deeper into the test instabilities, these are some notable
> clusters of test instabilities  (with recent failures, usually more than
> once) [tests marked with >> have nobody assigned]
>
> E2E tests, probably all test infrastructure
> >> "Kerberized YARN per-job on Docker test" fails with "Could not start
> hadoop cluster." https://issues.apache.org/jira/browse/FLINK-18117
> >> SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1) failed
> due to download error https://issues.apache.org/jira/browse/FLINK-17424
> - "ES6 ElasticsearchSinkITCase unstable"
> https://issues.apache.org/jira/browse/FLINK-17159
> - "Avro Confluent Schema Registry nightly end-to-end test failed with
> "Register operation timed out; error code: 50002""
> https://issues.apache.org/jira/browse/FLINK-19422
> - "SQLClientHBaseITCase.testHBase fails on azure"
> https://issues.apache.org/jira/browse/FLINK-18570
>
> New Source API
> - "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable"
> https://issues.apache.org/jira/browse/FLINK-19427
> >> "CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs"
> https://issues.apache.org/jira/browse/FLINK-19448
> - "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent gets stuck"
> https://issues.apache.org/jira/browse/FLINK-19489
>
>
> Distributed Coordination
> - "LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with
> "NoResourceAvailableException: Could not allocate the required slot within
> slot request timeout"" https://issues.apache.org/jira/browse/FLINK-19237
> - "TaskExecutorSubmissionTest#testFailingScheduleOrUpdateConsumers"
> https://issues.apache.org/jira/browse/FLINK-17458
> - "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange
> times out" https://issues.apache.org/jira/browse/FLINK-19514
> - "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange:
> ZooKeeper unexpectedly modified"
> https://issues.apache.org/jira/browse/FLINK-19458
>
> Kafka
> >> "KafkaITCase failing with "Failed to send data to Kafka: This server
> does not host this topic-partition""
> https://issues.apache.org/jira/browse/FLINK-18444
> >> "KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
> expected:<310> but was:<0>"
> https://issues.apache.org/jira/browse/FLINK-17949
> - "KafkaITCase.testKeyValueSupport failure due to assertion error."
> https://issues.apache.org/jira/browse/FLINK-15745
> - "KafkaITCase.testStartFromGroupOffsets times out on azure"
> https://issues.apache.org/jira/browse/FLINK-18648
> - "FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis"
> https://issues.apache.org/jira/browse/FLINK-13733
>
>
>
> On Tue, Sep 29, 2020 at 11:49 AM Dian Fu <di...@gmail.com> wrote:
>
> > Hi all,
> >
> > I'd like to give an update on the status of the blocker issues and build
> > instabilities, as there is only one month left and the number of blocker
> > issues has increased a lot compared to last week.
> >
> > == Blockers:
> > https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334
> >
> > Currently there are 10 blocker issues
> > - 3 performance regression (
> > https://issues.apache.org/jira/browse/FLINK-19439,
> > https://issues.apache.org/jira/browse/FLINK-19440,
> > https://issues.apache.org/jira/browse/FLINK-19441)
> > - 3 Runtime (https://issues.apache.org/jira/browse/FLINK-19264,
> > https://issues.apache.org/jira/browse/FLINK-19388,
> > https://issues.apache.org/jira/browse/FLINK-19249)
> > - 1 HBase connector (https://issues.apache.org/jira/browse/FLINK-19445)
> > - 1 Application mode (https://issues.apache.org/jira/browse/FLINK-19154)
> > - 1 New source API (https://issues.apache.org/jira/browse/FLINK-19384)
> > - 1 Kinesis (https://issues.apache.org/jira/browse/FLINK-19332)
> >
> > == Recent notable build instabilities which still have no owners:
> > - New source API
> >    https://issues.apache.org/jira/browse/FLINK-19253
> >    SourceReaderTestBase.testAddSplitToExistingFetcher hangs
> >    https://issues.apache.org/jira/browse/FLINK-19370
> >    FileSourceTextLinesITCase.testContinuousTextFileSource failed as results
> >    mismatch
> >    https://issues.apache.org/jira/browse/FLINK-19427
> >    SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable
> >    https://issues.apache.org/jira/browse/FLINK-19437
> >    FileSourceTextLinesITCase.testContinuousTextFileSource failed with
> >    "SimpleStreamFormat is not splittable, but found split end (0) different
> >    from file length (198)"
> >    https://issues.apache.org/jira/browse/FLINK-19448
> >    CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs
> > - Runtime/Network
> >    https://issues.apache.org/jira/browse/FLINK-19426
> >    End-to-end test sometimes fails with PartitionConnectionException
> > - Unaligned Checkpoint
> >    https://issues.apache.org/jira/browse/FLINK-19027
> >    UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnParallelRemoteChannel
> >    failed because of test timeout
> > - Table
> >    https://issues.apache.org/jira/browse/FLINK-19340
> >    AggregateITCase.testListAggWithDistinct failed with "expected:<List(1,A,
> >    2,B, 3,C#A, 4,EF)> but was:<List(1,A, 2,B, 3,C#A, 4,EF#EF)>"
> > - HBase connector
> >    https://issues.apache.org/jira/browse/FLINK-18570
> >    SQLClientHBaseITCase.testHBase fails on azure
> >    https://issues.apache.org/jira/browse/FLINK-19447
> >    HBaseConnectorITCase.HBaseTestingClusterAutoStarter failed with "Master not
> >    initialized after 200000ms"
> > - Avro
> >    https://issues.apache.org/jira/browse/FLINK-19422
> >    Avro Confluent Schema Registry nightly end-to-end test failed with
> >    "Register operation timed out; error code: 50002"
> >
> > Regards,
> > Dian
> >
> > > On Sep 21, 2020, at 2:32 PM, Robert Metzger <rm...@apache.org> wrote:
> > >
> > > Hi all,
> > >
> > > An update on the release status:
> > > 1. We have 35 days = *5 weeks left until feature freeze*
> > > 2. There are currently 2 blockers for Flink
> > > <https://issues.apache.org/jira/browse/FLINK-19264?filter=12349334>, all
> > > making progress
> > > 3. We have 72 test instabilities
> > > <https://issues.apache.org/jira/browse/FLINK-19237> (down 7 from 2 weeks
> > > ago). I have pinged people to help address frequent or critical issues.
> > >
> > > Best,
> > > Robert
> > >
> > >
> > > On Mon, Sep 7, 2020 at 10:37 AM Robert Metzger <rm...@apache.org> wrote:
> > >
> > >> Hi all,
> > >>
> > >> another two weeks have passed. We now have 5 blockers
> > >> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334> (up
> > >> 3 from 2 weeks ago), but they are all making progress.
> > >>
> > >> We currently have 79 test-instabilities
> > >> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>.
> > >> Since the last report, a few have been resolved, and some others have been
> > >> added.
> > >> I have checked the tickets, closed some old ones and pinged people to help
> > >> resolve new or frequent ones.
> > >> Except for Kafka, there are no major clusters of test instabilities. Most
> > >> failures come from tests that fail only rarely, spread across the entire
> > >> system.
> > >>
> > >>
> > >> On Tue, Aug 25, 2020 at 9:05 AM Rui Li <li...@gmail.com> wrote:
> > >>
> > >>> Thanks Dian for the pointer. I'll take a look.
> > >>>
> > >>> On Tue, Aug 25, 2020 at 3:02 PM Dian Fu <di...@gmail.com> wrote:
> > >>>
> > >>>> Thanks Rui for the info. This issue(hive related)
> > >>>> https://issues.apache.org/jira/browse/FLINK-19025 <
> > >>>> https://issues.apache.org/jira/browse/FLINK-19025> is marked as a
> > >>> blocker.
> > >>>>
> > >>>> Regards,
> > >>>> Dian
> > >>>>
> > >>>>> On Aug 25, 2020, at 2:58 PM, Rui Li <li...@gmail.com> wrote:
> > >>>>>
> > >>>>> Hi Dian,
> > >>>>>
> > >>>>> FLINK-18682 has been fixed. Is there any other blocker in the hive
> > >>>>> connector?
> > >>>>>
> > >>>>> On Tue, Aug 25, 2020 at 2:41 PM Dian Fu <dian0511.fu@gmail.com> wrote:
> > >>>>>
> > >>>>>> Hi all,
> > >>>>>>
> > >>>>>> Two weeks have passed and it seems that none of the test
> > >>>>>> stabilities issues have been addressed since then.
> > >>>>>>
> > >>>>>> Here is an updated status report of Blockers and Test
> > >>>>>> instabilities:
> > >>>>>>
> > >>>>>> Blockers
> > >>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> > >>>>>> Currently 2 blockers (1x Hive, 1x CI Infra)
> > >>>>>>
> > >>>>>> Test-Instabilities
> > >>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>:
> > >>>>>> (total 80)
> > >>>>>>
> > >>>>>> Besides the issues already posted in previous mail, here are the
> > >>>>>> new instability issues which should be taken care of:
> > >>>>>>
> > >>>>>> - FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
> > >>>>>> E2E test fails with "Cannot register Closeable, this
> > >>>>>> subtaskCheckpointCoordinator is already closed. Closing argument."
> > >>>>>>
> > >>>>>> -> This is a new issue occurred recently. It has occurred several
> > >>>>>> times and may indicate a bug somewhere and should be taken care of.
> > >>>>>>
> > >>>>>> - FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
> > >>>>>> FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
> > >>>>>>
> > >>>>>> -> There is already a PR for it and needs review.
> > >>>>>>
> > >>>>>> - FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
> > >>>>>> e2e test failed to download "localhost:9999/flink.tgz" in
> > >>>>>> "Wordcount on Docker test"
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rm...@apache.org> wrote:
> > >>>>>>>
> > >>>>>>> Hi team,
> > >>>>>>>
> > >>>>>>> 2 weeks have passed since the last update. None of the test
> > >>>>>>> stabilities I've mentioned have been addressed since then.
> > >>>>>>>
> > >>>>>>> Here's an updated status report of Blockers and Test
> > >>>>>>> instabilities:
> > >>>>>>>
> > >>>>>>> Blockers
> > >>>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> > >>>>>>> Currently 3 blockers (2x Hive, 1x CI Infra)
> > >>>>>>>
> > >>>>>>> Test-Instabilities
> > >>>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>
> > >>>>>>> (total 79) which failed recently or frequently:
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
> > >>>>>>> FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
> > >>>>>>> failed with "Timeout expired after 60000milliseconds while
> > >>>>>>> awaiting EndTxn(COMMIT)"
> > >>>>>>>
> > >>>>>>> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
> > >>>>>>> FlinkKafkaProducerITCase.testRecoverCommittedTransaction
> > >>>>>>> failed with "Timeout expired after 60000milliseconds while
> > >>>>>>> awaiting InitProducerId"
> > >>>>>>>
> > >>>>>>> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
> > >>>>>>> FlinkKafkaProducerITCase
> > >>>>>>> testScaleUpAfterScalingDown Timeout expired while initializing
> > >>>>>>> transactional state in 60000ms.
> > >>>>>>>
> > >>>>>>> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
> > >>>>>>> FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
> > >>>>>>>
> > >>>>>>> --> The first three tickets seem related.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
> > >>>>>>> StreamingKafkaITCase failure on Azure
> > >>>>>>>
> > >>>>>>> --> This one seems really hard to reproduce
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
> > >>>>>>> HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
> > >>>>>>> hangs
> > >>>>>>>
> > >>>>>>> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
> > >>>>>>> HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
> > >>>>>>> produced no output for 900 seconds
> > >>>>>>>
> > >>>>>>> --> nobody seems to feel responsible for these tickets. My guess
> > >>>>>>> is that the S3 connector should have shorter timeouts / faster
> > >>>>>>> retries to finish within the 15 minutes test timeout. OR there is
> > >>>>>>> really something wrong with the code.
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
> > >>>>>>> UnsignedTypeConversionITCase failed caused by MariaDB4j
> > >>>>>>> "Asked to waitFor Program"
> > >>>>>>>
> > >>>>>>> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159>
> > >>>>>>> ES6 ElasticsearchSinkITCase unstable
> > >>>>>>>
> > >>>>>>> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
> > >>>>>>> KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
> > >>>>>>> expected:<310> but was:<0>
> > >>>>>>>
> > >>>>>>> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222>
> > >>>>>>> "Avro Confluent Schema Registry nightly end-to-end test" unstable
> > >>>>>>> with "Kafka cluster did not start after 120 seconds"
> > >>>>>>>
> > >>>>>>> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
> > >>>>>>> "RocksDB Memory Management end-to-end test" fails with "Current
> > >>>>>>> block cache usage 202123272 larger than expected memory limit
> > >>>>>>> 200000000"
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>>
> > >>>>>>> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rmetzger@apache.org>
> > >>>>>>> wrote:
> > >>>>>>>
> > >>>>>>>> Hi team,
> > >>>>>>>>
> > >>>>>>>> We would like to use this thread as a permanent thread for
> > >>>>>>>> regularly syncing on stale blockers (need to have somebody
> > >>>>>>>> assigned within a week and progress, or a good plan) and build
> > >>>>>>>> instabilities (need to check if its a blocker).
> > >>>>>>>>
> > >>>>>>>> Recent test-instabilities:
> > >>>>>>>>
> > >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
> > >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test
> > >>>>>>>> unstable)
> > >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test
> > >>>>>>>> unstable)
> > >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17949
> > >>>>>>>> (KafkaShuffleITCase)
> > >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka
> > >>>>>>>> transactions)
> > >>>>>>>>
> > >>>>>>>>
> > >>>>>>>> It would be nice if the committers taking care of these
> > >>>>>>>> components could look into the test failures.
> > >>>>>>>> If nothing happens, we'll personally reach out to people I
> > >>>>>>>> believe they could look into the ticket.
> > >>>>>>>>
> > >>>>>>>> Best,
> > >>>>>>>> Dian & Robert
> > >>>>>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>
> > >>>>> --
> > >>>>> Best regards!
> > >>>>> Rui Li
> > >>>>
> > >>>>
> > >>>
> > >>> --
> > >>> Best regards!
> > >>> Rui Li
> > >>>
> > >>
> >
> >
>

Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Robert Metzger <rm...@apache.org>.
Hi all!

According to the plan
<https://cwiki.apache.org/confluence/display/FLINK/1.12+Release> discussed
earlier in the release cycle, the feature freeze is expected to happen in
the week of October 26th. That's 2.5 weeks from now.

I believe now is the time to discuss whether we want to postpone the
feature freeze.
I would prefer to stick to the original schedule and rather delay features
to the 1.13 release if they are not ready yet.

From a stability perspective, we currently have the following situation:
- 6 blockers:
https://issues.apache.org/jira/browse/FLINK-19154?filter=12349334
Most of them are making progress; I have notified people on those where the
status is unclear.
- 80 test instabilities:
https://issues.apache.org/jira/browse/FLINK-18117?filter=12348580&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved%20AND%20labels%20%3D%20test-stability%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC
- The CI system is a bit unstable these days: The e2e tests are often
timing out. I will look into options to mitigate this.
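
For readers who want to see what the test-instability link above actually
queries: a minimal sketch that percent-decodes the `jql` parameter of that
URL. The helper name decode_jql is made up for illustration; it uses only
the Python standard library and does not contact Jira.

```python
from urllib.parse import urlsplit, parse_qs

# The test-instability link from the status update above, split for readability.
URL = (
    "https://issues.apache.org/jira/browse/FLINK-18117"
    "?filter=12348580"
    "&jql=project%20%3D%20FLINK%20AND%20resolution%20%3D%20Unresolved"
    "%20AND%20labels%20%3D%20test-stability"
    "%20ORDER%20BY%20updated%20DESC%2C%20created%20DESC"
)


def decode_jql(url: str) -> str:
    """Return the percent-decoded `jql` query parameter of a Jira URL.

    Hypothetical helper for illustration only; it just parses the URL string.
    """
    query = parse_qs(urlsplit(url).query)
    return query["jql"][0]


if __name__ == "__main__":
    # Prints the human-readable JQL behind the link.
    print(decode_jql(URL))
```

In other words, the list covers all unresolved FLINK tickets labeled
test-stability, most recently updated first.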



Drilling deeper into the test instabilities, these are some notable
clusters (with recent failures, usually more than once). Tests marked
with >> have nobody assigned.

E2E tests, probably all test infrastructure
>> "Kerberized YARN per-job on Docker test" fails with "Could not start
hadoop cluster." https://issues.apache.org/jira/browse/FLINK-18117
>> SQL Client end-to-end test (Old planner) Elasticsearch (v7.5.1) failed
due to download error https://issues.apache.org/jira/browse/FLINK-17424
- "ES6 ElasticsearchSinkITCase unstable"
https://issues.apache.org/jira/browse/FLINK-17159
- "Avro Confluent Schema Registry nightly end-to-end test failed with
"Register operation timed out; error code: 50002""
https://issues.apache.org/jira/browse/FLINK-19422
- "SQLClientHBaseITCase.testHBase fails on azure"
https://issues.apache.org/jira/browse/FLINK-18570

New Source API
- "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable"
https://issues.apache.org/jira/browse/FLINK-19427
>> "CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs"
https://issues.apache.org/jira/browse/FLINK-19448
- "SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent gets stuck"
https://issues.apache.org/jira/browse/FLINK-19489


Distributed Coordination
- "LeaderChangeClusterComponentsTest.testReelectionOfJobMaster failed with
"NoResourceAvailableException: Could not allocate the required slot within
slot request timeout"" https://issues.apache.org/jira/browse/FLINK-19237
- "TaskExecutorSubmissionTest#testFailingScheduleOrUpdateConsumers"
https://issues.apache.org/jira/browse/FLINK-17458
- "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange
times out" https://issues.apache.org/jira/browse/FLINK-19514
- "ZooKeeperLeaderElectionITCase.testJobExecutionOnClusterWithLeaderChange:
ZooKeeper unexpectedly modified"
https://issues.apache.org/jira/browse/FLINK-19458

Kafka
>> "KafkaITCase failing with "Failed to send data to Kafka: This server
does not host this topic-partition""
https://issues.apache.org/jira/browse/FLINK-18444
>> "KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
expected:<310> but was:<0>"
https://issues.apache.org/jira/browse/FLINK-17949
- "KafkaITCase.testKeyValueSupport failure due to assertion error."
https://issues.apache.org/jira/browse/FLINK-15745
- "KafkaITCase.testStartFromGroupOffsets times out on azure"
https://issues.apache.org/jira/browse/FLINK-18648
- "FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis"
https://issues.apache.org/jira/browse/FLINK-13733



On Tue, Sep 29, 2020 at 11:49 AM Dian Fu <di...@gmail.com> wrote:

> Hi all,
>
> I'd like to update the status about the blocker issues and build
> instabilities as there is only one month left and the number of blocker
> issues increases a lot compared to last week.
>
> == Blockers:
> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334 <
> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>
>
> Currently there are 10 blocker issues
> - 3 performance regression (
> https://issues.apache.org/jira/browse/FLINK-19439 <
> https://issues.apache.org/jira/browse/FLINK-19439>,
> https://issues.apache.org/jira/browse/FLINK-19440 <
> https://issues.apache.org/jira/browse/FLINK-19440>,
> https://issues.apache.org/jira/browse/FLINK-19441 <
> https://issues.apache.org/jira/browse/FLINK-19441>)
> - 3 Runtime (https://issues.apache.org/jira/browse/FLINK-19264 <
> https://issues.apache.org/jira/browse/FLINK-19264>,
> https://issues.apache.org/jira/browse/FLINK-19388 <
> https://issues.apache.org/jira/browse/FLINK-19388>,
> https://issues.apache.org/jira/browse/FLINK-19249 <
> https://issues.apache.org/jira/browse/FLINK-19249>)
> - 1 HBase connector (https://issues.apache.org/jira/browse/FLINK-19445 <
> https://issues.apache.org/jira/browse/FLINK-19445>)
> - 1 Application mode (https://issues.apache.org/jira/browse/FLINK-19154 <
> https://issues.apache.org/jira/browse/FLINK-19154>)
> - 1 New source API (https://issues.apache.org/jira/browse/FLINK-19384 <
> https://issues.apache.org/jira/browse/FLINK-19384>)
> - 1 Kinesis (https://issues.apache.org/jira/browse/FLINK-19332 <
> https://issues.apache.org/jira/browse/FLINK-19332>)
>
> == Recent notable build instabilities which still have no owners:
> - New source API
>    https://issues.apache.org/jira/browse/FLINK-19253 <
> https://issues.apache.org/jira/browse/FLINK-19253>
> SourceReaderTestBase.testAddSplitToExistingFetcher hangs
>    https://issues.apache.org/jira/browse/FLINK-19370 <
> https://issues.apache.org/jira/browse/FLINK-19370>
> FileSourceTextLinesITCase.testContinuousTextFileSource failed as results
> mismatch
>    https://issues.apache.org/jira/browse/FLINK-19427 <
> https://issues.apache.org/jira/browse/FLINK-19427>
> SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable,
>    https://issues.apache.org/jira/browse/FLINK-19437 <
> https://issues.apache.org/jira/browse/FLINK-19437>
> FileSourceTextLinesITCase.testContinuousTextFileSource failed with
> "SimpleStreamFormat is not splittable, but found split end (0) different
> from file length (198)"
>    https://issues.apache.org/jira/browse/FLINK-19448 <
> https://issues.apache.org/jira/browse/FLINK-19448>
> CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs
> - Runtime/Network
>    https://issues.apache.org/jira/browse/FLINK-19426 <
> https://issues.apache.org/jira/browse/FLINK-19426>  End-to-end test
> sometimes fails with PartitionConnectionException
> - Unaligned Checkpoint
>    https://issues.apache.org/jira/browse/FLINK-19027 <
> https://issues.apache.org/jira/browse/FLINK-19027>
> UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnParallelRemoteChannel
> failed because of test timeout
> - Table
>    https://issues.apache.org/jira/browse/FLINK-19340 <
> https://issues.apache.org/jira/browse/FLINK-19340>
> AggregateITCase.testListAggWithDistinct failed with "expected:<List(1,A,
> 2,B, 3,C#A, 4,EF)> but was:<List(1,A, 2,B, 3,C#A, 4,EF#EF)>"
> - HBase connector
>    https://issues.apache.org/jira/browse/FLINK-18570 <
> https://issues.apache.org/jira/browse/FLINK-18570>
> SQLClientHBaseITCase.testHBase fails on azure
>     https://issues.apache.org/jira/browse/FLINK-19447 <
> https://issues.apache.org/jira/browse/FLINK-19447>
> HBaseConnectorITCase.HBaseTestingClusterAutoStarter failed with "Master not
> initialized after 200000ms"
> - Avro
>    https://issues.apache.org/jira/browse/FLINK-19422 <
> https://issues.apache.org/jira/browse/FLINK-19422>  Avro Confluent Schema
> Registry nightly end-to-end test failed with "Register operation timed out;
> error code: 50002"
>
> Regards,
> Dian
>
> > On Sep 21, 2020, at 2:32 PM, Robert Metzger <rm...@apache.org> wrote:
> >
> > Hi all,
> >
> > An update on the release status:
> > 1. We have 35 days = *5 weeks left until feature freeze*
> > 2. There are currently 2 blockers for Flink
> > <https://issues.apache.org/jira/browse/FLINK-19264?filter=12349334>, all
> > making progress
> > 3. We have 72 test instabilities
> > <https://issues.apache.org/jira/browse/FLINK-19237> (down 7 from 2 weeks
> > ago). I have pinged people to help addressing frequent or critical
> issues.
> >
> > Best,
> > Robert
> >
> >
> > On Mon, Sep 7, 2020 at 10:37 AM Robert Metzger <rm...@apache.org>
> wrote:
> >
> >> Hi all,
> >>
> >> another two weeks have passed. We now have 5 blockers
> >> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334> (Up
> >> 3 from 2 weeks ago), but they are all making progress.
> >>
> >> We currently have 79 test-instabilities
> >> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>,
> >> since the last report, a few have been resolved, and some others have
> been
> >> added.
> >> I have checked the tickets, closed some old ones and pinged people to
> help
> >> resolve new or frequent ones.
> >> Except for Kafka, there are no major clusters of test instabilities.
> Most
> >> failures are rarely failing tests across the entire system.
> >>
> >>
> >> On Tue, Aug 25, 2020 at 9:05 AM Rui Li <li...@gmail.com> wrote:
> >>
> >>> Thanks Dian for the pointer. I'll take a look.
> >>>
> >>> On Tue, Aug 25, 2020 at 3:02 PM Dian Fu <di...@gmail.com> wrote:
> >>>
> >>>> Thanks Rui for the info. This issue(hive related)
> >>>> https://issues.apache.org/jira/browse/FLINK-19025 <
> >>>> https://issues.apache.org/jira/browse/FLINK-19025> is marked as a
> >>> blocker.
> >>>>
> >>>> Regards,
> >>>> Dian
> >>>>
> >>>>> On Aug 25, 2020, at 2:58 PM, Rui Li <li...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Dian,
> >>>>>
> >>>>> FLINK-18682 has been fixed. Is there any other blocker in the hive
> >>>>> connector?
> >>>>>
> >>>>> On Tue, Aug 25, 2020 at 2:41 PM Dian Fu <dian0511.fu@gmail.com> wrote:
> >>>>>
> >>>>>> Hi all,
> >>>>>>
> >>>>>> Two weeks have passed and it seems that none of the test stabilities
> >>>>>> issues have been addressed since then.
> >>>>>>
> >>>>>> Here is an updated status report of Blockers and Test instabilities:
> >>>>>>
> >>>>>> Blockers
> >>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> >>>>>> Currently 2 blockers (1x Hive, 1x CI Infra)
> >>>>>>
> >>>>>> Test-Instabilities
> >>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>:
> >>>>>> (total 80)
> >>>>>>
> >>>>>> Besides the issues already posted in previous mail, here are the new
> >>>>>> instability issues which should be taken care of:
> >>>>>>
> >>>>>> - FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
> >>>>>> E2E test fails with "Cannot register Closeable, this
> >>>>>> subtaskCheckpointCoordinator is already closed. Closing argument."
> >>>>>>
> >>>>>> -> This is a new issue occurred recently. It has occurred several
> >>>>>> times and may indicate a bug somewhere and should be taken care of.
> >>>>>>
> >>>>>> - FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
> >>>>>> FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
> >>>>>>
> >>>>>> -> There is already a PR for it and needs review.
> >>>>>>
> >>>>>> - FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
> >>>>>> e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount
> >>>>>> on Docker test"
> >>>>>>
> >>>>>>
> >>>>>>> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rm...@apache.org> wrote:
> >>>>>>>
> >>>>>>> Hi team,
> >>>>>>>
> >>>>>>> 2 weeks have passed since the last update. None of the test
> >>> stabilities
> >>>>>>> I've mentioned have been addressed since then.
> >>>>>>>
> >>>>>>> Here's an updated status report of Blockers and Test instabilities:
> >>>>>>>
> >>>>>>> Blockers <
> >>>>>> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> >>>>>>> Currently 3 blockers (2x Hive, 1x CI Infra)
> >>>>>>>
> >>>>>>> Test-Instabilities
> >>>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>
> >>>>>>> (total 79) which failed recently or frequently:
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
> >>>>>>> FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
> >>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
> >>>>>>> EndTxn(COMMIT)"
> >>>>>>>
> >>>>>>> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
> >>>>>>> FlinkKafkaProducerITCase.testRecoverCommittedTransaction
> >>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
> >>>>>>> InitProducerId"
> >>>>>>>
> >>>>>>> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
> >>>>>>> FlinkKafkaProducerITCase
> >>>>>>> testScaleUpAfterScalingDown Timeout expired while initializing
> >>>>>>> transactional state in 60000ms.
> >>>>>>>
> >>>>>>> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
> >>>>>>> FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
> >>>>>>>
> >>>>>>> --> The first three tickets seem related.
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
> >>>>>>> StreamingKafkaITCase failure on Azure
> >>>>>>>
> >>>>>>> --> This one seems really hard to reproduce
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
> >>>>>>> HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
> >>>>>>> hangs
> >>>>>>>
> >>>>>>> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
> >>>>>>> HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
> >>>>>>> produced no output for 900 seconds
> >>>>>>>
> >>>>>>> --> nobody seems to feel responsible for these tickets. My guess is
> >>>> that
> >>>>>>> the S3 connector should have shorter timeouts / faster retries to
> >>>> finish
> >>>>>>> within the 15 minutes test timeout. OR there is really something
> >>> wrong
> >>>>>> with
> >>>>>>> the code.
> >>>>>>>
> >>>>>>>
> >>>>>>> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
> >>>>>>> UnsignedTypeConversionITCase failed caused by MariaDB4j
> >>>>>>> "Asked to waitFor Program"
> >>>>>>>
> >>>>>>> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159>
> >>>>>>> ES6 ElasticsearchSinkITCase unstable
> >>>>>>>
> >>>>>>> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
> >>>>>>> KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
> >>>>>>> expected:<310> but was:<0>
> >>>>>>>
> >>>>>>> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222>
> >>>> "Avro
> >>>>>>> Confluent Schema Registry nightly end-to-end test" unstable with
> >>> "Kafka
> >>>>>>> cluster did not start after 120 seconds"
> >>>>>>>
> >>>>>>> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
> >>>>>> "RocksDB
> >>>>>>> Memory Management end-to-end test" fails with "Current block cache
> >>>> usage
> >>>>>>> 202123272 larger than expected memory limit 200000000"
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>>
> >>>>>>> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rmetzger@apache.org>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi team,
> >>>>>>>>
> >>>>>>>> We would like to use this thread as a permanent thread for
> >>>>>>>> regularly syncing on stale blockers (need to have somebody
> assigned
> >>>>>> within
> >>>>>>>> a week and progress, or a good plan) and build instabilities (need
> >>> to
> >>>>>> check
> >>>>>>>> if its a blocker).
> >>>>>>>>
> >>>>>>>> Recent test-instabilities:
> >>>>>>>>
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test
> >>>>>> unstable)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test
> >>>>>> unstable)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17949
> >>>>>>>> (KafkaShuffleITCase)
> >>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka
> >>>>>>>> transactions)
> >>>>>>>>
> >>>>>>>>
> >>>>>>>> It would be nice if the committers taking care of these components
> >>>> could
> >>>>>>>> look into the test failures.
> >>>>>>>> If nothing happens, we'll personally reach out to people I believe
> >>>> they
> >>>>>>>> could look into the ticket.
> >>>>>>>>
> >>>>>>>> Best,
> >>>>>>>> Dian & Robert
> >>>>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --
> >>>>> Best regards!
> >>>>> Rui Li
> >>>>
> >>>>
> >>>
> >>> --
> >>> Best regards!
> >>> Rui Li
> >>>
> >>
>
>

Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Dian Fu <di...@gmail.com>.
Hi all,

I'd like to give an update on the status of the blocker issues and build instabilities, as there is only one month left and the number of blocker issues has increased a lot compared to last week.

== Blockers: https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334

Currently there are 10 blocker issues:
- 3 performance regressions (https://issues.apache.org/jira/browse/FLINK-19439, https://issues.apache.org/jira/browse/FLINK-19440, https://issues.apache.org/jira/browse/FLINK-19441)
- 3 Runtime (https://issues.apache.org/jira/browse/FLINK-19264, https://issues.apache.org/jira/browse/FLINK-19388, https://issues.apache.org/jira/browse/FLINK-19249)
- 1 HBase connector (https://issues.apache.org/jira/browse/FLINK-19445)
- 1 Application mode (https://issues.apache.org/jira/browse/FLINK-19154)
- 1 New source API (https://issues.apache.org/jira/browse/FLINK-19384)
- 1 Kinesis (https://issues.apache.org/jira/browse/FLINK-19332)

== Recent notable build instabilities which still have no owners:
- New source API
   https://issues.apache.org/jira/browse/FLINK-19253  SourceReaderTestBase.testAddSplitToExistingFetcher hangs
   https://issues.apache.org/jira/browse/FLINK-19370  FileSourceTextLinesITCase.testContinuousTextFileSource failed as results mismatch
   https://issues.apache.org/jira/browse/FLINK-19427  SplitFetcherTest.testNotifiesWhenGoingIdleConcurrent is instable
   https://issues.apache.org/jira/browse/FLINK-19437  FileSourceTextLinesITCase.testContinuousTextFileSource failed with "SimpleStreamFormat is not splittable, but found split end (0) different from file length (198)"
   https://issues.apache.org/jira/browse/FLINK-19448  CoordinatedSourceITCase.testEnumeratorReaderCommunication hangs
- Runtime/Network
   https://issues.apache.org/jira/browse/FLINK-19426  End-to-end test sometimes fails with PartitionConnectionException
- Unaligned Checkpoint
   https://issues.apache.org/jira/browse/FLINK-19027  UnalignedCheckpointITCase.shouldPerformUnalignedCheckpointOnParallelRemoteChannel failed because of test timeout
- Table
   https://issues.apache.org/jira/browse/FLINK-19340  AggregateITCase.testListAggWithDistinct failed with "expected:<List(1,A, 2,B, 3,C#A, 4,EF)> but was:<List(1,A, 2,B, 3,C#A, 4,EF#EF)>"
- HBase connector
   https://issues.apache.org/jira/browse/FLINK-18570  SQLClientHBaseITCase.testHBase fails on azure
   https://issues.apache.org/jira/browse/FLINK-19447  HBaseConnectorITCase.HBaseTestingClusterAutoStarter failed with "Master not initialized after 200000ms"
- Avro
   https://issues.apache.org/jira/browse/FLINK-19422  Avro Confluent Schema Registry nightly end-to-end test failed with "Register operation timed out; error code: 50002"

Regards,
Dian

> On Sep 21, 2020, at 2:32 PM, Robert Metzger <rm...@apache.org> wrote:
> 
> Hi all,
> 
> An update on the release status:
> 1. We have 35 days = *5 weeks left until feature freeze*
> 2. There are currently 2 blockers for Flink
> <https://issues.apache.org/jira/browse/FLINK-19264?filter=12349334>, all
> making progress
> 3. We have 72 test instabilities
> <https://issues.apache.org/jira/browse/FLINK-19237> (down 7 from 2 weeks
> ago). I have pinged people to help addressing frequent or critical issues.
> 
> Best,
> Robert
> 
> 
> On Mon, Sep 7, 2020 at 10:37 AM Robert Metzger <rm...@apache.org> wrote:
> 
>> Hi all,
>> 
>> Another two weeks have passed. We now have 5 blockers
>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334> (up
>> 3 from 2 weeks ago), but they are all making progress.
>> 
>> We currently have 79 test instabilities
>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>.
>> Since the last report, a few have been resolved and some others have been
>> added.
>> I have checked the tickets, closed some old ones and pinged people to help
>> resolve new or frequent ones.
>> Except for Kafka, there are no major clusters of test instabilities. Most
>> failures come from tests that fail only rarely, spread across the entire
>> system.
>> 
>> 
>> On Tue, Aug 25, 2020 at 9:05 AM Rui Li <li...@gmail.com> wrote:
>> 
>>> Thanks Dian for the pointer. I'll take a look.
>>> 
>>> On Tue, Aug 25, 2020 at 3:02 PM Dian Fu <di...@gmail.com> wrote:
>>> 
>>>> Thanks Rui for the info. This Hive-related issue,
>>>> https://issues.apache.org/jira/browse/FLINK-19025, is marked as a
>>>> blocker.
>>>> 
>>>> Regards,
>>>> Dian
>>>> 
>>>>> On Aug 25, 2020, at 2:58 PM, Rui Li <li...@gmail.com> wrote:
>>>>> 
>>>>> Hi Dian,
>>>>> 
>>>>> FLINK-18682 has been fixed. Is there any other blocker in the hive
>>>>> connector?
>>>>> 
>>>>> On Tue, Aug 25, 2020 at 2:41 PM Dian Fu <dian0511.fu@gmail.com> wrote:
>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> Two weeks have passed and it seems that none of the test instability
>>>>>> issues have been addressed since then.
>>>>>> 
>>>>>> Here is an updated status report of Blockers and Test instabilities:
>>>>>> 
>>>>>> Blockers
>>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
>>>>>> Currently 2 blockers (1x Hive, 1x CI Infra)
>>>>>> 
>>>>>> Test-Instabilities
>>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>:
>>>>>> (total 80)
>>>>>> 
>>>>>> Besides the issues already posted in the previous mail, here are the new
>>>>>> instability issues which should be taken care of:
>>>>>> 
>>>>>> - FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
>>>>>> E2E test fails with "Cannot register Closeable, this
>>>>>> subtaskCheckpointCoordinator is already closed. Closing argument."
>>>>>> 
>>>>>> -> This is a new issue that appeared recently. It has occurred several
>>>>>> times, may indicate a bug somewhere, and should be taken care of.
>>>>>> 
>>>>>> - FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
>>>>>> FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
>>>>>> 
>>>>>> -> There is already a PR for it, which needs review.
>>>>>> 
>>>>>> - FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
>>>>>> e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount on
>>>>>> Docker test"
>>>>>> 
>>>>>> 
>>>>>>> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rm...@apache.org> wrote:
>>>>>>> 
>>>>>>> Hi team,
>>>>>>> 
>>>>>>> 2 weeks have passed since the last update. None of the test
>>>>>>> instabilities I've mentioned have been addressed since then.
>>>>>>> 
>>>>>>> Here's an updated status report of Blockers and Test instabilities:
>>>>>>> 
>>>>>>> Blockers
>>>>>>> <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
>>>>>>> Currently 3 blockers (2x Hive, 1x CI Infra)
>>>>>>> 
>>>>>>> Test-Instabilities
>>>>>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>
>>>>>>> (total 79) which failed recently or frequently:
>>>>>>> 
>>>>>>> 
>>>>>>> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
>>>>>>> FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
>>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
>>>>>>> EndTxn(COMMIT)"
>>>>>>> 
>>>>>>> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
>>>>>>> FlinkKafkaProducerITCase.testRecoverCommittedTransaction
>>>>>>> failed with "Timeout expired after 60000milliseconds while awaiting
>>>>>>> InitProducerId"
>>>>>>> 
>>>>>>> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
>>>>>>> FlinkKafkaProducerITCase
>>>>>>> testScaleUpAfterScalingDown Timeout expired while initializing
>>>>>>> transactional state in 60000ms.
>>>>>>> 
>>>>>>> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
>>>>>>> FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
>>>>>>> 
>>>>>>> --> The first three tickets seem related.
>>>>>>> 
>>>>>>> 
>>>>>>> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
>>>>>>> StreamingKafkaITCase failure on Azure
>>>>>>> 
>>>>>>> --> This one seems really hard to reproduce
>>>>>>> 
>>>>>>> 
>>>>>>> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
>>>>>>> HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
>>>>>>> hangs
>>>>>>> 
>>>>>>> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
>>>>>>> HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
>>>>>>> produced no output for 900 seconds
>>>>>>> 
>>>>>>> --> Nobody seems to feel responsible for these tickets. My guess is that
>>>>>>> the S3 connector should have shorter timeouts / faster retries to finish
>>>>>>> within the 15-minute test timeout, or there is really something wrong
>>>>>>> with the code.
>>>>>>> 
>>>>>>> 
>>>>>>> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
>>>>>>> UnsignedTypeConversionITCase failed caused by MariaDB4j
>>>>>>> "Asked to waitFor Program"
>>>>>>> 
>>>>>>> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159> ES6
>>>>>>> ElasticsearchSinkITCase unstable
>>>>>>> 
>>>>>>> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
>>>>>>> KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
>>>>>>> expected:<310> but was:<0>
>>>>>>> 
>>>>>>> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222>
>>>>>>> "Avro Confluent Schema Registry nightly end-to-end test" unstable with
>>>>>>> "Kafka cluster did not start after 120 seconds"
>>>>>>> 
>>>>>>> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
>>>>>>> "RocksDB Memory Management end-to-end test" fails with "Current block
>>>>>>> cache usage 202123272 larger than expected memory limit 200000000"
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rmetzger@apache.org>
>>>>>>> wrote:
>>>>>>> 
>>>>>>>> Hi team,
>>>>>>>> 
>>>>>>>> We would like to use this thread as a permanent thread for
>>>>>>>> regularly syncing on stale blockers (need to have somebody assigned
>>>>>>>> within a week and progress, or a good plan) and build instabilities
>>>>>>>> (need to check if it's a blocker).
>>>>>>>> 
>>>>>>>> Recent test-instabilities:
>>>>>>>> 
>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test unstable)
>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test unstable)
>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-17949 (KafkaShuffleITCase)
>>>>>>>> - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka transactions)
>>>>>>>> 
>>>>>>>> 
>>>>>>>> It would be nice if the committers taking care of these components
>>>>>>>> could look into the test failures.
>>>>>>>> If nothing happens, we'll personally reach out to people we believe
>>>>>>>> could look into the ticket.
>>>>>>>> 
>>>>>>>> Best,
>>>>>>>> Dian & Robert
>>>>>>>> 
>>>>>> 
>>>>>> 
>>>>> 
>>>>> --
>>>>> Best regards!
>>>>> Rui Li
>>>> 
>>>> 
>>> 
>>> --
>>> Best regards!
>>> Rui Li
>>> 
>> 


Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Robert Metzger <rm...@apache.org>.
Hi all,

An update on the release status:
1. We have 35 days = *5 weeks left until feature freeze*
2. There are currently 2 blockers for Flink
<https://issues.apache.org/jira/browse/FLINK-19264?filter=12349334>, both
making progress.
3. We have 72 test instabilities
<https://issues.apache.org/jira/browse/FLINK-19237> (down 7 from 2 weeks
ago). I have pinged people to help address frequent or critical issues.

Best,
Robert



Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Robert Metzger <rm...@apache.org>.
Hi all,

Another two weeks have passed. We now have 5 blockers
<https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334> (up
3 from 2 weeks ago), but they are all making progress.

We currently have 79 test instabilities
<https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>. Since
the last report, a few have been resolved and some others have been added.
I have checked the tickets, closed some old ones and pinged people to help
resolve new or frequent ones.
Except for Kafka, there are no major clusters of test instabilities. Most
failures come from tests that fail only rarely, spread across the entire
system.



Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Rui Li <li...@gmail.com>.
Thanks Dian for the pointer. I'll take a look.


-- 
Best regards!
Rui Li

Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Dian Fu <di...@gmail.com>.
Thanks Rui for the info. This Hive-related issue, https://issues.apache.org/jira/browse/FLINK-19025, is marked as a blocker.

Regards,
Dian

> On Aug 25, 2020, at 2:58 PM, Rui Li <li...@gmail.com> wrote:
> 
> Hi Dian,
> 
> FLINK-18682 has been fixed. Is there any other blocker in the hive
> connector?
> 
> On Tue, Aug 25, 2020 at 2:41 PM Dian Fu <di...@gmail.com> wrote:
> 
>> Hi all,
>> 
>> Two weeks have passed and it seems that none of the test instability
>> issues have been addressed since then.
>> 
>> Here is an updated status report of Blockers and Test instabilities:
>> 
>> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
>> Currently 2 blockers (1x Hive, 1x CI Infra)
>> 
>> Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>:
>> (total 80)
>> 
>> Besides the issues already posted in previous mail, here are the new
>> instability issues which should be taken care of:
>> 
>> - FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
>> E2E test fails with "Cannot register Closeable, this
>> subtaskCheckpointCoordinator is already closed. Closing argument."
>> 
>> -> This is a new issue occurred recently. It has occurred several times
>> and may indicate a bug somewhere and should be taken care of.
>> 
>> - FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
>> FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
>> 
>> -> There is already a PR for it and needs review.
>> 
>> - FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
>> e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount on
>> Docker test"
>> 
>> 
>>> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rm...@apache.org> wrote:
>>> 
>>> Hi team,
>>> 
>>> 2 weeks have passed since the last update. None of the test instabilities
>>> I've mentioned have been addressed since then.
>>> 
>>> Here's an updated status report of Blockers and Test instabilities:
>>> 
>>> Blockers <
>> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
>>> Currently 3 blockers (2x Hive, 1x CI Infra)
>>> 
>>> Test-Instabilities
>>> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>
>> (total
>>> 79) which failed recently or frequently:
>>> 
>>> 
>>> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
>>> FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
>>> failed with "Timeout expired after 60000milliseconds while awaiting
>>> EndTxn(COMMIT)"
>>> 
>>> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
>>> FlinkKafkaProducerITCase.testRecoverCommittedTransaction
>>> failed with "Timeout expired after 60000milliseconds while awaiting
>>> InitProducerId"
>>> 
>>> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
>>> FlinkKafkaProducerITCase
>>> testScaleUpAfterScalingDown Timeout expired while initializing
>>> transactional state in 60000ms.
>>> 
>>> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
>>> FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
>>> 
>>> --> The first three tickets seem related.
>>> 
>>> 
>>> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
>>> StreamingKafkaITCase failure on Azure
>>> 
>>> --> This one seems really hard to reproduce
>>> 
>>> 
>>> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
>>> HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
>>> hangs
>>> 
>>> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
>>> HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
>>> produced no output for 900 seconds
>>> 
>>> --> nobody seems to feel responsible for these tickets. My guess is that
>>> the S3 connector should have shorter timeouts / faster retries to finish
>>> within the 15 minutes test timeout. OR there is really something wrong
>> with
>>> the code.
>>> 
>>> 
>>> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
>>> UnsignedTypeConversionITCase failed caused by MariaDB4j
>>> "Asked to waitFor Program"
>>>
>>> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159> ES6
>>> ElasticsearchSinkITCase unstable
>>> 
>>> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
>>> KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
>>> expected:<310> but was:<0>
>>> 
>>> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222> "Avro
>>> Confluent Schema Registry nightly end-to-end test" unstable with "Kafka
>>> cluster did not start after 120 seconds"
>>> 
>>> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
>> "RocksDB
>>> Memory Management end-to-end test" fails with "Current block cache usage
>>> 202123272 larger than expected memory limit 200000000"
>>> 
>>> 
>>> 
>>> 
>>> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rm...@apache.org>
>> wrote:
>>> 
>>>> Hi team,
>>>> 
>>>> We would like to use this thread as a permanent thread for
>>>> regularly syncing on stale blockers (need to have somebody assigned
>> within
>>>> a week and progress, or a good plan) and build instabilities (need to
>> check
>>>> if its a blocker).
>>>> 
>>>> Recent test-instabilities:
>>>> 
>>>>  - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
>>>>  - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test
>> unstable)
>>>>  - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test
>> unstable)
>>>>  - https://issues.apache.org/jira/browse/FLINK-17949
>>>>  (KafkaShuffleITCase)
>>>>  - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka
>>>>  transactions)
>>>> 
>>>> 
>>>> It would be nice if the committers taking care of these components could
>>>> look into the test failures.
>>>> If nothing happens, we'll personally reach out to people I believe they
>>>> could look into the ticket.
>>>> 
>>>> Best,
>>>> Dian & Robert
>>>> 
>> 
>> 
> 
> -- 
> Best regards!
> Rui Li


Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Rui Li <li...@gmail.com>.
Hi Dian,

FLINK-18682 has been fixed. Is there any other blocker in the hive
connector?

On Tue, Aug 25, 2020 at 2:41 PM Dian Fu <di...@gmail.com> wrote:

> Hi all,
>
> Two weeks have passed and it seems that none of the test instability
> issues have been addressed since then.
>
> Here is an updated status report of Blockers and Test instabilities:
>
> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> Currently 2 blockers (1x Hive, 1x CI Infra)
>
> Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>:
> (total 80)
>
> Besides the issues already posted in previous mail, here are the new
> instability issues which should be taken care of:
>
> - FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
> E2E test fails with "Cannot register Closeable, this
> subtaskCheckpointCoordinator is already closed. Closing argument."
>
> -> This is a new issue occurred recently. It has occurred several times
> and may indicate a bug somewhere and should be taken care of.
>
> - FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
> FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI
>
> -> There is already a PR for it and needs review.
>
> - FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
> e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount on
> Docker test"
>
>
> > On Aug 11, 2020, at 2:08 PM, Robert Metzger <rm...@apache.org> wrote:
> >
> > Hi team,
> >
> > 2 weeks have passed since the last update. None of the test instabilities
> > I've mentioned have been addressed since then.
> >
> > Here's an updated status report of Blockers and Test instabilities:
> >
> > Blockers <
> https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> > Currently 3 blockers (2x Hive, 1x CI Infra)
> >
> > Test-Instabilities
> > <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>
> (total
> > 79) which failed recently or frequently:
> >
> >
> > - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
> > FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
> > failed with "Timeout expired after 60000milliseconds while awaiting
> > EndTxn(COMMIT)"
> >
> > - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
> > FlinkKafkaProducerITCase.testRecoverCommittedTransaction
> > failed with "Timeout expired after 60000milliseconds while awaiting
> > InitProducerId"
> >
> > - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
> > FlinkKafkaProducerITCase
> > testScaleUpAfterScalingDown Timeout expired while initializing
> > transactional state in 60000ms.
> >
> > - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
> > FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
> >
> > --> The first three tickets seem related.
> >
> >
> > - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
> > StreamingKafkaITCase failure on Azure
> >
> > --> This one seems really hard to reproduce
> >
> >
> > - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
> > HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
> > hangs
> >
> > - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
> > HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
> > produced no output for 900 seconds
> >
> > --> nobody seems to feel responsible for these tickets. My guess is that
> > the S3 connector should have shorter timeouts / faster retries to finish
> > within the 15 minutes test timeout. OR there is really something wrong
> with
> > the code.
> >
> >
> > - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
> > UnsignedTypeConversionITCase failed caused by MariaDB4j
> > "Asked to waitFor Program"
> >
> > - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159> ES6
> > ElasticsearchSinkITCase unstable
> >
> > - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
> > KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
> > expected:<310> but was:<0>
> >
> > - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222> "Avro
> > Confluent Schema Registry nightly end-to-end test" unstable with "Kafka
> > cluster did not start after 120 seconds"
> >
> > - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511>
> "RocksDB
> > Memory Management end-to-end test" fails with "Current block cache usage
> > 202123272 larger than expected memory limit 200000000"
> >
> >
> >
> >
> > On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rm...@apache.org>
> wrote:
> >
> >> Hi team,
> >>
> >> We would like to use this thread as a permanent thread for
> >> regularly syncing on stale blockers (need to have somebody assigned
> within
> >> a week and progress, or a good plan) and build instabilities (need to
> check
> >> if its a blocker).
> >>
> >> Recent test-instabilities:
> >>
> >>   - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
> >>   - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test
> unstable)
> >>   - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test
> unstable)
> >>   - https://issues.apache.org/jira/browse/FLINK-17949
> >>   (KafkaShuffleITCase)
> >>   - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka
> >>   transactions)
> >>
> >>
> >> It would be nice if the committers taking care of these components could
> >> look into the test failures.
> >> If nothing happens, we'll personally reach out to people I believe they
> >> could look into the ticket.
> >>
> >> Best,
> >> Dian & Robert
> >>
>
>

-- 
Best regards!
Rui Li

Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Dian Fu <di...@gmail.com>.
Hi all,

Two weeks have passed and it seems that none of the test instability issues have been addressed since then.

Here is an updated status report of Blockers and Test instabilities:

Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
Currently 2 blockers (1x Hive, 1x CI Infra)

Test-Instabilities <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580>:
(total 80)

Besides the issues already posted in the previous mail, here are the new instability issues that should be taken care of:

- FLINK-19012 (https://issues.apache.org/jira/browse/FLINK-19012)
E2E test fails with "Cannot register Closeable, this subtaskCheckpointCoordinator is already closed. Closing argument."

-> This is a new issue that appeared recently. It has occurred several times, may indicate a bug somewhere, and should be taken care of.

- FLINK-9992 (https://issues.apache.org/jira/browse/FLINK-9992)
FsStorageLocationReferenceTest#testEncodeAndDecode failed in CI

-> There is already a PR for it, which needs review.

- FLINK-18842 (https://issues.apache.org/jira/browse/FLINK-18842)
e2e test failed to download "localhost:9999/flink.tgz" in "Wordcount on Docker test"


> On Aug 11, 2020, at 2:08 PM, Robert Metzger <rm...@apache.org> wrote:
> 
> Hi team,
> 
> 2 weeks have passed since the last update. None of the test instabilities
> I've mentioned have been addressed since then.
> 
> Here's an updated status report of Blockers and Test instabilities:
> 
> Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
> Currently 3 blockers (2x Hive, 1x CI Infra)
> 
> Test-Instabilities
> <https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580> (total
> 79) which failed recently or frequently:
> 
> 
> - FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
> FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
> failed with "Timeout expired after 60000milliseconds while awaiting
> EndTxn(COMMIT)"
> 
> - FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
> FlinkKafkaProducerITCase.testRecoverCommittedTransaction
> failed with "Timeout expired after 60000milliseconds while awaiting
> InitProducerId"
> 
> - FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
> FlinkKafkaProducerITCase
> testScaleUpAfterScalingDown Timeout expired while initializing
> transactional state in 60000ms.
> 
> - FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
> FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis
> 
> --> The first three tickets seem related.
> 
> 
> - FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
> StreamingKafkaITCase failure on Azure
> 
> --> This one seems really hard to reproduce
> 
> 
> - FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
> HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
> hangs
> 
> - FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
> HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
> produced no output for 900 seconds
> 
> --> nobody seems to feel responsible for these tickets. My guess is that
> the S3 connector should have shorter timeouts / faster retries to finish
> within the 15 minutes test timeout. OR there is really something wrong with
> the code.
> 
> 
> - FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
> UnsignedTypeConversionITCase failed caused by MariaDB4j
> "Asked to waitFor Program"
>
> - FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159> ES6
> ElasticsearchSinkITCase unstable
> 
> - FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
> KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
> expected:<310> but was:<0>
> 
> - FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222> "Avro
> Confluent Schema Registry nightly end-to-end test" unstable with "Kafka
> cluster did not start after 120 seconds"
> 
> - FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511> "RocksDB
> Memory Management end-to-end test" fails with "Current block cache usage
> 202123272 larger than expected memory limit 200000000"
> 
> 
> 
> 
> On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rm...@apache.org> wrote:
> 
>> Hi team,
>> 
>> We would like to use this thread as a permanent thread for
>> regularly syncing on stale blockers (need to have somebody assigned within
>> a week and progress, or a good plan) and build instabilities (need to check
>> if its a blocker).
>> 
>> Recent test-instabilities:
>> 
>>   - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
>>   - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test unstable)
>>   - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test unstable)
>>   - https://issues.apache.org/jira/browse/FLINK-17949
>>   (KafkaShuffleITCase)
>>   - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka
>>   transactions)
>> 
>> 
>> It would be nice if the committers taking care of these components could
>> look into the test failures.
>> If nothing happens, we'll personally reach out to people I believe they
>> could look into the ticket.
>> 
>> Best,
>> Dian & Robert
>> 


Re: [DISCUSS][Release 1.12] Stale blockers and build instabilities

Posted by Robert Metzger <rm...@apache.org>.
Hi team,

2 weeks have passed since the last update. None of the test instabilities
I've mentioned have been addressed since then.

Here's an updated status report of Blockers and Test instabilities:

Blockers <https://issues.apache.org/jira/browse/FLINK-18682?filter=12349334>:
Currently 3 blockers (2x Hive, 1x CI Infra)

Test-Instabilities
<https://issues.apache.org/jira/browse/FLINK-18869?filter=12348580> (total
79) which failed recently or frequently:


- FLINK-18807 <https://issues.apache.org/jira/browse/FLINK-18807>
FlinkKafkaProducerITCase.testScaleUpAfterScalingDown
failed with "Timeout expired after 60000milliseconds while awaiting
EndTxn(COMMIT)"

- FLINK-18634 <https://issues.apache.org/jira/browse/FLINK-18634>
FlinkKafkaProducerITCase.testRecoverCommittedTransaction
failed with "Timeout expired after 60000milliseconds while awaiting
InitProducerId"

- FLINK-16908 <https://issues.apache.org/jira/browse/FLINK-16908>
FlinkKafkaProducerITCase
testScaleUpAfterScalingDown Timeout expired while initializing
transactional state in 60000ms.

- FLINK-13733 <https://issues.apache.org/jira/browse/FLINK-13733>
FlinkKafkaInternalProducerITCase.testHappyPath fails on Travis

--> The first three tickets seem related.


- FLINK-17260 <https://issues.apache.org/jira/browse/FLINK-17260>
StreamingKafkaITCase failure on Azure

--> This one seems really hard to reproduce


- FLINK-16768 <https://issues.apache.org/jira/browse/FLINK-16768>
HadoopS3RecoverableWriterITCase.testRecoverWithStateWithMultiPart
hangs

- FLINK-18374 <https://issues.apache.org/jira/browse/FLINK-18374>
HadoopS3RecoverableWriterITCase.testRecoverAfterMultiplePersistsStateWithMultiPart
produced no output for 900 seconds

--> Nobody seems to feel responsible for these tickets. My guess is that
the S3 connector should have shorter timeouts / faster retries to finish
within the 15-minute test timeout, or there is really something wrong with
the code.


- FLINK-18333 <https://issues.apache.org/jira/browse/FLINK-18333>
UnsignedTypeConversionITCase failed caused by MariaDB4j
"Asked to waitFor Program"

- FLINK-17159 <https://issues.apache.org/jira/browse/FLINK-17159> ES6
ElasticsearchSinkITCase unstable

- FLINK-17949 <https://issues.apache.org/jira/browse/FLINK-17949>
KafkaShuffleITCase.testSerDeIngestionTime:156->testRecordSerDe:388
expected:<310> but was:<0>

- FLINK-18222 <https://issues.apache.org/jira/browse/FLINK-18222> "Avro
Confluent Schema Registry nightly end-to-end test" unstable with "Kafka
cluster did not start after 120 seconds"

- FLINK-17511 <https://issues.apache.org/jira/browse/FLINK-17511> "RocksDB
Memory Management end-to-end test" fails with "Current block cache usage
202123272 larger than expected memory limit 200000000"




On Mon, Jul 27, 2020 at 8:42 PM Robert Metzger <rm...@apache.org> wrote:

> Hi team,
>
> We would like to use this thread as a permanent thread for
> regularly syncing on stale blockers (need to have somebody assigned within
> a week and progress, or a good plan) and build instabilities (need to check
> if its a blocker).
>
> Recent test-instabilities:
>
>    - https://issues.apache.org/jira/browse/FLINK-17159 (ES6 test)
>    - https://issues.apache.org/jira/browse/FLINK-16768 (s3 test unstable)
>    - https://issues.apache.org/jira/browse/FLINK-18374 (s3 test unstable)
>    - https://issues.apache.org/jira/browse/FLINK-17949
>    (KafkaShuffleITCase)
>    - https://issues.apache.org/jira/browse/FLINK-18634 (Kafka
>    transactions)
>
>
> It would be nice if the committers taking care of these components could
> look into the test failures.
> If nothing happens, we'll personally reach out to people I believe they
> could look into the ticket.
>
> Best,
> Dian & Robert
>