Posted to issues@flink.apache.org by "Roman Khachatryan (Jira)" <ji...@apache.org> on 2022/05/25 13:22:00 UTC

[jira] [Comment Edited] (FLINK-27169) PartiallyFinishedSourcesITCase.test hangs on azure

    [ https://issues.apache.org/jira/browse/FLINK-27169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17542043#comment-17542043 ] 

Roman Khachatryan edited comment on FLINK-27169 at 5/25/22 1:21 PM:
--------------------------------------------------------------------

Thanks for looking into the issue [~chesnay].
This is what I believe causes the test to hang:
 # Checkpoint 1 completes
 # Several subsequent checkpoints fail due to a timeout while writing changelog segments (all 3 configured attempts exhausted)
 # Job graph gets restarted due to failure
 # FINISH_SOURCES command gets lost as a result
 # TestJobExecutor hangs in waitForSubtasksToFinish as a result

I suppose the root cause is an intermittent failure of the local disk. I'm going to increase the timeout and the number of attempts in the test. Increasing environment.tolerable_declined_checkpoint_number won't help because a failed upload fails all subsequent checkpoints (until the segment is materialized).
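
For illustration only, the retry behaviour described above (a fixed number of attempts, each with its own timeout) can be sketched in plain Java. This is not Flink's changelog uploader; RetryingUpload, callWithRetries and their parameters are hypothetical names chosen for the example.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Hypothetical helper; NOT part of Flink. */
public class RetryingUpload {

    /** Runs the task up to maxAttempts times, giving each attempt its own timeout. */
    public static <T> T callWithRetries(Callable<T> task, int maxAttempts, long timeoutMillis)
            throws Exception {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        try {
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = executor.submit(task);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    // Interrupt the slow or failed attempt before retrying.
                    future.cancel(true);
                    last = e;
                }
            }
            // Every attempt timed out or failed; the caller sees a single failure.
            throw new Exception("All " + maxAttempts + " attempts failed", last);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

With a slow disk, a short per-attempt timeout exhausts all attempts quickly, so raising both the timeout and the attempt count makes a transient slowdown less likely to surface as a failure.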


To prevent the test from hanging, I'm going to add a timeout (restarts cannot be disabled because the test scenario requires them, and they cannot be detected easily).
To ease debugging, I'm going to raise the log level of TestJobExecutor to INFO.
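
A minimal sketch of the kind of timeout meant here, assuming a simple polling wait. BoundedWait and waitUntil are hypothetical names for this example and do not reproduce CommonTestUtils.waitUntilCondition.

```java
import java.time.Duration;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical helper; NOT Flink's CommonTestUtils. */
public class BoundedWait {

    /** Polls the condition until it holds or the timeout elapses. */
    public static void waitUntil(BooleanSupplier condition, Duration timeout, Duration pollInterval)
            throws TimeoutException, InterruptedException {
        final long deadlineNanos = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            // Compare via subtraction so the check stays correct if nanoTime wraps around.
            if (System.nanoTime() - deadlineNanos >= 0) {
                throw new TimeoutException("Condition not met within " + timeout);
            }
            Thread.sleep(pollInterval.toMillis());
        }
    }

    public static void main(String[] args) throws Exception {
        // A condition that never becomes true stands in for subtasks that never
        // finish because the FINISH_SOURCES command was lost on restart.
        try {
            waitUntil(() -> false, Duration.ofMillis(200), Duration.ofMillis(20));
            System.out.println("finished");
        } catch (TimeoutException e) {
            System.out.println("timed out");
        }
    }
}
```

With a deadline like this, a lost command makes the test fail with a clear exception instead of blocking until the CI watchdog kills the build.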

 

I'll open a PR with the above changes.



> PartiallyFinishedSourcesITCase.test hangs on azure
> --------------------------------------------------
>
>                 Key: FLINK-27169
>                 URL: https://issues.apache.org/jira/browse/FLINK-27169
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.16.0
>            Reporter: Yun Gao
>            Assignee: Roman Khachatryan
>            Priority: Major
>              Labels: test-stability
>             Fix For: 1.16.0
>
>
> {code:java}
> Apr 10 08:32:18 "main" #1 prio=5 os_prio=0 tid=0x00007f553400b800 nid=0x8345 waiting on condition [0x00007f553be60000]
> Apr 10 08:32:18    java.lang.Thread.State: TIMED_WAITING (sleeping)
> Apr 10 08:32:18 	at java.lang.Thread.sleep(Native Method)
> Apr 10 08:32:18 	at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
> Apr 10 08:32:18 	at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:138)
> Apr 10 08:32:18 	at org.apache.flink.runtime.testutils.CommonTestUtils.waitForSubtasksToFinish(CommonTestUtils.java:291)
> Apr 10 08:32:18 	at org.apache.flink.runtime.operators.lifecycle.TestJobExecutor.waitForSubtasksToFinish(TestJobExecutor.java:226)
> Apr 10 08:32:18 	at org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase.test(PartiallyFinishedSourcesITCase.java:138)
> Apr 10 08:32:18 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> Apr 10 08:32:18 	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> Apr 10 08:32:18 	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> Apr 10 08:32:18 	at java.lang.reflect.Method.invoke(Method.java:498)
> Apr 10 08:32:18 	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
> Apr 10 08:32:18 	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> Apr 10 08:32:18 	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
> Apr 10 08:32:18 	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> Apr 10 08:32:18 	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> Apr 10 08:32:18 	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> Apr 10 08:32:18 	at org.junit.rules.ExternalResource$1.evaluate(ExternalResource.java:54)
> Apr 10 08:32:18 	at org.apache.flink.util.TestNameProvider$1.evaluate(TestNameProvider.java:45)
> Apr 10 08:32:18 	at org.junit.rules.TestWatcher$1.evaluate(TestWatcher.java:61)
> Apr 10 08:32:18 	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
> Apr 10 08:32:18 	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
> Apr 10 08:32:18 	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
> Apr 10 08:32:18 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
> Apr 10 08:32:18 	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
> Apr 10 08:32:18 	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> Apr 10 08:32:18 	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
> Apr 10 08:32:18 	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
> Apr 10 08:32:18 	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
> Apr 10 08:32:18 	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
> Apr 10 08:32:18 	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
> Apr 10 08:32:18 	at org.junit.runners.Suite.runChild(Suite.java:128)
> Apr 10 08:32:18 	at org.junit.runners.Suite.runChild(Suite.java:27)
> Apr 10 08:32:18 	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
> {code}
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=34484&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=6757



--
This message was sent by Atlassian Jira
(v8.20.7#820007)