You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by "Roman Khachatryan (Jira)" <ji...@apache.org> on 2023/03/03 20:14:00 UTC

[jira] [Resolved] (FLINK-31133) PartiallyFinishedSourcesITCase hangs if a checkpoint fails

     [ https://issues.apache.org/jira/browse/FLINK-31133?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Roman Khachatryan resolved FLINK-31133.
---------------------------------------
    Resolution: Fixed

Fix merged asĀ 

1.15 51660f840cfc505b9b9cb72530fde7f9f8a4dee2
1.16 cf04b2c08fa04091845bd310990497129c3bcbe8
1.17 6e7703738cdefed17277ea86d2c9dc25393eceac
master 4aacff572a9e3996c5dee9273638831e4040c767

> PartiallyFinishedSourcesITCase hangs if a checkpoint fails
> ----------------------------------------------------------
>
>                 Key: FLINK-31133
>                 URL: https://issues.apache.org/jira/browse/FLINK-31133
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Coordination
>    Affects Versions: 1.15.3, 1.16.1, 1.18.0, 1.17.1
>            Reporter: Matthias Pohl
>            Assignee: Roman Khachatryan
>            Priority: Major
>              Labels: pull-request-available, test-stability
>             Fix For: 1.16.2, 1.18.0, 1.17.1, 1.15.5
>
>
> https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b
> This build ran into a timeout. Based on the stacktraces reported, it was either caused by [SnapshotMigrationTestBase.restoreAndExecute|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=13475]:
> {code}
> "main" #1 prio=5 os_prio=0 tid=0x00007f23d800b800 nid=0x60cdd waiting on condition [0x00007f23e1c0d000]
>    java.lang.Thread.State: TIMED_WAITING (sleeping)
> 	at java.lang.Thread.sleep(Native Method)
> 	at org.apache.flink.test.checkpointing.utils.SnapshotMigrationTestBase.restoreAndExecute(SnapshotMigrationTestBase.java:382)
> 	at org.apache.flink.test.migration.TypeSerializerSnapshotMigrationITCase.testSnapshot(TypeSerializerSnapshotMigrationITCase.java:172)
> 	at sun.reflect.GeneratedMethodAccessor47.invoke(Unknown Source)
> [...]
> {code}
> or [PartiallyFinishedSourcesITCase.test|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=10401]:
> {code}
> 2023-02-20T07:13:05.6084711Z "main" #1 prio=5 os_prio=0 tid=0x00007fd35c00b800 nid=0x8c8a waiting on condition [0x00007fd363d0f000]
> 2023-02-20T07:13:05.6085149Z    java.lang.Thread.State: TIMED_WAITING (sleeping)
> 2023-02-20T07:13:05.6085487Z 	at java.lang.Thread.sleep(Native Method)
> 2023-02-20T07:13:05.6085925Z 	at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:145)
> 2023-02-20T07:13:05.6086512Z 	at org.apache.flink.runtime.testutils.CommonTestUtils.waitUntilCondition(CommonTestUtils.java:138)
> 2023-02-20T07:13:05.6087103Z 	at org.apache.flink.runtime.testutils.CommonTestUtils.waitForSubtasksToFinish(CommonTestUtils.java:291)
> 2023-02-20T07:13:05.6087730Z 	at org.apache.flink.runtime.operators.lifecycle.TestJobExecutor.waitForSubtasksToFinish(TestJobExecutor.java:226)
> 2023-02-20T07:13:05.6088410Z 	at org.apache.flink.runtime.operators.lifecycle.PartiallyFinishedSourcesITCase.test(PartiallyFinishedSourcesITCase.java:138)
> 2023-02-20T07:13:05.6088957Z 	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> [...]
> {code}
> Still, it sounds odd: Based on a code analysis it's quite unlikely that those two caused the issue. The former one has a 5 min timeout (see related code in [SnapshotMigrationTestBase:382|https://github.com/apache/flink/blob/release-1.15/flink-tests/src/test/java/org/apache/flink/test/checkpointing/utils/SnapshotMigrationTestBase.java#L382]). For the other one, we found it being not responsible in the past when some other concurrent test caused the issue (see FLINK-30261).
> An investigation on where we lose the time for the timeout revealed that {{AdaptiveSchedulerITCase}} took 2980s to finish (see [build logs|https://dev.azure.com/apache-flink/apache-flink/_build/results?buildId=46299&view=logs&j=39d5b1d5-3b41-54dc-6458-1e2ddd1cdcf3&t=0c010d0c-3dec-5bf1-d408-7b18988b1b2b&l=5265]).
> {code}
> 2023-02-20T03:43:55.4546050Z Feb 20 03:43:55 [ERROR] Picked up JAVA_TOOL_OPTIONS: -XX:+HeapDumpOnOutOfMemoryError
> 2023-02-20T03:43:58.0448506Z Feb 20 03:43:58 [INFO] Running org.apache.flink.test.scheduling.AdaptiveSchedulerITCase
> 2023-02-20T04:33:38.6824634Z Feb 20 04:33:38 [INFO] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2,980.445 s - in org.apache.flink.test.scheduling.AdaptiveSchedulerITCase
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)