You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/06/14 12:03:23 UTC

[GitHub] [flink] zentol opened a new pull request, #19957: [FLINK-28052][tests] Remove RunFailedJobListener

zentol opened a new pull request, #19957:
URL: https://github.com/apache/flink/pull/19957

   The RunFailedJobListener had rather obscure semantics.
   It considered a job to be terminal after it was restarted. This is awfully specific to a particular test case.
   A cleaner approach is just to just cancel the job and wait for it to terminate.
   
   Additionally it considered a job as running purely based on the job status, whereas, in particular when checkpointing is involved, waiting for the tasks to be submitted is a better measure.
   In fact, testExceptionHistoryWithTaskFailureFromStopWithSavepoint was a broken since a savepoint was never triggered, as not all tasks were running.
   
   This PR also contains a few cleanup commits.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] XComp commented on a diff in pull request #19957: [FLINK-28052][tests] Remove RunFailedJobListener

Posted by GitBox <gi...@apache.org>.
XComp commented on code in PR #19957:
URL: https://github.com/apache/flink/pull/19957#discussion_r897586518


##########
flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java:
##########
@@ -1080,10 +1053,10 @@ private Iterable<RootExceptionHistoryEntry> runExceptionHistoryTests(
                             declarativeSlotPool,
                             createSlotOffersForResourceRequirements(
                                     ResourceCounter.withResource(
-                                            ResourceProfile.UNKNOWN, numAvailableSlots)),
+                                            ResourceProfile.UNKNOWN, PARALLELISM)),
                             taskManagerGateway);
                 });
-        listener.waitForRunning();
+        taskManagerGateway.waitForSubmissions(4, TestingUtils.infiniteDuration());

Review Comment:
   can we add a comment here about why waiting for all subtasks here is necessary due to checkpointing?



##########
flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java:
##########
@@ -1080,10 +1053,10 @@ private Iterable<RootExceptionHistoryEntry> runExceptionHistoryTests(
                             declarativeSlotPool,
                             createSlotOffersForResourceRequirements(
                                     ResourceCounter.withResource(
-                                            ResourceProfile.UNKNOWN, numAvailableSlots)),
+                                            ResourceProfile.UNKNOWN, PARALLELISM)),
                             taskManagerGateway);
                 });
-        listener.waitForRunning();
+        taskManagerGateway.waitForSubmissions(4, TestingUtils.infiniteDuration());

Review Comment:
   ```suggestion
           taskManagerGateway.waitForSubmissions(PARALLELISM, TestingUtils.infiniteDuration());
   ```
   We can remove the magic number here as well...



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] zentol commented on a diff in pull request #19957: [FLINK-28052][tests] Remove RunFailedJobListener

Posted by GitBox <gi...@apache.org>.
zentol commented on code in PR #19957:
URL: https://github.com/apache/flink/pull/19957#discussion_r897588588


##########
flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java:
##########
@@ -1080,10 +1053,10 @@ private Iterable<RootExceptionHistoryEntry> runExceptionHistoryTests(
                             declarativeSlotPool,
                             createSlotOffersForResourceRequirements(
                                     ResourceCounter.withResource(
-                                            ResourceProfile.UNKNOWN, numAvailableSlots)),
+                                            ResourceProfile.UNKNOWN, PARALLELISM)),
                             taskManagerGateway);
                 });
-        listener.waitForRunning();
+        taskManagerGateway.waitForSubmissions(4, TestingUtils.infiniteDuration());

Review Comment:
   good point; will do



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] zentol merged pull request #19957: [FLINK-28052][tests] Remove RunFailedJobListener

Posted by GitBox <gi...@apache.org>.
zentol merged PR #19957:
URL: https://github.com/apache/flink/pull/19957


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] zentol commented on pull request #19957: [FLINK-28052][tests] Remove RunFailedJobListener

Posted by GitBox <gi...@apache.org>.
zentol commented on PR #19957:
URL: https://github.com/apache/flink/pull/19957#issuecomment-1156020082

   > But I don't understand why it's important to know that one test issues task restarts for the old implementation to be kind of reasonable?
   
   It doesn't make sense because it's not even doing what it says on the tin.
   What it looks like the test is doing is completing some job cancellation (that came from _somewhere_) and waiting for it to reach a terminal state.
   
   In case of the restarting test, it just completes the restart (==acknowledging the cancel requests). The job actually keeps running. Which is then covered by this special case in the listener where we consider it terminal if it restarted.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] XComp commented on a diff in pull request #19957: [FLINK-28052][tests] Remove RunFailedJobListener

Posted by GitBox <gi...@apache.org>.
XComp commented on code in PR #19957:
URL: https://github.com/apache/flink/pull/19957#discussion_r896935595


##########
flink-runtime/src/test/java/org/apache/flink/runtime/scheduler/adaptive/AdaptiveSchedulerTest.java:
##########
@@ -1106,17 +1079,8 @@ private Iterable<RootExceptionHistoryEntry> runExceptionHistoryTests(
                         singleThreadMainThreadExecutor);
         runTestLogicFuture.get();
 
-        Consumer<ExecutionAttemptID> canceller =
-                attemptId ->
-                        scheduler.updateTaskExecutionState(
-                                new TaskExecutionStateTransition(
-                                        new TaskExecutionState(
-                                                attemptId, ExecutionState.CANCELED, null)));
-        CompletableFuture<Void> cancelFuture =
-                CompletableFuture.runAsync(
-                        () -> cancelledTasks.forEach(canceller), singleThreadMainThreadExecutor);
-        cancelFuture.get();
-        listener.waitForTerminal();
+        singleThreadMainThreadExecutor.execute(() -> scheduler.cancel());

Review Comment:
   ```suggestion
           singleThreadMainThreadExecutor.execute(scheduler::cancel);
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] flinkbot commented on pull request #19957: [FLINK-28052][tests] Remove RunFailedJobListener

Posted by GitBox <gi...@apache.org>.
flinkbot commented on PR #19957:
URL: https://github.com/apache/flink/pull/19957#issuecomment-1155097581

   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "803106f5231d5ed2f2156c4e779b004d70d8fdcf",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "803106f5231d5ed2f2156c4e779b004d70d8fdcf",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 803106f5231d5ed2f2156c4e779b004d70d8fdcf UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org