Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/11/10 03:03:18 UTC

[GitHub] [flink] 1996fanrui opened a new pull request, #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

1996fanrui opened a new pull request, #21281:
URL: https://github.com/apache/flink/pull/21281

   ## What is the purpose of the change
   
   Include the root cause in the exception that is thrown when the tolerable checkpoint failure threshold is exceeded. This helps during troubleshooting.
   
   After the change:
   
   <img width="1297" alt="image" src="https://user-images.githubusercontent.com/38427477/200990319-1f1211ff-b46a-4d0e-9e65-0fd5aca50752.png">
   
   
   ## Brief change log
   
   Attach the root cause to the `FlinkRuntimeException` thrown when the tolerable checkpoint failure threshold is exceeded.
   
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not documented
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] Myasuka commented on a diff in pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
Myasuka commented on code in PR #21281:
URL: https://github.com/apache/flink/pull/21281#discussion_r1018741944


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java:
##########
@@ -204,7 +204,8 @@ private void checkFailureAgainstCounter(
             if (continuousFailureCounter.get() > tolerableCpFailureNumber) {
                 clearCount();
                 errorHandler.accept(
-                        new FlinkRuntimeException(EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE));
+                        new FlinkRuntimeException(
+                                EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE, exception));

Review Comment:
   The job fails because the failure counter exceeds the tolerable number, and we can only attach the exception from the last failed checkpoint. However, this could lead users to think that all checkpoints failed because of that last exception. The correct approach is to have users check the JobManager logs or the checkpoint UI to see what happened in the most recent checkpoints.
   From my point of view, I am +0 on this proposal.
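
To make the concern concrete, here is a minimal sketch of the counting logic visible in the diff. It assumes a simplified, hypothetical class (`FailureCounterSketch`) rather than Flink's actual `CheckpointFailureManager`, and uses a plain `RuntimeException` in place of `FlinkRuntimeException`. The method names are illustrative only.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

// Hypothetical, simplified stand-in for the logic in the diff above;
// this is not Flink's CheckpointFailureManager.
class FailureCounterSketch {
    static final String MESSAGE = "Exceeded checkpoint tolerable failure threshold.";

    private final int tolerableFailures;
    private final Consumer<RuntimeException> errorHandler;
    private final AtomicInteger continuousFailureCounter = new AtomicInteger(0);

    FailureCounterSketch(int tolerableFailures, Consumer<RuntimeException> errorHandler) {
        this.tolerableFailures = tolerableFailures;
        this.errorHandler = errorHandler;
    }

    // Called once per failed checkpoint (assumed entry point for this sketch).
    void onCheckpointFailure(Exception cause) {
        if (continuousFailureCounter.incrementAndGet() > tolerableFailures) {
            continuousFailureCounter.set(0);
            // Only the exception of the failure that crossed the threshold is
            // available here; the earlier failures are not part of the chain.
            errorHandler.accept(new RuntimeException(MESSAGE, cause));
        }
    }

    // A successful checkpoint resets the run of consecutive failures.
    void onCheckpointSuccess() {
        continuousFailureCounter.set(0);
    }

    public static void main(String[] args) {
        FailureCounterSketch counter = new FailureCounterSketch(
                2, e -> System.out.println(e.getMessage() + " Caused by: " + e.getCause().getMessage()));
        counter.onCheckpointFailure(new Exception("disk full"));
        counter.onCheckpointFailure(new Exception("disk full"));
        counter.onCheckpointFailure(new Exception("timeout"));
    }
}
```

In this sketch the two "disk full" failures are tolerated, and the reported exception carries only "timeout" as its cause, which is the situation the comment above warns about.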





[GitHub] [flink] Myasuka merged pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
Myasuka merged PR #21281:
URL: https://github.com/apache/flink/pull/21281




[GitHub] [flink] 1996fanrui commented on pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
1996fanrui commented on PR #21281:
URL: https://github.com/apache/flink/pull/21281#issuecomment-1312487773

   @flinkbot run azure




[GitHub] [flink] Myasuka commented on a diff in pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
Myasuka commented on code in PR #21281:
URL: https://github.com/apache/flink/pull/21281#discussion_r1023930513


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java:
##########
@@ -204,7 +204,8 @@ private void checkFailureAgainstCounter(
             if (continuousFailureCounter.get() > tolerableCpFailureNumber) {
                 clearCount();
                 errorHandler.accept(
-                        new FlinkRuntimeException(EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE));
+                        new FlinkRuntimeException(
+                                EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE, exception));

Review Comment:
   I think this might be a better idea. BTW, could you please make `Exceeded checkpoint tolerable failure threshold` end with a period instead of a comma, so that it matches the previous message?





[GitHub] [flink] flinkbot commented on pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
flinkbot commented on PR #21281:
URL: https://github.com/apache/flink/pull/21281#issuecomment-1309707084

   ## CI report:
   
   * 57a945e1e31779602a525232a453a34078250341 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run azure` re-run the last Azure build
   </details>




[GitHub] [flink] Myasuka commented on a diff in pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
Myasuka commented on code in PR #21281:
URL: https://github.com/apache/flink/pull/21281#discussion_r1019856735


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java:
##########
@@ -204,7 +204,8 @@ private void checkFailureAgainstCounter(
             if (continuousFailureCounter.get() > tolerableCpFailureNumber) {
                 clearCount();
                 errorHandler.accept(
-                        new FlinkRuntimeException(EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE));
+                        new FlinkRuntimeException(
+                                EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE, exception));

Review Comment:
   I think we can add a description telling users to refer to the Checkpoint History tab or the JobManager logs to see why consecutive checkpoints failed. That would be better than the current hint.





[GitHub] [flink] 1996fanrui commented on pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
1996fanrui commented on PR #21281:
URL: https://github.com/apache/flink/pull/21281#issuecomment-1312459114

   @flinkbot run azure




[GitHub] [flink] 1996fanrui commented on pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
1996fanrui commented on PR #21281:
URL: https://github.com/apache/flink/pull/21281#issuecomment-1309709403

   @Myasuka Please take a look when you have time, thanks~




[GitHub] [flink] 1996fanrui commented on a diff in pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
1996fanrui commented on code in PR #21281:
URL: https://github.com/apache/flink/pull/21281#discussion_r1019960875


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java:
##########
@@ -204,7 +204,8 @@ private void checkFailureAgainstCounter(
             if (continuousFailureCounter.get() > tolerableCpFailureNumber) {
                 clearCount();
                 errorHandler.accept(
-                        new FlinkRuntimeException(EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE));
+                        new FlinkRuntimeException(
+                                EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE, exception));

Review Comment:
   Thanks for your suggestion. I have added a description.





[GitHub] [flink] 1996fanrui commented on a diff in pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
1996fanrui commented on code in PR #21281:
URL: https://github.com/apache/flink/pull/21281#discussion_r1024664659


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java:
##########
@@ -204,7 +204,8 @@ private void checkFailureAgainstCounter(
             if (continuousFailureCounter.get() > tolerableCpFailureNumber) {
                 clearCount();
                 errorHandler.accept(
-                        new FlinkRuntimeException(EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE));
+                        new FlinkRuntimeException(
+                                EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE, exception));

Review Comment:
   Updated.





[GitHub] [flink] 1996fanrui commented on a diff in pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
1996fanrui commented on code in PR #21281:
URL: https://github.com/apache/flink/pull/21281#discussion_r1018967494


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java:
##########
@@ -204,7 +204,8 @@ private void checkFailureAgainstCounter(
             if (continuousFailureCounter.get() > tolerableCpFailureNumber) {
                 clearCount();
                 errorHandler.accept(
-                        new FlinkRuntimeException(EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE));
+                        new FlinkRuntimeException(
+                                EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE, exception));

Review Comment:
   @Myasuka Thanks for your feedback.
   
   You are right, the correct way is to check the full information in the JM log or the checkpoint UI.
   
   I added this for a few reasons:
   
   - Some Flink platforms collect exceptions. When the job fails and the JM stops, users can easily see the root cause of the last checkpoint failure through the exception. At that point the Web UI has already stopped, so this is more convenient than the JM log.
   - Displaying the root cause has no effect on the original logic.
   - When developing features, ITCases are often run without logging enabled. When an ITCase fails, it only shows `Exceeded checkpoint tolerable failure threshold.` without the root cause, which makes it inconvenient to locate the problem. 😂
   
   I also don't think this change is strictly necessary. Please take a look at these reasons, and I will close this PR if it is not needed. Thanks~
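
As an illustration of the first reason, a platform that collects exceptions could walk the cause chain to surface the innermost failure. The following is only a sketch under that assumption: `RootCauseSketch` and `rootCause` are hypothetical names, not part of Flink, and the exception messages are made up.

```java
// Hypothetical helper, not part of Flink: finds the innermost cause of a throwable.
class RootCauseSketch {
    // Follows getCause() links until none is left. The self-reference check
    // guards against a throwable that reports itself as its own cause; longer
    // cycles are not handled in this sketch.
    static Throwable rootCause(Throwable t) {
        Throwable current = t;
        while (current.getCause() != null && current.getCause() != current) {
            current = current.getCause();
        }
        return current;
    }

    public static void main(String[] args) {
        Throwable failure = new RuntimeException(
                "Exceeded checkpoint tolerable failure threshold.",
                new IllegalStateException("checkpoint failed", new java.io.IOException("disk full")));
        System.out.println(rootCause(failure).getMessage());
    }
}
```

This only works when the cause is attached to the thrown exception, which is what the change in this PR originally did.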



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org


[GitHub] [flink] 1996fanrui commented on a diff in pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
1996fanrui commented on code in PR #21281:
URL: https://github.com/apache/flink/pull/21281#discussion_r1023807792


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java:
##########
@@ -204,7 +204,8 @@ private void checkFailureAgainstCounter(
             if (continuousFailureCounter.get() > tolerableCpFailureNumber) {
                 clearCount();
                 errorHandler.accept(
-                        new FlinkRuntimeException(EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE));
+                        new FlinkRuntimeException(
+                                EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE, exception));

Review Comment:
   How about this? We tell users why the latest checkpoint failed and how to check the full checkpoint information.
   
   `exception.getCheckpointFailureReason().message()` returns a short description.
   
   ```
   public static final String EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE =
               "Exceeded checkpoint tolerable failure threshold, the latest checkpoint failed due to %s,"
                       + " view the Checkpoint History tab or the Job Manager log to find out why"
                       + " continuous checkpoints failed.";
   
   new FlinkRuntimeException(String.format(
           EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE,
           exception.getCheckpointFailureReason().message()));
   ```
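
The proposal above can be tried out in isolation with the sketch below. The message template is copied from the snippet. Everything else is an assumption for illustration: `ThresholdMessageSketch`, the `FailureReason` enum, its single constant, and its text are hypothetical stand-ins for the real failure-reason type, and a plain `RuntimeException` replaces `FlinkRuntimeException`.

```java
// Self-contained sketch of the proposed message formatting; names other than
// the message constant are hypothetical.
class ThresholdMessageSketch {
    // Template copied from the proposal above.
    static final String EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE =
            "Exceeded checkpoint tolerable failure threshold, the latest checkpoint failed due to %s,"
                    + " view the Checkpoint History tab or the Job Manager log to find out why"
                    + " continuous checkpoints failed.";

    // Hypothetical stand-in for the failure-reason type; the constant and its
    // description are invented for this example.
    enum FailureReason {
        EXAMPLE_TIMEOUT("the checkpoint timed out");

        private final String message;

        FailureReason(String message) {
            this.message = message;
        }

        String message() {
            return message;
        }
    }

    // Builds the exception with the short reason in the message and no cause
    // attached, so the full stack trace of the last failure is not shown.
    static RuntimeException thresholdExceeded(FailureReason reason) {
        return new RuntimeException(
                String.format(EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE, reason.message()));
    }

    public static void main(String[] args) {
        System.out.println(thresholdExceeded(FailureReason.EXAMPLE_TIMEOUT).getMessage());
    }
}
```

The design choice here is to put only the short reason into the message rather than chaining the last exception, so readers get a hint without mistaking one failure for the cause of all of them.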





[GitHub] [flink] Myasuka commented on a diff in pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
Myasuka commented on code in PR #21281:
URL: https://github.com/apache/flink/pull/21281#discussion_r1023682566


##########
flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointFailureManager.java:
##########
@@ -204,7 +204,8 @@ private void checkFailureAgainstCounter(
             if (continuousFailureCounter.get() > tolerableCpFailureNumber) {
                 clearCount();
                 errorHandler.accept(
-                        new FlinkRuntimeException(EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE));
+                        new FlinkRuntimeException(
+                                EXCEEDED_CHECKPOINT_TOLERABLE_FAILURE_MESSAGE, exception));

Review Comment:
   With my previous suggestion, I was hoping you would remove the `exception` from the `FlinkRuntimeException` to avoid misleading users.





[GitHub] [flink] 1996fanrui commented on pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
1996fanrui commented on PR #21281:
URL: https://github.com/apache/flink/pull/21281#issuecomment-1312386445

   @flinkbot run azure




[GitHub] [flink] 1996fanrui commented on pull request #21281: [FLINK-29969][checkpoint] Show the root cause when exceeded checkpoint tolerable failure threshold

Posted by GitBox <gi...@apache.org>.
1996fanrui commented on PR #21281:
URL: https://github.com/apache/flink/pull/21281#issuecomment-1312451228

   @flinkbot run azure

