Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2021/01/13 21:06:35 UTC

[GitHub] [flink] rkhachatryan opened a new pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

rkhachatryan opened a new pull request #14635:
URL: https://github.com/apache/flink/pull/14635


   ## What is the purpose of the change
   
   Update checkpoint statistics (shown in the web UI) even after a checkpoint fails
   (this would facilitate investigation of issues with slow checkpointing).
   
   With this change, failed checkpoint stats are updated when:
   1. A subtask acks a checkpoint too late or after some other failure. `AsyncCheckpointRunnable` completes normally and reports the snapshot as usual. `CheckpointCoordinator` was updated to handle these calls.
   1. A subtask receives an abortion notification and cancels the runnable before it completes. In this case it only reports the metrics. Both the TM and JM sides were updated and a **new RPC was added**.
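
The two cases above can be sketched as follows. This is a minimal illustration, not the code from this pull request: `FailedCheckpointStatsSketch`, `SubtaskStats`, and both method names are hypothetical, and the `aborted` flag is an assumption modelled on the `(aborted)` marker discussed in the web UI review.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical JobManager-side view of one failed checkpoint; not Flink's real classes. */
class FailedCheckpointStatsSketch {

    /** What is remembered per subtask (illustrative fields only). */
    static final class SubtaskStats {
        final long durationMillis;
        final boolean aborted;

        SubtaskStats(long durationMillis, boolean aborted) {
            this.durationMillis = durationMillis;
            this.aborted = aborted;
        }
    }

    private final Map<Integer, SubtaskStats> statsBySubtask = new HashMap<>();

    /** Case 1: the snapshot finished and was acknowledged, but only after the checkpoint failed. */
    void onLateAcknowledge(int subtaskIndex, long durationMillis) {
        statsBySubtask.put(subtaskIndex, new SubtaskStats(durationMillis, false));
    }

    /** Case 2: the subtask was aborted before finishing, so only its metrics are reported. */
    void onMetricsOnlyReport(int subtaskIndex, long durationMillis) {
        statsBySubtask.put(subtaskIndex, new SubtaskStats(durationMillis, true));
    }

    /** Returns the stats recorded for a subtask, or null if nothing was reported. */
    SubtaskStats statsFor(int subtaskIndex) {
        return statsBySubtask.get(subtaskIndex);
    }
}
```

In this sketch both cases end up in the same per-subtask map, so a UI reading it could show a duration for every subtask that reported anything and mark the metrics-only ones as aborted.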
   
   ## Verifying this change
   
   This change added tests and can be verified as follows:
   - `CheckpointCoordinatorTest.testCheckpointStatsUpdatedAfterFailure`
   - `CheckpointCoordinatorTest.testAbortedCheckpointStatsUpdatedAfterFailure`
   - Manually verified the change by running `DataStreamAllroundTestProgram` on a local cluster:
   ```
   execution.checkpointing.interval: 10s
   execution.checkpointing.min-pause: 1s
   execution.checkpointing.timeout: 1s
   execution.checkpointing.tolerable-failed-checkpoints: 1000000
   execution.checkpointing.unaligned: true
   taskmanager.numberOfTaskSlots: 8
   web.checkpoints.history: 100
   ```
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): no
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: no
     - The serializers: no
     - The runtime per-record code paths (performance sensitive): no
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: yes 
     - The S3 file system connector: no
   
   ## Documentation
   
     - Does this pull request introduce a new feature? no
     - If yes, how is the feature documented? not applicable
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12008",
       "triggerID" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12070",
       "triggerID" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "026f0b5bfd703dfe84504876c5e86cb5b6682307",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12073",
       "triggerID" : "026f0b5bfd703dfe84504876c5e86cb5b6682307",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2e65254e667d4360d6c48a5e2ad9d2b620cdb17d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12080",
       "triggerID" : "2e65254e667d4360d6c48a5e2ad9d2b620cdb17d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "89004265ced3fe4c638a46d58dba64157fa06d08",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12384",
       "triggerID" : "89004265ced3fe4c638a46d58dba64157fa06d08",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2e65254e667d4360d6c48a5e2ad9d2b620cdb17d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12080) 
   * 89004265ced3fe4c638a46d58dba64157fa06d08 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12384) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759741104


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 6362142ac6babb0d846d6dd77bfcf30be0876b3f (Wed Jan 13 21:09:23 UTC 2021)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
    * **This pull request references an unassigned [Jira ticket](https://issues.apache.org/jira/browse/FLINK-19462).** According to the [code contribution guide](https://flink.apache.org/contributing/contribute-code.html), tickets need to be assigned before starting with the implementation work.
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.

   The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>





[GitHub] [flink] rkhachatryan commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-760328916


   Thanks for reviewing, @pnowojski.
   I've addressed your feedback, PTAL.





[GitHub] [flink] rkhachatryan commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r563569096



##########
File path: docs/ops/monitoring/checkpoint_monitoring.md
##########
@@ -61,7 +61,7 @@ Note that for failed checkpoints, metrics are updated on a best efforts basis an
 - **ID**: The ID of the triggered checkpoint. The IDs are incremented for each checkpoint, starting at 1.
 - **Status**: The current status of the checkpoint, which is either *In Progress* (<i aria-hidden="true" class="fa fa-circle-o-notch fa-spin fa-fw"/>), *Completed* (<i aria-hidden="true" class="fa fa-check"/>), or *Failed* (<i aria-hidden="true" class="fa fa-remove"/>). If the triggered checkpoint is a savepoint, you will see a <i aria-hidden="true" class="fa fa-floppy-o"/> symbol.
 - **Trigger Time**: The time when the checkpoint was triggered at the JobManager.
-- **Latest Acknowledgement**: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement received yet). For a failed checkpoint, this is the time from trigger timestamp to failure.
+- **Latest Acknowledgement**: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement received yet).

Review comment:
       I don't know Chinese so I created FLINK-21122 to update it according to the [contribution guide](https://flink.apache.org/contributing/contribute-documentation.html#chinese-documentation-translation).







[GitHub] [flink] pnowojski merged pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
pnowojski merged pull request #14635:
URL: https://github.com/apache/flink/pull/14635


   





[GitHub] [flink] rkhachatryan commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-766707720


   Thanks for the review, @pnowojski.
   I've added the space and created a ticket to translate the docs.
   I've also squashed the commits.
   
   > for example AsyncCheckpointRunnable fails (throws an exception), I can not see any stats for any subtasks that have finished after the failure
   
   As discussed offline, this happens because the failed upstream doesn't send the barrier downstream.





[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12008",
       "triggerID" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12070",
       "triggerID" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "026f0b5bfd703dfe84504876c5e86cb5b6682307",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12073",
       "triggerID" : "026f0b5bfd703dfe84504876c5e86cb5b6682307",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2e65254e667d4360d6c48a5e2ad9d2b620cdb17d",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12080",
       "triggerID" : "2e65254e667d4360d6c48a5e2ad9d2b620cdb17d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "89004265ced3fe4c638a46d58dba64157fa06d08",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12384",
       "triggerID" : "89004265ced3fe4c638a46d58dba64157fa06d08",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 89004265ced3fe4c638a46d58dba64157fa06d08 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12384) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12008",
       "triggerID" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12070",
       "triggerID" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "026f0b5bfd703dfe84504876c5e86cb5b6682307",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "026f0b5bfd703dfe84504876c5e86cb5b6682307",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * fcc06b3d273f1c4a1ae2725034684793c192c97e Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12070) 
   * 026f0b5bfd703dfe84504876c5e86cb5b6682307 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] rkhachatryan commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r557474839



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java
##########
@@ -1051,6 +1051,34 @@ public void acknowledgeCheckpoint(
         }
     }
 
+    @Override
+    public void reportCheckpointMetrics(
+            JobID jobID, ExecutionAttemptID attemptId, long id, CheckpointMetrics metrics) {
+        mainThreadExecutor.assertRunningInMainThread();
+
+        final CheckpointCoordinator checkpointCoordinator =
+                executionGraph.getCheckpointCoordinator();
+
+        if (checkpointCoordinator != null) {
+            ioExecutor.execute(
+                    () -> {
+                        try {
+                            checkpointCoordinator.reportStats(id, attemptId, metrics);
+                        } catch (Throwable t) {
+                            log.warn("Error while processing report checkpoint stats message", t);
+                        }
+                    });
+        } else {
+            String errorMessage =
+                    "Received ReportCheckpointStats message for job {} with no CheckpointCoordinator";
+            if (executionGraph.getState() == JobStatus.RUNNING) {
+                log.error(errorMessage, jobGraph.getJobID());
+            } else {
+                log.debug(errorMessage, jobGraph.getJobID());
+            }
+        }
+    }

Review comment:
       Good idea!
   (I'll extract a method in a separate commit, as there are at least 2 existing methods already.)
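
The review suggestion being answered here is not shown in this thread. One plausible reading, given the diff above, is that the "no `CheckpointCoordinator`" logging branch is repeated across several handler methods and could be shared. The sketch below only illustrates that kind of extraction: `MissingCoordinatorLogSketch`, its trimmed-down `JobStatus` enum, and the in-memory `logged` list standing in for a real logger are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of sharing the "no coordinator" logging between handlers. */
class MissingCoordinatorLogSketch {

    /** Trimmed-down stand-in for the job status; only what the sketch needs. */
    enum JobStatus { RUNNING, FAILED, FINISHED }

    /** Stand-in for a logger so the sketch stays self-contained. */
    final List<String> logged = new ArrayList<>();

    /**
     * Logs at ERROR level when the job is running (a missing coordinator is unexpected),
     * and at DEBUG level otherwise (the message probably arrived after shutdown).
     */
    void logMissingCoordinator(String messageName, String jobId, JobStatus status) {
        String level = (status == JobStatus.RUNNING) ? "ERROR" : "DEBUG";
        logged.add(level + ": Received " + messageName + " message for job " + jobId
                + " with no CheckpointCoordinator");
    }
}
```

Each handler would then call the shared method with its own message name instead of repeating the `if`/`else` on the job state.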










[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12008",
       "triggerID" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6362142ac6babb0d846d6dd77bfcf30be0876b3f Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12008) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12008",
       "triggerID" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12070",
       "triggerID" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "026f0b5bfd703dfe84504876c5e86cb5b6682307",
       "status" : "FAILURE",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12073",
       "triggerID" : "026f0b5bfd703dfe84504876c5e86cb5b6682307",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2e65254e667d4360d6c48a5e2ad9d2b620cdb17d",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12080",
       "triggerID" : "2e65254e667d4360d6c48a5e2ad9d2b620cdb17d",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 026f0b5bfd703dfe84504876c5e86cb5b6682307 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12073) 
   * 2e65254e667d4360d6c48a5e2ad9d2b620cdb17d Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12080) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] pnowojski commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r563513306



##########
File path: flink-runtime-web/web-dashboard/src/app/pages/job/checkpoints/subtask/job-checkpoints-subtask.component.html
##########
@@ -99,7 +99,7 @@
         <td>{{ subTask['index'] }}</td>
         <ng-container *ngIf="subTask['status'] == 'completed'">
           <td >{{ subTask['ack_timestamp'] | date:'yyyy-MM-dd HH:mm:ss' }}</td>
-          <td>{{ subTask['end_to_end_duration'] | humanizeDuration}}</td>
+          <td>{{ subTask['end_to_end_duration'] | humanizeDuration}}<span *ngIf="subTask['aborted']">(aborted)</span></td>

Review comment:
       Add a space in front of `(aborted)`?

##########
File path: docs/ops/monitoring/checkpoint_monitoring.md
##########
@@ -61,7 +61,7 @@ Note that for failed checkpoints, metrics are updated on a best efforts basis an
 - **ID**: The ID of the triggered checkpoint. The IDs are incremented for each checkpoint, starting at 1.
 - **Status**: The current status of the checkpoint, which is either *In Progress* (<i aria-hidden="true" class="fa fa-circle-o-notch fa-spin fa-fw"/>), *Completed* (<i aria-hidden="true" class="fa fa-check"/>), or *Failed* (<i aria-hidden="true" class="fa fa-remove"/>). If the triggered checkpoint is a savepoint, you will see a <i aria-hidden="true" class="fa fa-floppy-o"/> symbol.
 - **Trigger Time**: The time when the checkpoint was triggered at the JobManager.
-- **Latest Acknowledgement**: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement received yet). For a failed checkpoint, this is the time from trigger timestamp to failure.
+- **Latest Acknowledgement**: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement received yet).

Review comment:
       Update the `.zh.md` version as well?







[GitHub] [flink] flinkbot commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6362142ac6babb0d846d6dd77bfcf30be0876b3f UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12008",
       "triggerID" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 6362142ac6babb0d846d6dd77bfcf30be0876b3f Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12008) 
   * fcc06b3d273f1c4a1ae2725034684793c192c97e UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>








[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12008",
       "triggerID" : "6362142ac6babb0d846d6dd77bfcf30be0876b3f",
       "triggerType" : "PUSH"
     }, {
       "hash" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12070",
       "triggerID" : "fcc06b3d273f1c4a1ae2725034684793c192c97e",
       "triggerType" : "PUSH"
     }, {
       "hash" : "026f0b5bfd703dfe84504876c5e86cb5b6682307",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12073",
       "triggerID" : "026f0b5bfd703dfe84504876c5e86cb5b6682307",
       "triggerType" : "PUSH"
     }, {
       "hash" : "2e65254e667d4360d6c48a5e2ad9d2b620cdb17d",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12080",
       "triggerID" : "2e65254e667d4360d6c48a5e2ad9d2b620cdb17d",
       "triggerType" : "PUSH"
     }, {
       "hash" : "89004265ced3fe4c638a46d58dba64157fa06d08",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "89004265ced3fe4c638a46d58dba64157fa06d08",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 2e65254e667d4360d6c48a5e2ad9d2b620cdb17d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12080) 
   * 89004265ced3fe4c638a46d58dba64157fa06d08 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] rkhachatryan commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r557451532



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointMetrics.java
##########
@@ -46,8 +48,11 @@
     /** Is the checkpoint completed as an unaligned checkpoint. */
     private final boolean unalignedCheckpoint;
 
+    private final long totalBytesPersisted;
+
+    @VisibleForTesting
     public CheckpointMetrics() {
-        this(-1L, -1L, -1L, -1L, -1L, -1L, false);
+        this(-1L, -1L, -1L, -1L, -1L, -1L, false, 0L);

Review comment:
       A negative state size is not allowed (and I think this is a valid check),
   so some tests failed with `-1L` and I changed it to `0L`.







[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * bd8d680c1b6886f053f511793502c946f2c899e5 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12454) 
   



[GitHub] [flink] rkhachatryan commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r557464399



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/FailedCheckpointStats.java
##########
@@ -87,15 +72,20 @@
             @Nullable SubtaskStateStats latestAcknowledgedSubtask,
             @Nullable Throwable cause) {
 
-        super(checkpointId, triggerTimestamp, props, totalSubtaskCount, taskStats);
+        super(
+                checkpointId,
+                triggerTimestamp,
+                props,
+                totalSubtaskCount,
+                numAcknowledgedSubtasks,
+                taskStats,
+                PendingCheckpointStatsCallback.noOp(),

Review comment:
       First, a `PendingCheckpointStats` is created. It has "normal" callbacks, and subtask reports update it as usual.
   Upon failure, it is converted to a `FailedCheckpointStats` using one of these "normal" callbacks.
   This new object won't be converted again, so it has `noOp` callbacks. It can still be updated from subtasks.
   
   I tried to refactor this code to have just one `CheckpointStats` class without conversions and callbacks,
   but the change was too big compared to the functionality added, so I dropped it.
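
The lifecycle described above can be sketched roughly as follows. This is a simplified, hypothetical model and not the actual Flink classes: `SketchPendingStats`, `SketchFailedStats`, and their methods are invented for illustration, and the inheritance relationship is only inferred from the `super(...)` call in the diff.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Consumer;

/** Hypothetical stand-in for the pending stats object; not a Flink class. */
class SketchPendingStats {
    protected final Map<Integer, Long> subtaskStateSizes = new HashMap<>();
    private final Consumer<SketchFailedStats> onFailed;

    SketchPendingStats(Consumer<SketchFailedStats> onFailed) {
        this.onFailed = onFailed;
    }

    /** Subtask reports are accepted both before and after the failure. */
    void reportSubtaskStats(int subtaskIndex, long stateSize) {
        subtaskStateSizes.put(subtaskIndex, stateSize);
    }

    int reportedSubtasks() {
        return subtaskStateSizes.size();
    }

    /** Converts to the failed form and hands it to the "normal" callback. */
    SketchFailedStats fail(Throwable cause) {
        SketchFailedStats failed = new SketchFailedStats(subtaskStateSizes, cause);
        onFailed.accept(failed);
        return failed;
    }
}

/** Hypothetical stand-in for the failed stats object; it is never converted again. */
class SketchFailedStats extends SketchPendingStats {
    final Throwable cause;

    SketchFailedStats(Map<Integer, Long> alreadyReported, Throwable cause) {
        super(ignored -> {}); // no-op callback: nothing to notify on further conversion
        this.subtaskStateSizes.putAll(alreadyReported);
        this.cause = cause;
    }
}
```

In this sketch, a late report simply calls `reportSubtaskStats` on the failed object, which mirrors the point that the failed stats "can still be updated from subtasks".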
   
   







[GitHub] [flink] rkhachatryan commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-765417912


   I've updated the PR (adding 4 new commits):
   1. Tasks that report upon the abort RPC are marked as `aborted` in the e2e duration column
   2. Only tasks that actually ACKed the checkpoint are counted for ackCount and lastAckTime
   3. `-1B` is shown as `-` (the same way as durations)
   4. Fix the docs
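
Item 3 can be pictured with a small formatting helper. This is a hypothetical Java sketch only: the real change is in the web UI, and `StatsFormat` and `formatBytes` are made-up names. The assumption is that any negative size means "not reported" and is rendered as `-`.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Flink. */
final class StatsFormat {
    private StatsFormat() {}

    /** Renders a byte count, showing "-" when the size is unknown (negative). */
    static String formatBytes(long bytes) {
        if (bytes < 0) {
            return "-";
        }
        if (bytes < 1024) {
            return bytes + " B";
        }
        String[] units = {"KB", "MB", "GB", "TB"};
        double value = bytes;
        int unit = -1;
        // Divide by 1024 until the value fits the current unit or units run out.
        while (value >= 1024 && unit < units.length - 1) {
            value /= 1024;
            unit++;
        }
        return String.format(Locale.ROOT, "%.1f %s", value, units[unit]);
    }
}
```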
   
   ![image](https://user-images.githubusercontent.com/3939322/105499876-669e6700-5cc2-11eb-8d99-b301a83a548c.png)
   
   cc: @NicoK





[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * 3618093d3ed7359710c9f73ced3031ff2ab8def8 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12389) 
   * bd8d680c1b6886f053f511793502c946f2c899e5 UNKNOWN
   



[GitHub] [flink] rkhachatryan commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r557549401



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/FailedCheckpointStats.java
##########
@@ -87,15 +72,20 @@
             @Nullable SubtaskStateStats latestAcknowledgedSubtask,
             @Nullable Throwable cause) {
 
-        super(checkpointId, triggerTimestamp, props, totalSubtaskCount, taskStats);
+        super(
+                checkpointId,
+                triggerTimestamp,
+                props,
+                totalSubtaskCount,
+                numAcknowledgedSubtasks,
+                taskStats,
+                PendingCheckpointStatsCallback.noOp(),

Review comment:
       I replaced `noOp` with a static one throwing `UnsupportedOperationException`.
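
A minimal sketch of what such a shared, throwing callback could look like. The interface and its method names here are hypothetical and do not reproduce the actual `PendingCheckpointStatsCallback`.

```java
/** Hypothetical callback interface for illustration; not the Flink one. */
interface SketchStatsCallback {
    void reportCompleted();

    void reportFailed();

    /** Shared instance for stats objects that must never be converted again. */
    SketchStatsCallback UNSUPPORTED =
            new SketchStatsCallback() {
                @Override
                public void reportCompleted() {
                    throw new UnsupportedOperationException("checkpoint already failed");
                }

                @Override
                public void reportFailed() {
                    throw new UnsupportedOperationException("checkpoint already failed");
                }
            };
}
```

Compared with a silent no-op, a throwing instance makes an accidental second conversion fail fast instead of going unnoticed.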







[GitHub] [flink] rkhachatryan commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r557454005



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/StateUtil.java
##########
@@ -70,9 +70,11 @@ public static void bestEffortDiscardAllStateObjects(
      *
      * @param stateFuture to be discarded
      * @throws Exception if the discard operation failed
+     * @return the size of state before cancellation (if available)
      */
-    public static void discardStateFuture(Future<? extends StateObject> stateFuture)
+    public static long discardStateFuture(Future<? extends StateObject> stateFuture)
             throws Exception {
+        long stateSize = 0;

Review comment:
       It would be problematic to combine `-1` with valid values.
   I think we should just mention in the docs that this number is best-effort if the checkpoint isn't completed.
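
To illustrate the concern with mixing a `-1` sentinel into real sizes, here is a hypothetical aggregation (the class `SizeAggregation` and its `total` method are invented for this example and do not exist in Flink):

```java
/** Hypothetical aggregation for illustration; not part of Flink. */
final class SizeAggregation {
    private SizeAggregation() {}

    /** Naively sums per-future state sizes, as an aggregating caller might do. */
    static long total(long... sizes) {
        long sum = 0;
        for (long size : sizes) {
            sum += size;
        }
        return sum;
    }
}
```

With `-1` for the unknown entry, `SizeAggregation.total(100, -1, 50)` yields 149, a plausible-looking but wrong number. With `0`, `SizeAggregation.total(100, 0, 50)` yields 150, which is at least a valid lower bound.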







[GitHub] [flink] rkhachatryan commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r563569096



##########
File path: docs/ops/monitoring/checkpoint_monitoring.md
##########
@@ -61,7 +61,7 @@ Note that for failed checkpoints, metrics are updated on a best efforts basis an
 - **ID**: The ID of the triggered checkpoint. The IDs are incremented for each checkpoint, starting at 1.
 - **Status**: The current status of the checkpoint, which is either *In Progress* (<i aria-hidden="true" class="fa fa-circle-o-notch fa-spin fa-fw"/>), *Completed* (<i aria-hidden="true" class="fa fa-check"/>), or *Failed* (<i aria-hidden="true" class="fa fa-remove"/>). If the triggered checkpoint is a savepoint, you will see a <i aria-hidden="true" class="fa fa-floppy-o"/> symbol.
 - **Trigger Time**: The time when the checkpoint was triggered at the JobManager.
-- **Latest Acknowledgement**: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement received yet). For a failed checkpoint, this is the time from trigger timestamp to failure.
+- **Latest Acknowledgement**: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement received yet).

Review comment:
       I don't know Chinese so I created FLINK-21122 to update it according to the [contribution guide](https://flink.apache.org/contributing/contribute-documentation.html#chinese-documentation-translation).







[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * 6362142ac6babb0d846d6dd77bfcf30be0876b3f Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12008) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759741104


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit bd8d680c1b6886f053f511793502c946f2c899e5 (Fri May 28 08:06:00 UTC 2021)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    <summary>Bot commands</summary>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer or PMC member is required.
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>





[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * 026f0b5bfd703dfe84504876c5e86cb5b6682307 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12073) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * fcc06b3d273f1c4a1ae2725034684793c192c97e Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12070) 
   * 026f0b5bfd703dfe84504876c5e86cb5b6682307 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12073) 
   



[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * 3618093d3ed7359710c9f73ced3031ff2ab8def8 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12389) 
   



[GitHub] [flink] NicoK commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
NicoK commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-765514464


   So, as soon as we are through the sync phase, we will get stats (if the CP is aborted during the sync phase, that won't interrupt the sync part anyway and will wait for it to complete). If we didn't reach the sync phase yet, the timeout could be because of slowly moving barriers (no barrier was received yet) or slow alignment (some barriers received but not all). These could be derived from looking at backpressure or data skew or starting times of other subtasks or timings from previous subtasks.
   
   I think, the current state is a good step forward and the stats look good :+1: 





[GitHub] [flink] rkhachatryan commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r557470014



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/SubtaskStateStats.java
##########
@@ -75,7 +75,6 @@
 
         checkArgument(subtaskIndex >= 0, "Negative subtask index");
         this.subtaskIndex = subtaskIndex;
-        checkArgument(stateSize >= 0, "Negative state size");

Review comment:
       I removed it for debugging, will restore. Thanks for pointing out!







[GitHub] [flink] rkhachatryan commented on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
rkhachatryan commented on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-763414913


   Thanks a lot for trying it out.
   > I think it's strictly necessary to:
   > clearly mark which checkpoint for which subtask has failed
   
   It is not always the task that fails a checkpoint. The timeout decision is made by the `CheckpointCoordinator`.
   Multiple tasks can fail independently as well.
   I agree that marking "failed" tasks would be useful, but I don't think it's directly related to this feature, or at least to this PR.
   
   > if we were not able to collect/calculate a metric, it must be N/A - not just 0ms
   
   I don't see `0ms` in your screenshots, nor when running locally. Do you mean `0 B` per operator?
   If so, why is it incorrect? (I do see non-zero sizes when running on a cluster.)
   
   > correctly calculate the durations (end to end, sync, async, etc...) also for failed checkpoints, not just N/A
   
   A checkpoint can be cancelled before even being started on some subtasks. 





[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * 3618093d3ed7359710c9f73ced3031ff2ab8def8 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12389) 
   * bd8d680c1b6886f053f511793502c946f2c899e5 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12454) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>





[GitHub] [flink] pnowojski commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r563513306



##########
File path: flink-runtime-web/web-dashboard/src/app/pages/job/checkpoints/subtask/job-checkpoints-subtask.component.html
##########
@@ -99,7 +99,7 @@
         <td>{{ subTask['index'] }}</td>
         <ng-container *ngIf="subTask['status'] == 'completed'">
           <td >{{ subTask['ack_timestamp'] | date:'yyyy-MM-dd HH:mm:ss' }}</td>
-          <td>{{ subTask['end_to_end_duration'] | humanizeDuration}}</td>
+          <td>{{ subTask['end_to_end_duration'] | humanizeDuration}}<span *ngIf="subTask['aborted']">(aborted)</span></td>

Review comment:
       Add a space in front of `(aborted)`?

##########
File path: docs/ops/monitoring/checkpoint_monitoring.md
##########
@@ -61,7 +61,7 @@ Note that for failed checkpoints, metrics are updated on a best efforts basis an
 - **ID**: The ID of the triggered checkpoint. The IDs are incremented for each checkpoint, starting at 1.
 - **Status**: The current status of the checkpoint, which is either *In Progress* (<i aria-hidden="true" class="fa fa-circle-o-notch fa-spin fa-fw"/>), *Completed* (<i aria-hidden="true" class="fa fa-check"/>), or *Failed* (<i aria-hidden="true" class="fa fa-remove"/>). If the triggered checkpoint is a savepoint, you will see a <i aria-hidden="true" class="fa fa-floppy-o"/> symbol.
 - **Trigger Time**: The time when the checkpoint was triggered at the JobManager.
-- **Latest Acknowledgement**: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement received yet). For a failed checkpoint, this is the time from trigger timestamp to failure.
+- **Latest Acknowledgement**: The time when the latest acknowledgement for any subtask was received at the JobManager (or n/a if no acknowledgement received yet).

Review comment:
       Update the `.zh.md` version as well?







[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * 89004265ced3fe4c638a46d58dba64157fa06d08 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12384) 
   * 3618093d3ed7359710c9f73ced3031ff2ab8def8 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12389) 
   





[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * 89004265ced3fe4c638a46d58dba64157fa06d08 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12384) 
   * 3618093d3ed7359710c9f73ced3031ff2ab8def8 UNKNOWN
   





[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * 2e65254e667d4360d6c48a5e2ad9d2b620cdb17d Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12080) 
   





[GitHub] [flink] flinkbot edited a comment on pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14635:
URL: https://github.com/apache/flink/pull/14635#issuecomment-759750009


   ## CI report:
   
   * 026f0b5bfd703dfe84504876c5e86cb5b6682307 Azure: [FAILURE](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=12073) 
   * 2e65254e667d4360d6c48a5e2ad9d2b620cdb17d UNKNOWN
   





[GitHub] [flink] pnowojski commented on a change in pull request #14635: [FLINK-19462][checkpointing] Update failed checkpoint stats

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #14635:
URL: https://github.com/apache/flink/pull/14635#discussion_r557327924



##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java
##########
@@ -312,15 +312,17 @@ public void notifyCheckpointComplete(
             long checkpointId, OperatorChain<?, ?> operatorChain, Supplier<Boolean> isRunning)
             throws Exception {
         if (isRunning.get()) {
-            LOG.debug("Notification of complete checkpoint for task {}", taskName);
+            LOG.debug(
+                    "Notification of completed checkpoint {} for task {}", taskName, checkpointId);

Review comment:
       Swap `taskName` and `checkpointId`? The first `{}` is for the checkpoint, but `taskName` is passed first.
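
The effect of the argument order can be illustrated with a minimal sketch. It assumes only that `{}` placeholders are filled left to right, which is how SLF4J-style loggers behave. The `format` helper and the class name are hypothetical stand-ins written for this illustration, not part of Flink or SLF4J.

```java
final class PlaceholderOrderSketch {

    /** Hypothetical stand-in for "{}" substitution: fills placeholders left to right. */
    static String format(String pattern, Object... args) {
        StringBuilder out = new StringBuilder();
        int from = 0;
        for (Object arg : args) {
            int at = pattern.indexOf("{}", from);
            if (at < 0) {
                break; // more arguments than placeholders: ignore the rest
            }
            out.append(pattern, from, at).append(arg);
            from = at + 2;
        }
        // Append whatever follows the last filled placeholder.
        return out.append(pattern.substring(from)).toString();
    }

    public static void main(String[] args) {
        String pattern = "Notification of aborted checkpoint {} for task {}";
        String taskName = "Map (1/4)"; // made-up task name
        long checkpointId = 42L;       // made-up checkpoint ID

        // Order used in the diff: the task name lands in the checkpoint slot.
        System.out.println(format(pattern, taskName, checkpointId));
        // Order suggested in the review: each value lands in its own slot.
        System.out.println(format(pattern, checkpointId, taskName));
    }
}
```

With the arguments as written in the diff, the message would read "checkpoint Map (1/4) for task 42", which is why the review asks for the swap.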

##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/SubtaskCheckpointCoordinatorImpl.java
##########
@@ -333,7 +335,7 @@ public void notifyCheckpointAborted(
 
         Exception previousException = null;
         if (isRunning.get()) {
-            LOG.debug("Notification of aborted checkpoint for task {}", taskName);
+            LOG.debug("Notification of aborted checkpoint {} for task {}", taskName, checkpointId);

Review comment:
       Swap `taskName` and `checkpointId`? The first `{}` is for the checkpoint, but `taskName` is passed first.

##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/OperatorSnapshotFutures.java
##########
@@ -155,20 +155,22 @@ public void setResultSubpartitionStateFuture(
         this.resultSubpartitionStateFuture = resultSubpartitionStateFuture;
     }
 
-    public void cancel() throws Exception {
+    /** @return discarded state size (if available). */
+    public long cancel() throws Exception {

Review comment:
       Can this be unit tested? Maybe in `OperatorSnapshotFuturesTest`?

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/SubtaskStateStats.java
##########
@@ -75,7 +75,6 @@
 
         checkArgument(subtaskIndex >= 0, "Negative subtask index");
         this.subtaskIndex = subtaskIndex;
-        checkArgument(stateSize >= 0, "Negative state size");

Review comment:
       Why was this removed?

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/StateUtil.java
##########
@@ -70,9 +70,11 @@ public static void bestEffortDiscardAllStateObjects(
      *
      * @param stateFuture to be discarded
      * @throws Exception if the discard operation failed
+     * @return the size of state before cancellation (if available)
      */
-    public static void discardStateFuture(Future<? extends StateObject> stateFuture)
+    public static long discardStateFuture(Future<? extends StateObject> stateFuture)
             throws Exception {
+        long stateSize = 0;

Review comment:
       Hmm, maybe `-1`? I don't know which would be better. We could print `-1` as `N/A` in the web UI.
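
One way to picture the `-1` option is the sketch below. It is only an illustration of the sentinel idea under stated assumptions: the names `UNKNOWN`, `renderSize` and `sumKnown` are hypothetical and do not come from the PR, and the real web UI is not written in Java.

```java
final class UnknownSizeSketch {

    /** Hypothetical sentinel meaning "the size could not be determined". */
    static final long UNKNOWN = -1L;

    /** Renders a byte count, or "N/A" when the value is the sentinel. */
    static String renderSize(long bytes) {
        return bytes == UNKNOWN ? "N/A" : bytes + " B";
    }

    /**
     * Sums only the known sizes so that the sentinel never leaks into a total.
     * Returns UNKNOWN when no size is known at all.
     */
    static long sumKnown(long... sizes) {
        long total = 0L;
        boolean anyKnown = false;
        for (long size : sizes) {
            if (size != UNKNOWN) {
                total += size;
                anyKnown = true;
            }
        }
        return anyKnown ? total : UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(renderSize(0L));                         // a real, measured zero
        System.out.println(renderSize(UNKNOWN));                    // nothing was measured
        System.out.println(renderSize(sumKnown(10L, UNKNOWN, 5L))); // total of the known sizes
    }
}
```

The trade-off is that a sentinel keeps "measured zero" and "not measured" apart, but every aggregation and every argument check (such as a non-negative precondition) then has to be aware of it.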

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/CheckpointMetrics.java
##########
@@ -46,8 +48,11 @@
     /** Is the checkpoint completed as an unaligned checkpoint. */
     private final boolean unalignedCheckpoint;
 
+    private final long totalBytesPersisted;
+
+    @VisibleForTesting
     public CheckpointMetrics() {
-        this(-1L, -1L, -1L, -1L, -1L, -1L, false);
+        this(-1L, -1L, -1L, -1L, -1L, -1L, false, 0L);

Review comment:
       nit: `-1L` for the sake of consistency?

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/FailedCheckpointStats.java
##########
@@ -87,15 +72,20 @@
             @Nullable SubtaskStateStats latestAcknowledgedSubtask,
             @Nullable Throwable cause) {
 
-        super(checkpointId, triggerTimestamp, props, totalSubtaskCount, taskStats);
+        super(
+                checkpointId,
+                triggerTimestamp,
+                props,
+                totalSubtaskCount,
+                numAcknowledgedSubtasks,
+                taskStats,
+                PendingCheckpointStatsCallback.noOp(),

Review comment:
       I don't understand this. How is this supposed to work? How are stats being reported for failed checkpoints? I get that in this commit you would like to use a temporary `noOp`, but I don't see it being replaced later. Is something missing, is this dead code, or am I missing something?

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/state/StateUtil.java
##########
@@ -70,9 +70,11 @@ public static void bestEffortDiscardAllStateObjects(
      *
      * @param stateFuture to be discarded
      * @throws Exception if the discard operation failed
+     * @return the size of state before cancellation (if available)
      */
-    public static void discardStateFuture(Future<? extends StateObject> stateFuture)
+    public static long discardStateFuture(Future<? extends StateObject> stateFuture)

Review comment:
       Can this be unit tested? Maybe in `OperatorSnapshotFuturesTest`?
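
A rough sketch of what such a test could exercise, written against plain `java.util.concurrent` types rather than Flink's classes. `FakeState` and `discard` are hypothetical stand-ins for `StateObject` and `discardStateFuture`, the logic is simplified, and returning `0` when no state is available mirrors the diff above rather than a settled decision.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Future;

final class DiscardFutureSketch {

    /** Hypothetical stand-in for a state object that knows its size and can be discarded. */
    static final class FakeState {
        final long size;
        boolean discarded;

        FakeState(long size) {
            this.size = size;
        }

        void discard() {
            discarded = true;
        }
    }

    /**
     * Cancels the future if it has not completed yet; otherwise discards its result.
     * Returns the size of the discarded state, or 0 if no state was available.
     */
    static long discard(Future<FakeState> future) throws Exception {
        if (future == null) {
            return 0L;
        }
        if (future.cancel(true)) {
            return 0L; // cancelled before producing any state
        }
        // cancel() returned false, so the future had already completed.
        FakeState state = future.get();
        if (state == null) {
            return 0L;
        }
        state.discard();
        return state.size;
    }

    public static void main(String[] args) throws Exception {
        FakeState state = new FakeState(128L);
        System.out.println(discard(CompletableFuture.completedFuture(state)));
        System.out.println(state.discarded);
        System.out.println(discard(new CompletableFuture<FakeState>()));
    }
}
```

A unit test along these lines would cover at least two cases: a completed future (the size is returned and the state is discarded) and a pending future (it is cancelled and no size is reported).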

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/checkpoint/PendingCheckpointStats.java
##########
@@ -60,6 +61,39 @@
     /** Stats of the latest acknowledged subtask. */
     private volatile SubtaskStateStats latestAcknowledgedSubtask;
 
+    /**
+     * Creates a tracker for a {@link PendingCheckpoint}.
+     *
+     * @param checkpointId ID of the checkpoint.
+     * @param triggerTimestamp Timestamp when the checkpoint was triggered.
+     * @param props Checkpoint properties of the checkpoint.
+     * @param taskStats Task stats for each involved operator.
+     * @param trackerCallback Callback for the {@link CheckpointStatsTracker}.
+     */
+    PendingCheckpointStats(
+            long checkpointId,
+            long triggerTimestamp,
+            CheckpointProperties props,
+            Map<JobVertexID, Integer> taskStats,
+            CheckpointStatsTracker.PendingCheckpointStatsCallback trackerCallback) {
+        this(
+                checkpointId,
+                triggerTimestamp,
+                props,
+                taskStats.values().stream().mapToInt(i -> i).sum(),
+                0,
+                taskStats.entrySet().stream()
+                        .collect(
+                                toConcurrentMap(
+                                        Map.Entry::getKey,
+                                        e -> new TaskStateStats(e.getKey(), e.getValue()))),
+                trackerCallback,
+                0,
+                0,
+                0,
+                null);

Review comment:
       I cannot find a constructor matching this call, either in this commit or on master. Is this a rebasing issue?

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/scheduler/SchedulerBase.java
##########
@@ -1051,6 +1051,34 @@ public void acknowledgeCheckpoint(
         }
     }
 
+    @Override
+    public void reportCheckpointMetrics(
+            JobID jobID, ExecutionAttemptID attemptId, long id, CheckpointMetrics metrics) {
+        mainThreadExecutor.assertRunningInMainThread();
+
+        final CheckpointCoordinator checkpointCoordinator =
+                executionGraph.getCheckpointCoordinator();
+
+        if (checkpointCoordinator != null) {
+            ioExecutor.execute(
+                    () -> {
+                        try {
+                            checkpointCoordinator.reportStats(id, attemptId, metrics);
+                        } catch (Throwable t) {
+                            log.warn("Error while processing report checkpoint stats message", t);
+                        }
+                    });
+        } else {
+            String errorMessage =
+                    "Received ReportCheckpointStats message for job {} with no CheckpointCoordinator";
+            if (executionGraph.getState() == JobStatus.RUNNING) {
+                log.error(errorMessage, jobGraph.getJobID());
+            } else {
+                log.debug(errorMessage, jobGraph.getJobID());
+            }
+        }
+    }

Review comment:
       Could we deduplicate those lines (error handling and logging) with `acknowledgeCheckpoint`? They seem to differ only by a lambda function and string messages, which maybe could be easily extracted as parameters?
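
The suggested extraction could look roughly like the sketch below. It is a simplified illustration, not the PR's code: `withCoordinator`, `ThrowingConsumer`, the generic coordinator type and the in-memory `log` list are all hypothetical, and real code would use the scheduler's logger and `JobStatus` instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executor;

final class CoordinatorCallSketch<C> {

    /** Hypothetical action on the coordinator that may throw. */
    interface ThrowingConsumer<T> {
        void accept(T value) throws Exception;
    }

    /** Collected log lines; a stand-in for a real logger. */
    final List<String> log = new ArrayList<>();

    private final C coordinator; // null when no coordinator exists
    private final Executor ioExecutor;
    private final boolean jobRunning;

    CoordinatorCallSketch(C coordinator, Executor ioExecutor, boolean jobRunning) {
        this.coordinator = coordinator;
        this.ioExecutor = ioExecutor;
        this.jobRunning = jobRunning;
    }

    /** Shared null check, executor hand-off and error logging; only the action and name vary. */
    void withCoordinator(String messageName, ThrowingConsumer<C> action) {
        if (coordinator != null) {
            ioExecutor.execute(
                    () -> {
                        try {
                            action.accept(coordinator);
                        } catch (Throwable t) {
                            log.add("WARN Error while processing " + messageName + " message: " + t);
                        }
                    });
        } else {
            // A missing coordinator is only surprising while the job is running.
            String level = jobRunning ? "ERROR" : "DEBUG";
            log.add(level + " Received " + messageName + " message with no CheckpointCoordinator");
        }
    }

    public static void main(String[] args) {
        CoordinatorCallSketch<StringBuilder> scheduler =
                new CoordinatorCallSketch<>(new StringBuilder(), Runnable::run, true);
        scheduler.withCoordinator("AcknowledgeCheckpoint", c -> c.append("ack;"));
        scheduler.withCoordinator("ReportCheckpointStats", c -> c.append("stats;"));
        System.out.println(scheduler.log);
    }
}
```

Each RPC handler then reduces to a single call that passes its message name and a lambda, which is the shape the comment hints at.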



