You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2020/01/23 08:32:03 UTC

[GitHub] [flink] pnowojski opened a new pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

pnowojski opened a new pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931
 
 
   Introduce a `checkpointStartDelay` metric: the time in nanoseconds that elapsed between the creation of the last checkpoint and the time when the checkpointing process has started by this Task. This delay shows how long it takes for a first checkpoint barrier to reach the task. Back-pressure will increase this value.
   
   ## Verifying this change
   
   This change is hard to test, however I've added some rudimentary checks in `CheckpointBarrierAlignerTestBase`
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (yes / **no**)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (**yes** / no)
     - If yes, how is the feature documented? (not applicable / **docs** / JavaDocs / not documented)
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] pnowojski merged pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
pnowojski merged pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931
 
 
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r369989970
 
 

 ##########
 File path: docs/monitoring/metrics.md
 ##########
 @@ -1341,11 +1341,16 @@ Metrics related to data exchange between task executors using netty network comm
       <td>Gauge</td>
     </tr>
     <tr>
-      <th rowspan="1">Task</th>
+      <th rowspan="2">Task</th>
       <td>checkpointAlignmentTime</td>
       <td>The time in nanoseconds that the last barrier alignment took to complete, or how long the current alignment has taken so far (in nanoseconds).</td>
       <td>Gauge</td>
     </tr>
+    <tr>
+      <td>checkpointStartDelay</td>
+      <td>The time in nanoseconds that elapsed between the creation of the last checkpoint and the time when the checkpointing process has started by this Task. This delay shows how long it takes for a first checkpoint barrier to reach the task. Back-pressure will increase this value.</td>
 
 Review comment:
   This delay shows how long it takes for **the** first checkpoint barrier to reach the task.
   A high value indicates back-pressure. If only a specific task has a long start delay, the most likely reason is data skew.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r370031660
 
 

 ##########
 File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierHandler.java
 ##########
 @@ -95,4 +101,10 @@ protected void notifyAbortOnCancellationBarrier(long checkpointId) throws Except
 	protected void notifyAbort(long checkpointId, CheckpointException cause) throws Exception {
 		toNotifyOnCheckpoint.abortCheckpointOnBarrier(checkpointId, cause);
 	}
+
+	protected void markCheckpointStart(long checkpointCreationTimestamp) {
+		latestCheckpointStartDelayNanos = 1000_000 * Math.max(
 
 Review comment:
   > It's in the `ns` due to consistency with the existing method/field `getAlignmentDurationNanos` and `checkpointAlignmentTime` metric.
   
   I'd be okay with leaving as is. However, I don't think consistency is needed as the unit is carried in the name.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#issuecomment-577584540
 
 
   <!--
   Meta data
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/145720649 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   Hash:753e1073d4b9d298c8e5b50a3ddc1815c8b36a40 Status:UNKNOWN URL:TBD TriggerType:PUSH TriggerID:753e1073d4b9d298c8e5b50a3ddc1815c8b36a40
   -->
   ## CI report:
   
   * b43b813c97413608b68d0672c312b92b23e1d4e0 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/145720649) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569) 
   * 753e1073d4b9d298c8e5b50a3ddc1815c8b36a40 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r369992272
 
 

 ##########
 File path: flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierAlignerTestBase.java
 ##########
 @@ -1254,9 +1264,15 @@ private static void check(BufferOrEvent expected, BufferOrEvent present, int pag
 		}
 	}
 
-	private static void validateAlignmentTime(long startTimestamp, long alignmentDuration) {
-		final long elapsed = System.nanoTime() - startTimestamp;
-		assertTrue("wrong alignment time", alignmentDuration <= elapsed);
+	private static void validateAlignmentTime(long alignmentStartTimestamp, CheckpointedInputGate inputGate) {
+		long elapsedAlignment = System.nanoTime() - alignmentStartTimestamp;
+		long elapsedTotalTime = System.nanoTime() - testStartTimeNanos;
+		assertThat(inputGate.getAlignmentDurationNanos(), Matchers.lessThanOrEqualTo(elapsedAlignment));
+
+		// Barrier lag is calculated with System.currentTimeMillis(), so we need a tolerance of 1ms
+		// when comparing to time elapsed via System.nanoTime()
+		long tolerance = 1_000_000;
 
 Review comment:
   exactly the reason why I would only use ms. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] pnowojski commented on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
pnowojski commented on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#issuecomment-577718793
 
 
   @AHeise, can you check the updated version?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] pnowojski commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r370089380
 
 

 ##########
 File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierHandler.java
 ##########
 @@ -95,4 +101,10 @@ protected void notifyAbortOnCancellationBarrier(long checkpointId) throws Except
 	protected void notifyAbort(long checkpointId, CheckpointException cause) throws Exception {
 		toNotifyOnCheckpoint.abortCheckpointOnBarrier(checkpointId, cause);
 	}
+
+	protected void markCheckpointStart(long checkpointCreationTimestamp) {
+		latestCheckpointStartDelayNanos = 1000_000 * Math.max(
 
 Review comment:
   Whether it is in the name or not, I think is a secondary issue. If we provide some stats, it's best if all have the same units.
   
   https://ci.apache.org/projects/flink/flink-docs-stable/monitoring/metrics.html#checkpointing
   
   however, annoying thing is, that in this class yes, all units are in nanos, but in other places it's millis (7 metrics in millis vs 2 in nanos?). More over, web ui is also using millis only... 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot commented on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
flinkbot commented on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#issuecomment-577579769
 
 
   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit b43b813c97413608b68d0672c312b92b23e1d4e0 (Thu Jan 23 08:36:05 UTC 2020)
   
    ✅no warnings
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] pnowojski commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r370525968
 
 

 ##########
 File path: docs/monitoring/metrics.md
 ##########
 @@ -1341,11 +1341,16 @@ Metrics related to data exchange between task executors using netty network comm
       <td>Gauge</td>
     </tr>
     <tr>
-      <th rowspan="1">Task</th>
+      <th rowspan="2">Task</th>
       <td>checkpointAlignmentTime</td>
       <td>The time in nanoseconds that the last barrier alignment took to complete, or how long the current alignment has taken so far (in nanoseconds).</td>
       <td>Gauge</td>
     </tr>
+    <tr>
+      <td>checkpointStartDelay</td>
+      <td>The time in nanoseconds that elapsed between the creation of the last checkpoint and the time when the checkpointing process has started by this Task. This delay shows how long it takes for a first checkpoint barrier to reach the task. Back-pressure will increase this value.</td>
 
 Review comment:
   Ops, sorry I've missed that. Updating.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r370031172
 
 

 ##########
 File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierHandler.java
 ##########
 @@ -95,4 +101,10 @@ protected void notifyAbortOnCancellationBarrier(long checkpointId) throws Except
 	protected void notifyAbort(long checkpointId, CheckpointException cause) throws Exception {
 		toNotifyOnCheckpoint.abortCheckpointOnBarrier(checkpointId, cause);
 	}
+
+	protected void markCheckpointStart(long checkpointCreationTimestamp) {
+		latestCheckpointStartDelayNanos = 1000_000 * Math.max(
 
 Review comment:
   I found it weird to read, but it's clearly a nit.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot commented on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
flinkbot commented on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#issuecomment-577584540
 
 
   <!--
   Meta data
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:UNKNOWN URL:TBD TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   -->
   ## CI report:
   
   * b43b813c97413608b68d0672c312b92b23e1d4e0 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#issuecomment-577584540
 
 
   <!--
   Meta data
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/145720649 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   Hash:753e1073d4b9d298c8e5b50a3ddc1815c8b36a40 Status:PENDING URL:https://travis-ci.com/flink-ci/flink/builds/145779679 TriggerType:PUSH TriggerID:753e1073d4b9d298c8e5b50a3ddc1815c8b36a40
   Hash:753e1073d4b9d298c8e5b50a3ddc1815c8b36a40 Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4581 TriggerType:PUSH TriggerID:753e1073d4b9d298c8e5b50a3ddc1815c8b36a40
   -->
   ## CI report:
   
   * b43b813c97413608b68d0672c312b92b23e1d4e0 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/145720649) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569) 
   * 753e1073d4b9d298c8e5b50a3ddc1815c8b36a40 Travis: [PENDING](https://travis-ci.com/flink-ci/flink/builds/145779679) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4581) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] pnowojski commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r370009939
 
 

 ##########
 File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierHandler.java
 ##########
 @@ -95,4 +101,10 @@ protected void notifyAbortOnCancellationBarrier(long checkpointId) throws Except
 	protected void notifyAbort(long checkpointId, CheckpointException cause) throws Exception {
 		toNotifyOnCheckpoint.abortCheckpointOnBarrier(checkpointId, cause);
 	}
+
+	protected void markCheckpointStart(long checkpointCreationTimestamp) {
+		latestCheckpointStartDelayNanos = 1000_000 * Math.max(
 
 Review comment:
   It's in the `ns` due to consistency with the existing method/field `getAlignmentDurationNanos`  and `checkpointAlignmentTime` metric.
   
   Re 1_000_000, I don't think that it increases readability, does it matter?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r369991681
 
 

 ##########
 File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierHandler.java
 ##########
 @@ -95,4 +101,10 @@ protected void notifyAbortOnCancellationBarrier(long checkpointId) throws Except
 	protected void notifyAbort(long checkpointId, CheckpointException cause) throws Exception {
 		toNotifyOnCheckpoint.abortCheckpointOnBarrier(checkpointId, cause);
 	}
+
+	protected void markCheckpointStart(long checkpointCreationTimestamp) {
+		latestCheckpointStartDelayNanos = 1000_000 * Math.max(
 
 Review comment:
   1_000_000, while you usually don't use a separator up to 9999, once you start, you use it consistently every 3 places.
   Why storing it as nanos at all? Should we just track it as ms and only display it as nano (for consistency?). We are faking a precision that is not there.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r370092256
 
 

 ##########
 File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierHandler.java
 ##########
 @@ -95,4 +101,10 @@ protected void notifyAbortOnCancellationBarrier(long checkpointId) throws Except
 	protected void notifyAbort(long checkpointId, CheckpointException cause) throws Exception {
 		toNotifyOnCheckpoint.abortCheckpointOnBarrier(checkpointId, cause);
 	}
+
+	protected void markCheckpointStart(long checkpointCreationTimestamp) {
+		latestCheckpointStartDelayNanos = 1000_000 * Math.max(
 
 Review comment:
   Yes, I'd convert to ms. I don't see why any user profits from ns. 
   If it was some per-record metric, I'd agree with nano (like in Kinesis?).
   It's technically much too complicated to actually produce nano-precision given the expected usability gain.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#issuecomment-577584540
 
 
   <!--
   Meta data
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/145720649 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   Hash:753e1073d4b9d298c8e5b50a3ddc1815c8b36a40 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/145779679 TriggerType:PUSH TriggerID:753e1073d4b9d298c8e5b50a3ddc1815c8b36a40
   Hash:753e1073d4b9d298c8e5b50a3ddc1815c8b36a40 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4581 TriggerType:PUSH TriggerID:753e1073d4b9d298c8e5b50a3ddc1815c8b36a40
   -->
   ## CI report:
   
   * b43b813c97413608b68d0672c312b92b23e1d4e0 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/145720649) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569) 
   * 753e1073d4b9d298c8e5b50a3ddc1815c8b36a40 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/145779679) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4581) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] pnowojski commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r370090021
 
 

 ##########
 File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierHandler.java
 ##########
 @@ -95,4 +101,10 @@ protected void notifyAbortOnCancellationBarrier(long checkpointId) throws Except
 	protected void notifyAbort(long checkpointId, CheckpointException cause) throws Exception {
 		toNotifyOnCheckpoint.abortCheckpointOnBarrier(checkpointId, cause);
 	}
+
+	protected void markCheckpointStart(long checkpointCreationTimestamp) {
+		latestCheckpointStartDelayNanos = 1000_000 * Math.max(
 
 Review comment:
   I guess after all, I would be leaning towards converging towards millis, so changing this new metric to millis as well... WDYT?

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r369992272
 
 

 ##########
 File path: flink-streaming-java/src/test/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierAlignerTestBase.java
 ##########
 @@ -1254,9 +1264,15 @@ private static void check(BufferOrEvent expected, BufferOrEvent present, int pag
 		}
 	}
 
-	private static void validateAlignmentTime(long startTimestamp, long alignmentDuration) {
-		final long elapsed = System.nanoTime() - startTimestamp;
-		assertTrue("wrong alignment time", alignmentDuration <= elapsed);
+	private static void validateAlignmentTime(long alignmentStartTimestamp, CheckpointedInputGate inputGate) {
+		long elapsedAlignment = System.nanoTime() - alignmentStartTimestamp;
+		long elapsedTotalTime = System.nanoTime() - testStartTimeNanos;
+		assertThat(inputGate.getAlignmentDurationNanos(), Matchers.lessThanOrEqualTo(elapsedAlignment));
+
+		// Barrier lag is calculated with System.currentTimeMillis(), so we need a tolerance of 1ms
+		// when comparing to time elapsed via System.nanoTime()
+		long tolerance = 1_000_000;
 
 Review comment:
   exactly the reason why I would only use ms. 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
AHeise commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r370522304
 
 

 ##########
 File path: docs/monitoring/metrics.md
 ##########
 @@ -1341,11 +1341,16 @@ Metrics related to data exchange between task executors using netty network comm
       <td>Gauge</td>
     </tr>
     <tr>
-      <th rowspan="1">Task</th>
+      <th rowspan="2">Task</th>
       <td>checkpointAlignmentTime</td>
       <td>The time in nanoseconds that the last barrier alignment took to complete, or how long the current alignment has taken so far (in nanoseconds).</td>
       <td>Gauge</td>
     </tr>
+    <tr>
+      <td>checkpointStartDelay</td>
+      <td>The time in nanoseconds that elapsed between the creation of the last checkpoint and the time when the checkpointing process has started by this Task. This delay shows how long it takes for a first checkpoint barrier to reach the task. Back-pressure will increase this value.</td>
 
 Review comment:
   Just to make sure that this was not just an oversight:
   I also suggested to add "A high value indicates back-pressure. If only a specific task has a long start delay, the most likely reason is data skew." instead of "Back-pressure will increase this value."

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] AHeise commented on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
AHeise commented on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#issuecomment-578042046
 
 
   Except for the optional comment in the docs. LGTM.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] pnowojski commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#discussion_r370160682
 
 

 ##########
 File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/io/CheckpointBarrierHandler.java
 ##########
 @@ -95,4 +101,10 @@ protected void notifyAbortOnCancellationBarrier(long checkpointId) throws Except
 	protected void notifyAbort(long checkpointId, CheckpointException cause) throws Exception {
 		toNotifyOnCheckpoint.abortCheckpointOnBarrier(checkpointId, cause);
 	}
+
+	protected void markCheckpointStart(long checkpointCreationTimestamp) {
+		latestCheckpointStartDelayNanos = 1000_000 * Math.max(
 
 Review comment:
   Offline discussion consensus was to:
   
   - In order to have consistency start converging all of the metrics (when adding new ones) to nano seconds. To nanoseconds instead of milliseconds, because some metrics might need this extra precision
   - Always adding the unit (`Nanos`) suffix to the  metric name

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#issuecomment-577584540
 
 
   <!--
   Meta data
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/145720649 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:SUCCESS URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   -->
   ## CI report:
   
   * b43b813c97413608b68d0672c312b92b23e1d4e0 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/145720649) Azure: [SUCCESS](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#issuecomment-577584540
 
 
   <!--
   Meta data
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:PENDING URL:https://travis-ci.com/flink-ci/flink/builds/145720649 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   -->
   ## CI report:
   
   * b43b813c97413608b68d0672c312b92b23e1d4e0 Travis: [PENDING](https://travis-ci.com/flink-ci/flink/builds/145720649) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

[GitHub] [flink] flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on issue #10931: [FLINK-15603][metrics] Add checkpointStartDelay metric
URL: https://github.com/apache/flink/pull/10931#issuecomment-577584540
 
 
   <!--
   Meta data
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:SUCCESS URL:https://travis-ci.com/flink-ci/flink/builds/145720649 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   Hash:b43b813c97413608b68d0672c312b92b23e1d4e0 Status:PENDING URL:https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569 TriggerType:PUSH TriggerID:b43b813c97413608b68d0672c312b92b23e1d4e0
   -->
   ## CI report:
   
   * b43b813c97413608b68d0672c312b92b23e1d4e0 Travis: [SUCCESS](https://travis-ci.com/flink-ci/flink/builds/145720649) Azure: [PENDING](https://dev.azure.com/rmetzger/5bd3ef0a-4359-41af-abca-811b04098d2e/_build/results?buildId=4569) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services