You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2022/03/07 14:23:17 UTC

[GitHub] [flink] pnowojski commented on a change in pull request #18991: [FLINK-26279][runtime] Add mailbox delay and throughput metrics

pnowojski commented on a change in pull request #18991:
URL: https://github.com/apache/flink/pull/18991#discussion_r820731984



##########
File path: docs/content.zh/docs/ops/metrics.md
##########
@@ -1428,6 +1428,16 @@ Note that the metrics are only available via reporters.
       <td>Maximum recorded duration of a single consecutive period of the task being in the hard back pressure state in the last sampling period. Please check softBackPressuredTimeMsPerSecond and hardBackPressuredTimeMsPerSecond for more information.</td>
       <td>Gauge</td>
     </tr>
+    <tr>

Review comment:
       what about the mailbox queue length? Should we add it as well?

##########
File path: docs/content.zh/docs/ops/metrics.md
##########
@@ -1428,6 +1428,16 @@ Note that the metrics are only available via reporters.
       <td>Maximum recorded duration of a single consecutive period of the task being in the hard back pressure state in the last sampling period. Please check softBackPressuredTimeMsPerSecond and hardBackPressuredTimeMsPerSecond for more information.</td>
       <td>Gauge</td>
     </tr>
+    <tr>
+      <td>mailboxThroughputPerSecond</td>
+      <td>The number of actions processed from the task's mailbox per second over all actions which includes, e.g., checkpointing, timer, or cancellation actions.</td>
+      <td>Meter</td>
+    </tr>
+    <tr>
+      <td>mailboxThroughputLatencyUs</td>
+      <td>The latency in processing actions from the task's mailbox, i.e., an action's waiting time in the mailbox, in microseconds for actions with standard (default) priority, e.g., checkpointing actions.</td>
+      <td>Histogram</td>

Review comment:
       Have you checked how is it being reported/displayed, especially in the WebUI?

##########
File path: docs/content.zh/docs/ops/metrics.md
##########
@@ -1428,6 +1428,16 @@ Note that the metrics are only available via reporters.
       <td>Maximum recorded duration of a single consecutive period of the task being in the hard back pressure state in the last sampling period. Please check softBackPressuredTimeMsPerSecond and hardBackPressuredTimeMsPerSecond for more information.</td>
       <td>Gauge</td>
     </tr>
+    <tr>
+      <td>mailboxThroughputPerSecond</td>

Review comment:
       `mailsPerSecond` would be more consistent with existing metrics? `taskThreadActionsPerSecond`?

##########
File path: docs/content.zh/docs/ops/metrics.md
##########
@@ -1428,6 +1428,16 @@ Note that the metrics are only available via reporters.
       <td>Maximum recorded duration of a single consecutive period of the task being in the hard back pressure state in the last sampling period. Please check softBackPressuredTimeMsPerSecond and hardBackPressuredTimeMsPerSecond for more information.</td>
       <td>Gauge</td>
     </tr>
+    <tr>
+      <td>mailboxThroughputPerSecond</td>
+      <td>The number of actions processed from the task's mailbox per second over all actions which includes, e.g., checkpointing, timer, or cancellation actions.</td>
+      <td>Meter</td>
+    </tr>
+    <tr>
+      <td>mailboxThroughputLatencyUs</td>

Review comment:
       `mailboxLatencyUs`? `taskThreadLatencyUs`?

##########
File path: docs/content.zh/docs/ops/metrics.md
##########
@@ -1428,6 +1428,16 @@ Note that the metrics are only available via reporters.
       <td>Maximum recorded duration of a single consecutive period of the task being in the hard back pressure state in the last sampling period. Please check softBackPressuredTimeMsPerSecond and hardBackPressuredTimeMsPerSecond for more information.</td>
       <td>Gauge</td>
     </tr>
+    <tr>
+      <td>mailboxThroughputPerSecond</td>
+      <td>The number of actions processed from the task's mailbox per second over all actions which includes, e.g., checkpointing, timer, or cancellation actions.</td>
+      <td>Meter</td>
+    </tr>
+    <tr>
+      <td>mailboxThroughputLatencyUs</td>

Review comment:
       Why microseconds? Everything else is either in seconds, milliseconds or nanoseconds. 

##########
File path: docs/content.zh/docs/ops/metrics.md
##########
@@ -1328,7 +1328,7 @@ Note that the metrics are only available via reporters.
       <td>Histogram</td>
     </tr>
     <tr>
-      <th rowspan="20"><strong>Task</strong></th>
+      <th rowspan="22"><strong>Task</strong></th>

Review comment:
       Do we have a better place for those metrics? They are not very "task io" related

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/metrics/groups/TaskIOMetricGroup.java
##########
@@ -100,6 +104,10 @@ public TaskIOMetricGroup(TaskMetricGroup parent) {
                         hardBackPressuredTimePerSecond::getMaxSingleMeasurement);
 
         this.busyTimePerSecond = gauge(MetricNames.TASK_BUSY_TIME, this::getBusyTimePerSecond);
+
+        this.mailboxThroughput = meter(MetricNames.MAILBOX_THROUGHPUT, new MeterView(60));
+        this.mailboxLatency =
+                histogram(MetricNames.MAILBOX_LATENCY, new DescriptiveStatisticsHistogram(60));

Review comment:
       Why are you defining `60s`  as `timeSpanInSeconds` here? Why not using a default parameter?

##########
File path: flink-streaming-java/src/main/java/org/apache/flink/streaming/runtime/tasks/mailbox/MailboxProcessor.java
##########
@@ -175,6 +199,7 @@ public void close() {
     public void drain() throws Exception {
         for (final Mail mail : mailbox.drain()) {
             mail.run();
+            mailboxThroughputMeter.markEvent();

Review comment:
       I would be slightly worried about adding extra virtual calls overhead. I don't think this is a confirmed concern here, but since we could quite easily avoid it, wouldn't it be better to have here a simple `long counter++` here?

##########
File path: docs/content.zh/docs/ops/metrics.md
##########
@@ -1428,6 +1428,16 @@ Note that the metrics are only available via reporters.
       <td>Maximum recorded duration of a single consecutive period of the task being in the hard back pressure state in the last sampling period. Please check softBackPressuredTimeMsPerSecond and hardBackPressuredTimeMsPerSecond for more information.</td>
       <td>Gauge</td>
     </tr>
+    <tr>
+      <td>mailboxThroughputPerSecond</td>
+      <td>The number of actions processed from the task's mailbox per second over all actions which includes, e.g., checkpointing, timer, or cancellation actions.</td>
+      <td>Meter</td>
+    </tr>
+    <tr>
+      <td>mailboxThroughputLatencyUs</td>
+      <td>The latency in processing actions from the task's mailbox, i.e., an action's waiting time in the mailbox, in microseconds for actions with standard (default) priority, e.g., checkpointing actions.</td>

Review comment:
       I would mention that it is being sampled periodically, so it might not be fully accurate. 




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscribe@flink.apache.org

For queries about this service, please contact Infrastructure at:
users@infra.apache.org