You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@flink.apache.org by GitBox <gi...@apache.org> on 2021/01/12 11:25:09 UTC

[GitHub] [flink] pnowojski opened a new pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

pnowojski opened a new pull request #14618:
URL: https://github.com/apache/flink/pull/14618


   This PR focuses on providing `idleTime` and `busyTime` metrics per subtask in the "BackPressure" tab in the WebUI. Additionally back pressure ratio in this tab is rewritten to use `backPressuredTime` metric, which is vastly simplifying the existing `JobVertexBackPressureHandler`.
   
   ## Brief change log
   
   Please check individual commit messages.
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / **no**)
     - The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (**yes** / no)
     - The serializers: (yes / **no** / don't know)
     - The runtime per-record code paths (performance sensitive): (yes / **no** / don't know)
     - Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / **no** / don't know)
     - The S3 file system connector: (yes / **no** / don't know)
   
   ## Documentation
   
     - Does this pull request introduce a new feature? (**yes** / no)
     - If yes, how is the feature documented? (not applicable / docs / JavaDocs / **not documented**)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14618:
URL: https://github.com/apache/flink/pull/14618#issuecomment-758605763


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "38e168531edad862805232d24e72f251c1742fca",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11926",
       "triggerID" : "38e168531edad862805232d24e72f251c1742fca",
       "triggerType" : "PUSH"
     }, {
       "hash" : "80aefc49c25a16273dc01a5441c23b664228d2c5",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11935",
       "triggerID" : "80aefc49c25a16273dc01a5441c23b664228d2c5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 38e168531edad862805232d24e72f251c1742fca Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11926) 
   * 80aefc49c25a16273dc01a5441c23b664228d2c5 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11935) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14618:
URL: https://github.com/apache/flink/pull/14618#issuecomment-758605763


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "38e168531edad862805232d24e72f251c1742fca",
       "status" : "DELETED",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11926",
       "triggerID" : "38e168531edad862805232d24e72f251c1742fca",
       "triggerType" : "PUSH"
     }, {
       "hash" : "80aefc49c25a16273dc01a5441c23b664228d2c5",
       "status" : "SUCCESS",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11935",
       "triggerID" : "80aefc49c25a16273dc01a5441c23b664228d2c5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 80aefc49c25a16273dc01a5441c23b664228d2c5 Azure: [SUCCESS](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11935) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14618:
URL: https://github.com/apache/flink/pull/14618#issuecomment-758605763


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "38e168531edad862805232d24e72f251c1742fca",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11926",
       "triggerID" : "38e168531edad862805232d24e72f251c1742fca",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 38e168531edad862805232d24e72f251c1742fca Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11926) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] pnowojski commented on a change in pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #14618:
URL: https://github.com/apache/flink/pull/14618#discussion_r555728947



##########
File path: flink-runtime-web/web-dashboard/src/app/pages/job/overview/backpressure/job-overview-drawer-backpressure.component.ts
##########
@@ -41,6 +41,15 @@ export class JobOverviewDrawerBackpressureComponent implements OnInit, OnDestroy
     return node.subtask;
   }
 
+  prettyPrint(value: number): string {
+    if (isNaN(value)) {
+      return "N/A"
+    }
+    else {
+      return Math.round(value * 100) + "%";

Review comment:
       It just simply rounds the value to the nearest integer. Why do you think we shouldn't be showing 100%?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] zentol commented on a change in pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #14618:
URL: https://github.com/apache/flink/pull/14618#discussion_r555708125



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobVertexBackPressureHandler.java
##########
@@ -49,50 +54,78 @@
                 JobVertexBackPressureInfo,
                 JobVertexMessageParameters> {
 
+    private final MetricFetcher metricFetcher;
+
     public JobVertexBackPressureHandler(
             GatewayRetriever<? extends RestfulGateway> leaderRetriever,
             Time timeout,
             Map<String, String> responseHeaders,
             MessageHeaders<EmptyRequestBody, JobVertexBackPressureInfo, JobVertexMessageParameters>
-                    messageHeaders) {
+                    messageHeaders,
+            MetricFetcher metricFetcher) {
         super(leaderRetriever, timeout, responseHeaders, messageHeaders);
+        this.metricFetcher = metricFetcher;
     }
 
     @Override
     protected CompletableFuture<JobVertexBackPressureInfo> handleRequest(
             @Nonnull HandlerRequest<EmptyRequestBody, JobVertexMessageParameters> request,
             @Nonnull RestfulGateway gateway)
             throws RestHandlerException {
+        metricFetcher.update();
+
         final JobID jobId = request.getPathParameter(JobIDPathParameter.class);
         final JobVertexID jobVertexId = request.getPathParameter(JobVertexIdPathParameter.class);
-        return gateway.requestOperatorBackPressureStats(jobId, jobVertexId)
-                .thenApply(
-                        operatorBackPressureStats ->
-                                operatorBackPressureStats
-                                        .getOperatorBackPressureStats()
-                                        .map(
-                                                JobVertexBackPressureHandler
-                                                        ::createJobVertexBackPressureInfo)
-                                        .orElse(JobVertexBackPressureInfo.deprecated()));
+
+        TaskMetricStore taskMetricStore =
+                metricFetcher
+                        .getMetricStore()
+                        .getTaskMetricStore(jobId.toString(), jobVertexId.toString());
+
+        return CompletableFuture.completedFuture(
+                taskMetricStore != null
+                        ? createJobVertexBackPressureInfo(
+                                taskMetricStore.getAllSubtaskMetricStores())
+                        : JobVertexBackPressureInfo.deprecated());
     }
 
-    private static JobVertexBackPressureInfo createJobVertexBackPressureInfo(
-            final OperatorBackPressureStats operatorBackPressureStats) {
+    private JobVertexBackPressureInfo createJobVertexBackPressureInfo(
+            Collection<ComponentMetricStore> allSubtaskMetricStores) {
+
         return new JobVertexBackPressureInfo(
                 JobVertexBackPressureInfo.VertexBackPressureStatus.OK,
-                getBackPressureLevel(operatorBackPressureStats.getMaxBackPressureRatio()),
-                operatorBackPressureStats.getEndTimestamp(),
-                IntStream.range(0, operatorBackPressureStats.getNumberOfSubTasks())
-                        .mapToObj(
-                                subtask -> {
-                                    final double backPressureRatio =
-                                            operatorBackPressureStats.getBackPressureRatio(subtask);
-                                    return new JobVertexBackPressureInfo.SubtaskBackPressureInfo(
-                                            subtask,
-                                            getBackPressureLevel(backPressureRatio),
-                                            backPressureRatio);
-                                })
-                        .collect(Collectors.toList()));
+                getBackPressureLevel(getMaxBackPressureRatio(allSubtaskMetricStores)),
+                metricFetcher.getLastUpdateTime(),
+                createSubtaskBackPressureInfo(allSubtaskMetricStores));
+    }
+
+    private List<SubtaskBackPressureInfo> createSubtaskBackPressureInfo(
+            Collection<ComponentMetricStore> subtaskMetricStores) {
+        List<SubtaskBackPressureInfo> result = new ArrayList<>(subtaskMetricStores.size());
+        int subTaskIndex = 0;
+        for (ComponentMetricStore subtaskMetricStore : subtaskMetricStores) {
+            double backPressureRatio = getBackPressureRatio(subtaskMetricStore);
+            result.add(
+                    new SubtaskBackPressureInfo(
+                            subTaskIndex,
+                            getBackPressureLevel(backPressureRatio),
+                            backPressureRatio));
+            subTaskIndex++;

Review comment:
       this could be incorrect if at this point in time only some of the subtasks have reported back. We could change taskMetricStore.getAllSubtaskMetricStores() to also return the subtask index, or add a dedicated SubtaskMetricStore with a getter for the index.

##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobVertexBackPressureHandler.java
##########
@@ -124,8 +128,19 @@ private double getMaxBackPressureRatio(
     }
 
     private double getBackPressureRatio(ComponentMetricStore metricStore) {
-        return Double.valueOf(metricStore.getMetric(MetricNames.TASK_BACK_PRESSURED_TIME, "0"))
-                / 1_000;
+        return getDoubleMetric(metricStore, MetricNames.TASK_BACK_PRESSURED_TIME);
+    }
+
+    private double getIdleRatio(ComponentMetricStore metricStore) {
+        return getDoubleMetric(metricStore, MetricNames.TASK_IDLE_TIME);
+    }
+
+    private double getBusyRatio(ComponentMetricStore metricStore) {
+        return getDoubleMetric(metricStore, MetricNames.TASK_BUSY_TIME);
+    }
+
+    private double getDoubleMetric(ComponentMetricStore metricStore, String metricName) {

Review comment:
       This does a bit more than loading a double metric.

##########
File path: flink-runtime-web/web-dashboard/src/app/pages/job/overview/backpressure/job-overview-drawer-backpressure.component.ts
##########
@@ -41,6 +41,15 @@ export class JobOverviewDrawerBackpressureComponent implements OnInit, OnDestroy
     return node.subtask;
   }
 
+  prettyPrint(value: number): string {
+    if (isNaN(value)) {
+      return "N/A"
+    }
+    else {
+      return Math.round(value * 100) + "%";

Review comment:
       does this round up or down? (I suppose we ideally should round down so we never accidentally show 100%?




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14618:
URL: https://github.com/apache/flink/pull/14618#issuecomment-758593887


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 80aefc49c25a16273dc01a5441c23b664228d2c5 (Fri May 28 07:06:08 UTC 2021)
   
   **Warnings:**
    * No documentation files were touched! Remember to keep the Flink docs up to date!
   
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] pnowojski merged pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
pnowojski merged pull request #14618:
URL: https://github.com/apache/flink/pull/14618


   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] pnowojski commented on a change in pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #14618:
URL: https://github.com/apache/flink/pull/14618#discussion_r555807692



##########
File path: flink-runtime-web/web-dashboard/src/app/pages/job/overview/backpressure/job-overview-drawer-backpressure.component.ts
##########
@@ -41,6 +41,15 @@ export class JobOverviewDrawerBackpressureComponent implements OnInit, OnDestroy
     return node.subtask;
   }
 
+  prettyPrint(value: number): string {
+    if (isNaN(value)) {
+      return "N/A"
+    }
+    else {
+      return Math.round(value * 100) + "%";

Review comment:
       Ok, I get your point. I have two reasons why I decided to round it:
   
   - it takes less space in the WebUI, and it's currently being displayed in one column, three ratios (back pressured / idle / busy) one after another. For example: `14% / 36% / 50%`. Adding decimal places was clogging this view a bit.
   - those metrics are already approximations. For example sleeping intervals of <1ms are not measured accurately, but only asymptomatically. If you measure many intervals intervals,  this +/-1ms will average out. So even if report `100.0%` without this `Math.round()` call, it still doesn't necessarily mean usage was 100%. But 100% +/- 1ms asymptomatically. And in the worst case scenario (with every sleep interval being 0.499ms and measured as 0ms) we can have reported 0.0% usage, while in reality it should have been 49.9% (this worst case scenario is unrealistic though)




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] pnowojski commented on a change in pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #14618:
URL: https://github.com/apache/flink/pull/14618#discussion_r555807692



##########
File path: flink-runtime-web/web-dashboard/src/app/pages/job/overview/backpressure/job-overview-drawer-backpressure.component.ts
##########
@@ -41,6 +41,15 @@ export class JobOverviewDrawerBackpressureComponent implements OnInit, OnDestroy
     return node.subtask;
   }
 
+  prettyPrint(value: number): string {
+    if (isNaN(value)) {
+      return "N/A"
+    }
+    else {
+      return Math.round(value * 100) + "%";

Review comment:
       Ok, I get your point. I have two reasons why I decided to round it:
   
   - it takes less space in the WebUI, and it's currently being displayed in one column, three ratios (back pressured / idle / busy) one after another. For example: `14% / 36% / 50%`. Adding decimal places was clogging this view a bit.
   - those metrics are already approximations. For example sleeping intervals of <1ms are not measured accurately, but only asymptomatically. If you measure many intervals intervals,  this +/-1ms will average out. So even if report `100.0%` without this `Math.round()` call, it still doesn't necessarily mean usage was 100%. But 100% +/- 1ms asymptomatically. And in the worst case scenario (with every sleep interval being 0.499ms and measured as 0ms) we can have reported 0.0% usage, while in reality it should have been 49.9% (this worst case scenario is unrealistic though)
   - floats/doubles are not precise anyway 




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] pnowojski commented on a change in pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #14618:
URL: https://github.com/apache/flink/pull/14618#discussion_r555838520



##########
File path: flink-runtime-web/web-dashboard/src/app/pages/job/overview/backpressure/job-overview-drawer-backpressure.component.ts
##########
@@ -41,6 +41,15 @@ export class JobOverviewDrawerBackpressureComponent implements OnInit, OnDestroy
     return node.subtask;
   }
 
+  prettyPrint(value: number): string {
+    if (isNaN(value)) {
+      return "N/A"
+    }
+    else {
+      return Math.round(value * 100) + "%";

Review comment:
       👍 
   I think it's fine here.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14618:
URL: https://github.com/apache/flink/pull/14618#issuecomment-758605763


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "38e168531edad862805232d24e72f251c1742fca",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11926",
       "triggerID" : "38e168531edad862805232d24e72f251c1742fca",
       "triggerType" : "PUSH"
     }, {
       "hash" : "80aefc49c25a16273dc01a5441c23b664228d2c5",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "80aefc49c25a16273dc01a5441c23b664228d2c5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 38e168531edad862805232d24e72f251c1742fca Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11926) 
   * 80aefc49c25a16273dc01a5441c23b664228d2c5 UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #14618:
URL: https://github.com/apache/flink/pull/14618#issuecomment-758605763


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "38e168531edad862805232d24e72f251c1742fca",
       "status" : "UNKNOWN",
       "url" : "TBD",
       "triggerID" : "38e168531edad862805232d24e72f251c1742fca",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 38e168531edad862805232d24e72f251c1742fca UNKNOWN
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot commented on pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
flinkbot commented on pull request #14618:
URL: https://github.com/apache/flink/pull/14618#issuecomment-758593887


   Thanks a lot for your contribution to the Apache Flink project. I'm the @flinkbot. I help the community
   to review your pull request. We will use this comment to track the progress of the review.
   
   
   ## Automated Checks
   Last check on commit 38e168531edad862805232d24e72f251c1742fca (Tue Jan 12 11:27:30 UTC 2021)
   
    ✅no warnings
   
   <sub>Mention the bot in a comment to re-run the automated checks.</sub>
   ## Review Progress
   
   * ❓ 1. The [description] looks good.
   * ❓ 2. There is [consensus] that the contribution should go into to Flink.
   * ❓ 3. Needs [attention] from.
   * ❓ 4. The change fits into the overall [architecture].
   * ❓ 5. Overall code [quality] is good.
   
   Please see the [Pull Request Review Guide](https://flink.apache.org/contributing/reviewing-prs.html) for a full explanation of the review process.<details>
    The Bot is tracking the review progress through labels. Labels are applied according to the order of the review items. For consensus, approval by a Flink committer of PMC member is required <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot approve description` to approve one or more aspects (aspects: `description`, `consensus`, `architecture` and `quality`)
    - `@flinkbot approve all` to approve all aspects
    - `@flinkbot approve-until architecture` to approve everything until `architecture`
    - `@flinkbot attention @username1 [@username2 ..]` to require somebody's attention
    - `@flinkbot disapprove architecture` to remove an approval you gave earlier
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] flinkbot edited a comment on pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
flinkbot edited a comment on pull request #14618:
URL: https://github.com/apache/flink/pull/14618#issuecomment-758605763


   <!--
   Meta data
   {
     "version" : 1,
     "metaDataEntries" : [ {
       "hash" : "38e168531edad862805232d24e72f251c1742fca",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11926",
       "triggerID" : "38e168531edad862805232d24e72f251c1742fca",
       "triggerType" : "PUSH"
     }, {
       "hash" : "80aefc49c25a16273dc01a5441c23b664228d2c5",
       "status" : "PENDING",
       "url" : "https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11935",
       "triggerID" : "80aefc49c25a16273dc01a5441c23b664228d2c5",
       "triggerType" : "PUSH"
     } ]
   }-->
   ## CI report:
   
   * 38e168531edad862805232d24e72f251c1742fca Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11926) 
   * 80aefc49c25a16273dc01a5441c23b664228d2c5 Azure: [PENDING](https://dev.azure.com/apache-flink/98463496-1af2-4620-8eab-a2ecc1a2e6fe/_build/results?buildId=11935) 
   
   <details>
   <summary>Bot commands</summary>
     The @flinkbot bot supports the following commands:
   
    - `@flinkbot run travis` re-run the last Travis build
    - `@flinkbot run azure` re-run the last Azure build
   </details>


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] pnowojski commented on a change in pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #14618:
URL: https://github.com/apache/flink/pull/14618#discussion_r555807692



##########
File path: flink-runtime-web/web-dashboard/src/app/pages/job/overview/backpressure/job-overview-drawer-backpressure.component.ts
##########
@@ -41,6 +41,15 @@ export class JobOverviewDrawerBackpressureComponent implements OnInit, OnDestroy
     return node.subtask;
   }
 
+  prettyPrint(value: number): string {
+    if (isNaN(value)) {
+      return "N/A"
+    }
+    else {
+      return Math.round(value * 100) + "%";

Review comment:
       Ok, I get your point. I have two reasons why I decided to round it:
   
   - it takes less space in the WebUI, and it's currently being displayed in one column, three ratios (back pressured / idle / busy) one after another. For example: `14% / 36% / 50%`. Adding decimal places was clogging this view a bit.
   - those metrics are already approximations. For example sleeping intervals of <1ms are not measured accurately, but only asymptomatically. If you measure many intervals intervals,  this +/-1ms will average out. So even if report `100.0%` without this `Math.round()` call, it still doesn't necessarily mean usage was 100%. But 100% +/- 1ms asymptomatically. And in the worst case scenario (with every sleep interval being 0.499ms and measured as 0ms) we can have reported 0.0% usage, while in reality it should have been 49.9% (this worst case scenario is unrealistic though)
   - floats/doubles are not precise anyway.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] zentol commented on a change in pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #14618:
URL: https://github.com/apache/flink/pull/14618#discussion_r555768917



##########
File path: flink-runtime-web/web-dashboard/src/app/pages/job/overview/backpressure/job-overview-drawer-backpressure.component.ts
##########
@@ -41,6 +41,15 @@ export class JobOverviewDrawerBackpressureComponent implements OnInit, OnDestroy
     return node.subtask;
   }
 
+  prettyPrint(value: number): string {
+    if (isNaN(value)) {
+      return "N/A"
+    }
+    else {
+      return Math.round(value * 100) + "%";

Review comment:
       I was just reminded of FLINK-20424 where rounding to 100% could be misleading.
   For backpressure/idling I think the distinction between 99.6% and 100% could be important (doing _something_ vs doing _nothing_). But that's just me.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] zentol commented on a change in pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
zentol commented on a change in pull request #14618:
URL: https://github.com/apache/flink/pull/14618#discussion_r555810906



##########
File path: flink-runtime-web/web-dashboard/src/app/pages/job/overview/backpressure/job-overview-drawer-backpressure.component.ts
##########
@@ -41,6 +41,15 @@ export class JobOverviewDrawerBackpressureComponent implements OnInit, OnDestroy
     return node.subtask;
   }
 
+  prettyPrint(value: number): string {
+    if (isNaN(value)) {
+      return "N/A"
+    }
+    else {
+      return Math.round(value * 100) + "%";

Review comment:
       I don't mind rounding, that is all fine. The key thing is whether to round up or down ;) If we're making a conscious decision to round to the nearest integer, then that's fine.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] pnowojski commented on pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
pnowojski commented on pull request #14618:
URL: https://github.com/apache/flink/pull/14618#issuecomment-758889979


   Thanks for the review @zentol. Azure is green, so merging.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org



[GitHub] [flink] pnowojski commented on a change in pull request #14618: [FLINK-20852][metrics][webui] Provide more detailed per subtask backpressure stats

Posted by GitBox <gi...@apache.org>.
pnowojski commented on a change in pull request #14618:
URL: https://github.com/apache/flink/pull/14618#discussion_r555778071



##########
File path: flink-runtime/src/main/java/org/apache/flink/runtime/rest/handler/job/JobVertexBackPressureHandler.java
##########
@@ -49,50 +54,78 @@
                 JobVertexBackPressureInfo,
                 JobVertexMessageParameters> {
 
+    private final MetricFetcher metricFetcher;
+
     public JobVertexBackPressureHandler(
             GatewayRetriever<? extends RestfulGateway> leaderRetriever,
             Time timeout,
             Map<String, String> responseHeaders,
             MessageHeaders<EmptyRequestBody, JobVertexBackPressureInfo, JobVertexMessageParameters>
-                    messageHeaders) {
+                    messageHeaders,
+            MetricFetcher metricFetcher) {
         super(leaderRetriever, timeout, responseHeaders, messageHeaders);
+        this.metricFetcher = metricFetcher;
     }
 
     @Override
     protected CompletableFuture<JobVertexBackPressureInfo> handleRequest(
             @Nonnull HandlerRequest<EmptyRequestBody, JobVertexMessageParameters> request,
             @Nonnull RestfulGateway gateway)
             throws RestHandlerException {
+        metricFetcher.update();
+
         final JobID jobId = request.getPathParameter(JobIDPathParameter.class);
         final JobVertexID jobVertexId = request.getPathParameter(JobVertexIdPathParameter.class);
-        return gateway.requestOperatorBackPressureStats(jobId, jobVertexId)
-                .thenApply(
-                        operatorBackPressureStats ->
-                                operatorBackPressureStats
-                                        .getOperatorBackPressureStats()
-                                        .map(
-                                                JobVertexBackPressureHandler
-                                                        ::createJobVertexBackPressureInfo)
-                                        .orElse(JobVertexBackPressureInfo.deprecated()));
+
+        TaskMetricStore taskMetricStore =
+                metricFetcher
+                        .getMetricStore()
+                        .getTaskMetricStore(jobId.toString(), jobVertexId.toString());
+
+        return CompletableFuture.completedFuture(
+                taskMetricStore != null
+                        ? createJobVertexBackPressureInfo(
+                                taskMetricStore.getAllSubtaskMetricStores())
+                        : JobVertexBackPressureInfo.deprecated());
     }
 
-    private static JobVertexBackPressureInfo createJobVertexBackPressureInfo(
-            final OperatorBackPressureStats operatorBackPressureStats) {
+    private JobVertexBackPressureInfo createJobVertexBackPressureInfo(
+            Collection<ComponentMetricStore> allSubtaskMetricStores) {
+
         return new JobVertexBackPressureInfo(
                 JobVertexBackPressureInfo.VertexBackPressureStatus.OK,
-                getBackPressureLevel(operatorBackPressureStats.getMaxBackPressureRatio()),
-                operatorBackPressureStats.getEndTimestamp(),
-                IntStream.range(0, operatorBackPressureStats.getNumberOfSubTasks())
-                        .mapToObj(
-                                subtask -> {
-                                    final double backPressureRatio =
-                                            operatorBackPressureStats.getBackPressureRatio(subtask);
-                                    return new JobVertexBackPressureInfo.SubtaskBackPressureInfo(
-                                            subtask,
-                                            getBackPressureLevel(backPressureRatio),
-                                            backPressureRatio);
-                                })
-                        .collect(Collectors.toList()));
+                getBackPressureLevel(getMaxBackPressureRatio(allSubtaskMetricStores)),
+                metricFetcher.getLastUpdateTime(),
+                createSubtaskBackPressureInfo(allSubtaskMetricStores));
+    }
+
+    private List<SubtaskBackPressureInfo> createSubtaskBackPressureInfo(
+            Collection<ComponentMetricStore> subtaskMetricStores) {
+        List<SubtaskBackPressureInfo> result = new ArrayList<>(subtaskMetricStores.size());
+        int subTaskIndex = 0;
+        for (ComponentMetricStore subtaskMetricStore : subtaskMetricStores) {
+            double backPressureRatio = getBackPressureRatio(subtaskMetricStore);
+            result.add(
+                    new SubtaskBackPressureInfo(
+                            subTaskIndex,
+                            getBackPressureLevel(backPressureRatio),
+                            backPressureRatio));
+            subTaskIndex++;

Review comment:
       Good catch, I didn't know about this `MetricStore` feature/behaviour. I don't know why I was assuming it would return empty metrics in that case.
   
   Anyway, fixed with `getAllSubtaskMetricStores` returning the a map right now. I've also extended test case to cover for that.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
users@infra.apache.org