You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@flink.apache.org by ch...@apache.org on 2021/11/22 21:39:00 UTC

[flink] branch master updated: [FLINK-24876][docs] Remove metrics limitation of Adaptive Scheduler

This is an automated email from the ASF dual-hosted git repository.

chesnay pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/flink.git


The following commit(s) were added to refs/heads/master by this push:
     new feed5e0  [FLINK-24876][docs] Remove metrics limitation of Adaptive Scheduler
feed5e0 is described below

commit feed5e0ec9a3e841e758997a0a6e72accf59a79b
Author: Chesnay Schepler <ch...@apache.org>
AuthorDate: Thu Nov 11 14:36:26 2021 +0100

    [FLINK-24876][docs] Remove metrics limitation of Adaptive Scheduler
---
 docs/content.zh/docs/deployment/elastic_scaling.md | 1 -
 docs/content.zh/docs/ops/metrics.md                | 8 --------
 docs/content/docs/deployment/elastic_scaling.md    | 1 -
 docs/content/docs/ops/metrics.md                   | 8 --------
 4 files changed, 18 deletions(-)

diff --git a/docs/content.zh/docs/deployment/elastic_scaling.md b/docs/content.zh/docs/deployment/elastic_scaling.md
index 062cd7b..c1effbd 100644
--- a/docs/content.zh/docs/deployment/elastic_scaling.md
+++ b/docs/content.zh/docs/deployment/elastic_scaling.md
@@ -146,7 +146,6 @@ Adaptive 调度器可以通过[所有在名字包含 `adaptive-scheduler` 的配
 - **不支持[本地恢复]({{< ref "docs/ops/state/large_state_tuning">}}#task-local-recovery)**:本地恢复是将 Task 调度到状态尽可能的被重用的机器上的功能。不支持这个功能意味着 Adaptive 调度器需要每次从 Checkpoint 的存储中下载整个 State。
 - **不支持部分故障恢复**: 部分故障恢复意味着调度器可以只重启失败 Job 其中某一部分(在 Flink 的内部结构中被称之为 Region)而不是重启整个 Job。这个限制只会影响那些独立并行(Embarrassingly Parallel)Job的恢复时长,默认的调度器可以重启失败的部分,然而 Adaptive 将需要重启整个 Job。
 - **与 Flink Web UI 的集成受限**: Adaptive 调度器会在 Job 的生命周期中改变它的并行度。Web UI 上只显示 Job 当前的并行度。
-- **Job 的指标受限**: 除了 `numRestarts` 外,`Job` 作用域下所有的 [可用性]({{< ref "docs/ops/metrics" >}}#availability) 和 [Checkpoint]({{< ref "docs/ops/metrics" >}}#checkpointing) 指标都不准确。
 - **空闲 Slot**: 如果 Slot 共享组的最大并行度不相等,提供给 Adaptive 调度器所使用的的 Slot 可能不会被使用。
 - 扩缩容事件会触发 Job 和 Task 重启,Task 重试的次数也会增加。
 
diff --git a/docs/content.zh/docs/ops/metrics.md b/docs/content.zh/docs/ops/metrics.md
index 58d3e93..97e344d 100644
--- a/docs/content.zh/docs/ops/metrics.md
+++ b/docs/content.zh/docs/ops/metrics.md
@@ -1051,10 +1051,6 @@ Metrics related to data exchange between task executors using netty network comm
 
 ### Availability
 
-{{< hint warning >}}
-If [Reactive Mode]({{< ref "docs/deployment/elastic_scaling" >}}#reactive-mode) is enabled then these metrics, except `numRestarts`, do not work correctly.
-{{< /hint >}}
-
 <table class="table table-bordered">
   <thead>
     <tr>
@@ -1103,10 +1099,6 @@ If [Reactive Mode]({{< ref "docs/deployment/elastic_scaling" >}}#reactive-mode)
 {
 ### Checkpointing
 
-{{< hint warning >}}
-If [Reactive Mode]({{< ref "docs/deployment/elastic_scaling" >}}#reactive-mode) is enabled then the checkpointing metrics with the `Job` scope do not work correctly.
-{{< /hint >}}
-
 Note that for failed checkpoints, metrics are updated on a best efforts basis and may be not accurate.
 <table class="table table-bordered">
   <thead>
diff --git a/docs/content/docs/deployment/elastic_scaling.md b/docs/content/docs/deployment/elastic_scaling.md
index 24680ce..9f14af9 100644
--- a/docs/content/docs/deployment/elastic_scaling.md
+++ b/docs/content/docs/deployment/elastic_scaling.md
@@ -148,7 +148,6 @@ The behavior of Adaptive Scheduler is configured by [all configuration options c
 - **No support for [local recovery]({{< ref "docs/ops/state/large_state_tuning">}}#task-local-recovery)**: Local recovery is a feature that schedules tasks to machines so that the state on that machine gets re-used if possible. The lack of this feature means that Adaptive Scheduler will always need to download the entire state from the checkpoint storage.
 - **No support for partial failover**: Partial failover means that the scheduler is able to restart parts ("regions" in Flink's internals) of a failed job, instead of the entire job. This limitation impacts only recovery time of embarrassingly parallel jobs: Flink's default scheduler can restart failed parts, while Adaptive Scheduler will restart the entire job.
 - **Limited integration with Flink's Web UI**: Adaptive Scheduler allows that a job's parallelism can change over its lifetime. The web UI only shows the current parallelism the job.
-- **Limited Job metrics**: With the exception of `numRestarts` all [availability]({{< ref "docs/ops/metrics" >}}#availability) and [checkpointing]({{< ref "docs/ops/metrics" >}}#checkpointing) metrics with the `Job` scope are not working correctly.
 - **Unused slots**: If the max parallelism for slot sharing groups is not equal, slots offered to Adaptive Scheduler might be unused.
 - Scaling events trigger job and task restarts, which will increase the number of Task attempts.
 
diff --git a/docs/content/docs/ops/metrics.md b/docs/content/docs/ops/metrics.md
index 354669d..d52396c 100644
--- a/docs/content/docs/ops/metrics.md
+++ b/docs/content/docs/ops/metrics.md
@@ -1051,10 +1051,6 @@ Metrics related to data exchange between task executors using netty network comm
 
 ### Availability
 
-{{< hint warning >}}
-If [Reactive Mode]({{< ref "docs/deployment/elastic_scaling" >}}#reactive-mode) is enabled then these metrics, except `numRestarts`, do not work correctly.
-{{< /hint >}}
-
 <table class="table table-bordered">
   <thead>
     <tr>
@@ -1102,10 +1098,6 @@ If [Reactive Mode]({{< ref "docs/deployment/elastic_scaling" >}}#reactive-mode)
 
 ### Checkpointing
 
-{{< hint warning >}}
-If [Reactive Mode]({{< ref "docs/deployment/elastic_scaling" >}}#reactive-mode) is enabled then checkpointing metrics with the `Job` scope do not work correctly.
-{{< /hint >}}
-
 Note that for failed checkpoints, metrics are updated on a best efforts basis and may be not accurate.
 <table class="table table-bordered">
   <thead>