You are viewing a plain text version of this content. The canonical link for it is here.
Posted to commits@doris.apache.org by mo...@apache.org on 2022/10/22 13:33:51 UTC

[doris] branch branch-1.2-unstable updated: [improvement](heartbeat) Add some relaxation strategies to reduce the failure probability of regression testing (#13568)

This is an automated email from the ASF dual-hosted git repository.

morningman pushed a commit to branch branch-1.2-unstable
in repository https://gitbox.apache.org/repos/asf/doris.git


The following commit(s) were added to refs/heads/branch-1.2-unstable by this push:
     new e8858a09c3 [improvement](heartbeat) Add some relaxation strategies to reduce the failure probability of regression testing (#13568)
e8858a09c3 is described below

commit e8858a09c32d7c26bde8d86fdddd13a2b1fa5e37
Author: Mingyu Chen <mo...@gmail.com>
AuthorDate: Sat Oct 22 17:53:07 2022 +0800

    [improvement](heartbeat) Add some relaxation strategies to reduce the failure probability of regression testing (#13568)
    
    The regression test may failed because of heartbeat failure occasionally.
    So I add 2 new FE config to relax this limit
    
    1. `disable_backend_black_list`
        Set to true to not put Backend to black list even if we failed to send task to it. Default is false.
    2. `max_backend_heartbeat_failure_tolerance_count`
       Only if the failure time of heartbeat exceed this config, we can set Backend as dead. Default is 1.
---
 docs/en/docs/admin-manual/config/fe-config.md      | 32 ++++++++++++++++++++
 docs/zh-CN/docs/admin-manual/config/fe-config.md   | 34 +++++++++++++++++++++-
 .../main/java/org/apache/doris/common/Config.java  | 23 +++++++++++++++
 .../apache/doris/common/proc/BackendsProcDir.java  |  3 ++
 .../main/java/org/apache/doris/qe/Coordinator.java |  2 +-
 .../java/org/apache/doris/qe/SimpleScheduler.java  |  5 +++-
 .../main/java/org/apache/doris/system/Backend.java | 26 +++++++++++++++--
 .../doris/utframe/DemoMultiBackendsTest.java       |  5 ++--
 8 files changed, 122 insertions(+), 8 deletions(-)

diff --git a/docs/en/docs/admin-manual/config/fe-config.md b/docs/en/docs/admin-manual/config/fe-config.md
index bcb0713bf7..a4de6ececb 100644
--- a/docs/en/docs/admin-manual/config/fe-config.md
+++ b/docs/en/docs/admin-manual/config/fe-config.md
@@ -2232,3 +2232,35 @@ The latest data version currently supported, cannot be modified, and should be c
 ### `min_be_exec_version`
 
 The oldest data version currently supported, which cannot be modified, should be consistent with the `HeartbeatServer::min_be_exec_version` in the BE of the matching version.
+
+### `max_query_profile_num`
+
+The max number of query profile.
+
+Default: 100
+
+Is it possible to dynamically configure: true
+
+Is it a configuration item unique to the Master FE node: false
+
+### `disable_backend_black_list`
+
+Used to disable the BE blacklist function. After this function is disabled, if the query request to the BE fails, the BE will not be added to the blacklist.
+This parameter is suitable for regression testing environments to reduce occasional bugs that cause a large number of regression tests to fail.
+
+Default: false
+
+Is it possible to configure dynamically: true
+
+Is it a configuration item unique to the Master FE node: false
+
+### `max_backend_heartbeat_failure_tolerance_count`
+
+The maximum tolerable number of BE node heartbeat failures. If the number of consecutive heartbeat failures exceeds this value, the BE state will be set to dead.
+This parameter is suitable for regression test environments to reduce occasional heartbeat failures that cause a large number of regression test failures.
+
+Default: 1
+
+Is it possible to configure dynamically: true
+
+Whether it is a configuration item unique to the Master FE node: true
diff --git a/docs/zh-CN/docs/admin-manual/config/fe-config.md b/docs/zh-CN/docs/admin-manual/config/fe-config.md
index a221bcea51..6608b58845 100644
--- a/docs/zh-CN/docs/admin-manual/config/fe-config.md
+++ b/docs/zh-CN/docs/admin-manual/config/fe-config.md
@@ -2286,4 +2286,36 @@ load 标签清理器将每隔 `label_clean_interval_second` 运行一次以清
 
 ### `min_be_exec_version`
 
-目前支持的最旧数据版本,不可修改,应与配套版本的BE中的`HeartbeatServer::min_data_version`一致。
\ No newline at end of file
+目前支持的最旧数据版本,不可修改,应与配套版本的BE中的`HeartbeatServer::min_data_version`一致。
+
+### `max_query_profile_num`
+
+用于设置保存查询的 profile 的最大个数。
+
+默认值:100
+
+是否可以动态配置:true
+
+是否为 Master FE 节点独有的配置项:false
+
+### `disable_backend_black_list`
+
+用于禁止BE黑名单功能。禁止该功能后,如果向BE发送查询请求失败,也不会将这个BE添加到黑名单。
+该参数适用于回归测试环境,以减少偶发的错误导致大量回归测试失败。
+
+默认值:false
+
+是否可以动态配置:true
+
+是否为 Master FE 节点独有的配置项:false
+
+### `max_backend_heartbeat_failure_tolerance_count`
+
+最大可容忍的BE节点心跳失败次数。如果连续心跳失败次数超过这个值,则会将BE状态置为 dead。
+该参数适用于回归测试环境,以减少偶发的心跳失败导致大量回归测试失败。
+
+默认值:1
+
+是否可以动态配置:true
+
+是否为 Master FE 节点独有的配置项:true
diff --git a/fe/fe-core/src/main/java/org/apache/doris/common/Config.java b/fe/fe-core/src/main/java/org/apache/doris/common/Config.java
index 607db27e82..14c9e54cf7 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/common/Config.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/common/Config.java
@@ -1788,4 +1788,27 @@ public class Config extends ConfigBase {
     @ConfField(mutable = false)
     public static int statistic_task_scheduler_execution_interval_ms = 60 * 1000;
 
+    /**
+     * Max query profile num.
+     */
+    @ConfField(mutable = true, masterOnly = false)
+    public static int max_query_profile_num = 100;
+
+    /**
+     * Set to true to disable backend black list, so that even if we failed to send task to a backend,
+     * that backend won't be added to black list.
+     * This should only be set when running tests, such as regression test.
+     * Highly recommended NOT disable it in product environment.
+     */
+    @ConfField(mutable = true, masterOnly = false)
+    public static boolean disable_backend_black_list = false;
+
+    /**
+     * Maximum backend heartbeat failure tolerance count.
+     * Default is 1, which means if 1 heart failed, the backend will be marked as dead.
+     * A larger value can improve the tolerance of the cluster to occasional heartbeat failures.
+     * For example, when running regression tests, this value can be increased.
+     */
+    @ConfField(mutable = true, masterOnly = true)
+    public static long max_backend_heartbeat_failure_tolerance_count = 1;
 }
diff --git a/fe/fe-core/src/main/java/org/apache/doris/common/proc/BackendsProcDir.java b/fe/fe-core/src/main/java/org/apache/doris/common/proc/BackendsProcDir.java
index aa3c9821b0..096cf57c0b 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/common/proc/BackendsProcDir.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/common/proc/BackendsProcDir.java
@@ -52,6 +52,7 @@ public class BackendsProcDir implements ProcDirInterface {
             .add("SystemDecommissioned").add("ClusterDecommissioned").add("TabletNum")
             .add("DataUsedCapacity").add("AvailCapacity").add("TotalCapacity").add("UsedPct")
             .add("MaxDiskUsedPct").add("RemoteUsedCapacity").add("Tag").add("ErrMsg").add("Version").add("Status")
+            .add("HeartbeatFailureCounter")
             .build();
 
     public static final int HOSTNAME_INDEX = 3;
@@ -178,6 +179,8 @@ public class BackendsProcDir implements ProcDirInterface {
             backendInfo.add(backend.getVersion());
             // status
             backendInfo.add(new Gson().toJson(backend.getBackendStatus()));
+            // heartbeat failure counter
+            backendInfo.add(backend.getHeartbeatFailureCounter());
 
             comparableBackendInfos.add(backendInfo);
         }
diff --git a/fe/fe-core/src/main/java/org/apache/doris/qe/Coordinator.java b/fe/fe-core/src/main/java/org/apache/doris/qe/Coordinator.java
index cd3296c354..13809d5633 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/qe/Coordinator.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/qe/Coordinator.java
@@ -2150,7 +2150,7 @@ public class Coordinator {
         }
 
         public boolean isBackendStateHealthy() {
-            if (backend.getLastMissingHeartbeatTime() > lastMissingHeartbeatTime) {
+            if (backend.getLastMissingHeartbeatTime() > lastMissingHeartbeatTime && !backend.isAlive()) {
                 LOG.warn("backend {} is down while joining the coordinator. job id: {}",
                         backend.getId(), jobId);
                 return false;
diff --git a/fe/fe-core/src/main/java/org/apache/doris/qe/SimpleScheduler.java b/fe/fe-core/src/main/java/org/apache/doris/qe/SimpleScheduler.java
index 5cb66e7a8b..ae3671efc9 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/qe/SimpleScheduler.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/qe/SimpleScheduler.java
@@ -18,6 +18,7 @@
 package org.apache.doris.qe;
 
 import org.apache.doris.catalog.Env;
+import org.apache.doris.common.Config;
 import org.apache.doris.common.FeConstants;
 import org.apache.doris.common.Pair;
 import org.apache.doris.common.Reference;
@@ -166,7 +167,9 @@ public class SimpleScheduler {
     }
 
     public static void addToBlacklist(Long backendID, String reason) {
-        if (backendID == null) {
+        if (backendID == null || Config.disable_backend_black_list) {
+            LOG.warn("ignore backend black list for backend: {}, disabled: {}", backendID,
+                    Config.disable_backend_black_list);
             return;
         }
 
diff --git a/fe/fe-core/src/main/java/org/apache/doris/system/Backend.java b/fe/fe-core/src/main/java/org/apache/doris/system/Backend.java
index fc5758374c..f705f2a2b1 100644
--- a/fe/fe-core/src/main/java/org/apache/doris/system/Backend.java
+++ b/fe/fe-core/src/main/java/org/apache/doris/system/Backend.java
@@ -21,6 +21,7 @@ import org.apache.doris.alter.DecommissionType;
 import org.apache.doris.catalog.DiskInfo;
 import org.apache.doris.catalog.DiskInfo.DiskState;
 import org.apache.doris.catalog.Env;
+import org.apache.doris.common.Config;
 import org.apache.doris.common.FeConstants;
 import org.apache.doris.common.io.Text;
 import org.apache.doris.common.io.Writable;
@@ -128,6 +129,14 @@ public class Backend implements Writable {
     @SerializedName("tagMap")
     private Map<String, String> tagMap = Maps.newHashMap();
 
+    // Counter of heartbeat failure.
+    // Once a heartbeat failed, increase this counter by one.
+    // And if it reaches Config.max_backend_heartbeat_failure_tolerance_count, this backend
+    // will be marked as dead.
+    // And once it back to alive, reset this counter.
+    // No need to persist, because only master FE handle heartbeat.
+    private int heartbeatFailureCounter = 0;
+
     public Backend() {
         this.host = "";
         this.version = "";
@@ -333,6 +342,10 @@ public class Backend implements Writable {
         return backendStatus;
     }
 
+    public int getHeartbeatFailureCounter() {
+        return heartbeatFailureCounter;
+    }
+
     /**
      * backend belong to some cluster
      *
@@ -690,12 +703,19 @@ public class Backend implements Writable {
             }
 
             heartbeatErrMsg = "";
+            this.heartbeatFailureCounter = 0;
         } else {
-            if (isAlive.compareAndSet(true, false)) {
-                isChanged = true;
-                LOG.warn("{} is dead,", this.toString());
+            // Only set backend to dead if the heartbeat failure counter exceed threshold.
+            if (++this.heartbeatFailureCounter >= Config.max_backend_heartbeat_failure_tolerance_count) {
+                if (isAlive.compareAndSet(true, false)) {
+                    isChanged = true;
+                    LOG.warn("{} is dead,", this.toString());
+                }
             }
 
+            // still set error msg and missing time even if we may not mark this backend as dead,
+            // for debug easily.
+            // But notice that if isChanged = false, these msg will not sync to other FE.
             heartbeatErrMsg = hbResponse.getMsg() == null ? "Unknown error" : hbResponse.getMsg();
             lastMissingHeartbeatTime = System.currentTimeMillis();
         }
diff --git a/fe/fe-core/src/test/java/org/apache/doris/utframe/DemoMultiBackendsTest.java b/fe/fe-core/src/test/java/org/apache/doris/utframe/DemoMultiBackendsTest.java
index 7f619f8267..860abe35e2 100644
--- a/fe/fe-core/src/test/java/org/apache/doris/utframe/DemoMultiBackendsTest.java
+++ b/fe/fe-core/src/test/java/org/apache/doris/utframe/DemoMultiBackendsTest.java
@@ -199,8 +199,9 @@ public class DemoMultiBackendsTest {
         ProcResult result = dir.fetchResult();
         Assert.assertEquals(BackendsProcDir.TITLE_NAMES.size(), result.getColumnNames().size());
         Assert.assertEquals("{\"location\" : \"default\"}", result.getRows().get(0).get(20));
-        Assert.assertEquals("{\"lastSuccessReportTabletsTime\":\"N/A\",\"lastStreamLoadTime\":-1,\"isQueryDisabled\":false,\"isLoadDisabled\":false}",
-                result.getRows().get(0).get(BackendsProcDir.TITLE_NAMES.size() - 1));
+        Assert.assertEquals(
+                "{\"lastSuccessReportTabletsTime\":\"N/A\",\"lastStreamLoadTime\":-1,\"isQueryDisabled\":false,\"isLoadDisabled\":false}",
+                result.getRows().get(0).get(BackendsProcDir.TITLE_NAMES.size() - 2));
     }
 
     private static void updateReplicaPathHash() {


---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscribe@doris.apache.org
For additional commands, e-mail: commits-help@doris.apache.org