Posted to commits@kudu.apache.org by al...@apache.org on 2019/04/02 21:17:42 UTC

[kudu] 03/04: [TS heartbeater] avoid reconnecting to master too often

This is an automated email from the ASF dual-hosted git repository.

alexey pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git

commit 0ce9b05d046da01b7b098818fe5e42c1f40e9ac2
Author: Alexey Serbin <al...@apache.org>
AuthorDate: Fri Mar 1 17:19:18 2019 -0800

    [TS heartbeater] avoid reconnecting to master too often
    
    With this patch, the heartbeater thread in tservers no longer resets
    its master proxy and reconnects to the master (re-negotiating the
    connection) on every heartbeat under certain conditions.  In particular,
    that used to happen when the master was accepting connections and
    responding to Ping RPC requests, but was not able to process TS
    heartbeats properly because it was still bootstrapping.
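    
    To make the described behavior easier to follow, here is a schematic
    sketch in plain C++ (not actual Kudu code; names and the threshold
    value are made up for illustration) of the heartbeat loop while the
    master is bootstrapping: the ServiceUnavailable response is not a
    network error, so only the consecutive-failure clause of the old reset
    condition applied, and it fired on every heartbeat once the backoff
    threshold had been passed.
    
      #include <cstdio>
      
      enum class HeartbeatResult { OK, NETWORK_ERROR, SERVICE_UNAVAILABLE };
      
      int main() {
        const int kMaxFailuresBeforeBackoff = 3;  // stand-in for the gflag; value assumed
        int consecutive_failed_heartbeats = 0;
        bool proxy_connected = false;
      
        for (int hb = 0; hb < 10; ++hb) {
          if (!proxy_connected) {
            // In the real heartbeater this is a full connection negotiation,
            // which produced the "Connected to a master server ..." messages.
            printf("heartbeat %d: (re)connecting to master\n", hb);
            proxy_connected = true;
          }
          // The master is up and reachable, but still bootstrapping.
          HeartbeatResult result = HeartbeatResult::SERVICE_UNAVAILABLE;
          if (result != HeartbeatResult::OK) {
            ++consecutive_failed_heartbeats;
            // Old condition: not a network error, but the failure count stays
            // above the threshold, so the proxy is dropped every single time.
            if (result == HeartbeatResult::NETWORK_ERROR ||
                consecutive_failed_heartbeats >= kMaxFailuresBeforeBackoff) {
              proxy_connected = false;
            }
          }
        }
        return 0;
      }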
    
    E.g., when running the RemoteKsckTest.TestClusterWithLocation test
    scenario for TSAN builds, I sometimes saw log messages like the
    following (the test sets FLAGS_heartbeat_interval_ms = 10):
    
    I0301 20:29:11.932394  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:11.944639  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:11.946904  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:11.960994  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:11.964995  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:11.972220  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:11.974987  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:11.988946  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:11.991653  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.003091  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.017015  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.017540  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.031175  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.031175  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.046165  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.059644  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.073026  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.075335  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.077802  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.089138  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.101193  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.102268  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.104634  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.118392  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.132237  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.147235  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.165709  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.171120  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.179481  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    I0301 20:29:12.191591  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
    
    It turned out the counter of consecutive failed heartbeats kept
    increasing because the master was responding with ServiceUnavailable
    to incoming TS heartbeats.  The prior version of the code reset
    the master proxy on every failed heartbeat once
    FLAGS_heartbeat_max_failures_before_backoff consecutive errors had
    occurred, and that was the reason behind the frequent re-connections
    to the cluster.
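    
    For concreteness, the following minimal, self-contained sketch (plain
    C++, not Kudu code; the threshold value of 3 is assumed purely for
    illustration) counts how often each condition would reset the proxy
    over a run of consecutive non-network failures:
    
      #include <cstdio>
      
      int main() {
        const int kMaxFailuresBeforeBackoff = 3;   // stand-in for the gflag
        const int kNumFailuresToResetProxy = kMaxFailuresBeforeBackoff * 10;
        const int kTotalFailedHeartbeats = 100;
      
        int old_resets = 0;
        int new_resets = 0;
        for (int failures = 1; failures <= kTotalFailedHeartbeats; ++failures) {
          // Old condition: true on every failure once the threshold is reached.
          if (failures >= kMaxFailuresBeforeBackoff) {
            ++old_resets;
          }
          // New condition: true only on every kNumFailuresToResetProxy-th failure.
          if (failures % kNumFailuresToResetProxy == 0) {
            ++new_resets;
          }
        }
        // Prints "old: 98 resets, new: 3 resets" for 100 consecutive failures.
        printf("old: %d resets, new: %d resets\n", old_resets, new_resets);
        return 0;
      }
    
    With the test's FLAGS_heartbeat_interval_ms = 10 and that assumed
    threshold, the old condition amounted to a connection re-negotiation
    roughly every 10 ms, which matches the log above, while the new
    condition spaces proxy resets out to roughly one per 300 ms of
    continuous failures.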
    
    For testing, I just verified that the TS heartbeater no longer behaves
    as described above under the same scenarios and conditions.
    
    Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
    Reviewed-on: http://gerrit.cloudera.org:8080/12647
    Tested-by: Kudu Jenkins
    Reviewed-by: Will Berkeley <wd...@gmail.com>
---
 src/kudu/tserver/heartbeater.cc | 19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/src/kudu/tserver/heartbeater.cc b/src/kudu/tserver/heartbeater.cc
index 0b1587f..62856fd 100644
--- a/src/kudu/tserver/heartbeater.cc
+++ b/src/kudu/tserver/heartbeater.cc
@@ -588,10 +588,23 @@ void Heartbeater::Thread::RunThread() {
           << Substitute("Failed to heartbeat to $0 ($1 consecutive failures): $2",
                         master_address_.ToString(), consecutive_failed_heartbeats_, err_msg);
       consecutive_failed_heartbeats_++;
-      // If we encountered a network error (e.g., connection
-      // refused), try reconnecting.
+
+      // Reset master proxy if too many heartbeats failed in a row. The idea
+      // is to do so when HBs have already backed off from the 'fast HB retry'
+      // behavior. This might be useful in situations where NetworkError isn't
+      // going to be received from the remote side anytime soon, so resetting
+      // the proxy is a viable alternative to try.
+      //
+      // The 'num_failures_to_reset_proxy' is the number of consecutive errors
+      // to happen before the master proxy is reset again.
+      const auto num_failures_to_reset_proxy =
+          FLAGS_heartbeat_max_failures_before_backoff * 10;
+
+      // If we encountered a network error (e.g., connection refused) or
+      // there were too many consecutive errors while sending heartbeats since
+      // the proxy was reset last time, try reconnecting.
       if (s.IsNetworkError() ||
-          consecutive_failed_heartbeats_ >= FLAGS_heartbeat_max_failures_before_backoff) {
+          consecutive_failed_heartbeats_ % num_failures_to_reset_proxy == 0) {
         proxy_.reset();
       }
       string msg;
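
A consequence of the modulo form above: since consecutive_failed_heartbeats_ is
incremented before the check, the first proxy reset due to non-network errors now
happens only after num_failures_to_reset_proxy (i.e. 10x
FLAGS_heartbeat_max_failures_before_backoff) consecutive failures, whereas
previously it happened as soon as FLAGS_heartbeat_max_failures_before_backoff
failures had accumulated; a NetworkError still triggers a reset right away in
both versions.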