Posted to commits@kudu.apache.org by al...@apache.org on 2019/04/02 21:17:42 UTC
[kudu] 03/04: [TS heartbeater] avoid reconnecting to master too often
This is an automated email from the ASF dual-hosted git repository.
alexey pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/kudu.git
commit 0ce9b05d046da01b7b098818fe5e42c1f40e9ac2
Author: Alexey Serbin <al...@apache.org>
AuthorDate: Fri Mar 1 17:19:18 2019 -0800
[TS heartbeater] avoid reconnecting to master too often
With this patch, the heartbeater thread in tservers no longer resets
its master proxy and reconnects to the master (re-negotiating the
connection) on every heartbeat under certain conditions. In particular,
that happened if the master was accepting connections and responding to
Ping RPC requests, but could not yet process TS heartbeats because
it was still bootstrapping.
For example, when running the RemoteKsckTest.TestClusterWithLocation
test scenario in TSAN builds, I sometimes saw log messages like the
following (the test sets FLAGS_heartbeat_interval_ms = 10):
I0301 20:29:11.932394 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.944639 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.946904 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.960994 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.964995 3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.972220 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.974987 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.988946 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.991653 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.003091 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.017015 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.017540 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.031175 3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.031175 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.046165 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.059644 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.073026 3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.075335 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.077802 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.089138 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.101193 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.102268 3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.104634 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.118392 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.132237 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.147235 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.165709 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.171120 3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.179481 3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.191591 3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
It turned out that the counter of consecutive failed heartbeats kept
increasing because the master was responding with ServiceUnavailable
to incoming TS heartbeats. Once FLAGS_heartbeat_max_failures_before_backoff
consecutive errors had occurred, the prior version of the code reset
the master proxy on every subsequent failed heartbeat, and that was
the reason behind the frequent reconnections to the master.
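The difference between the old and the new reset conditions can be
sketched in isolation as follows. This is a minimal illustration, not
the Kudu code itself: kMaxFailuresBeforeBackoff is a hypothetical
stand-in for FLAGS_heartbeat_max_failures_before_backoff (the value 3
is an assumption made for the example), and ShouldResetProxyOld,
ShouldResetProxyNew and CountResets are names invented for this sketch.

```cpp
#include <cassert>

// Hypothetical stand-in for FLAGS_heartbeat_max_failures_before_backoff.
// The value 3 is an assumption made for this sketch only.
constexpr int kMaxFailuresBeforeBackoff = 3;

// Old condition: once the threshold is reached, every further
// consecutive failure resets the proxy.
bool ShouldResetProxyOld(bool is_network_error, int consecutive_failures) {
  return is_network_error ||
         consecutive_failures >= kMaxFailuresBeforeBackoff;
}

// New condition: without a network error, reset only when the failure
// counter hits a multiple of (threshold * 10).
bool ShouldResetProxyNew(bool is_network_error, int consecutive_failures) {
  const int num_failures_to_reset_proxy = kMaxFailuresBeforeBackoff * 10;
  return is_network_error ||
         consecutive_failures % num_failures_to_reset_proxy == 0;
}

// Counts how many times the proxy would be reset over a run of
// 'num_failures' consecutive non-network errors (for example,
// ServiceUnavailable responses from a bootstrapping master).
template <typename Pred>
int CountResets(Pred should_reset, int num_failures) {
  int resets = 0;
  int consecutive_failures = 0;
  for (int i = 0; i < num_failures; ++i) {
    // The counter is incremented before the check, as in the patch.
    ++consecutive_failures;
    if (should_reset(/*is_network_error=*/false, consecutive_failures)) {
      ++resets;
    }
  }
  return resets;
}
```

Under these assumptions, 100 consecutive ServiceUnavailable responses
trigger 98 proxy resets with the old condition and 3 with the new one
(at failures 30, 60 and 90), while a NetworkError still resets the
proxy immediately in both versions.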
For testing, I just verified that the TS heartbeater no longer behaves
as described above under the same scenarios and conditions.
Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Reviewed-on: http://gerrit.cloudera.org:8080/12647
Tested-by: Kudu Jenkins
Reviewed-by: Will Berkeley <wd...@gmail.com>
---
src/kudu/tserver/heartbeater.cc | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/src/kudu/tserver/heartbeater.cc b/src/kudu/tserver/heartbeater.cc
index 0b1587f..62856fd 100644
--- a/src/kudu/tserver/heartbeater.cc
+++ b/src/kudu/tserver/heartbeater.cc
@@ -588,10 +588,23 @@ void Heartbeater::Thread::RunThread() {
<< Substitute("Failed to heartbeat to $0 ($1 consecutive failures): $2",
master_address_.ToString(), consecutive_failed_heartbeats_, err_msg);
consecutive_failed_heartbeats_++;
- // If we encountered a network error (e.g., connection
- // refused), try reconnecting.
+
+ // Reset master proxy if too many heartbeats failed in a row. The idea
+ // is to do so when HBs have already backed off from the 'fast HB retry'
+ // behavior. This might be useful in situations when NetworkError isn't
+ // going to be received from the remote side any time soon, so resetting
+ // the proxy is a viable alternative to try.
+ //
+ // The 'num_failures_to_reset_proxy' is the number of consecutive errors
+ // to happen before the master proxy is reset again.
+ const auto num_failures_to_reset_proxy =
+ FLAGS_heartbeat_max_failures_before_backoff * 10;
+
+ // If we encountered a network error (e.g., connection refused) or
+ // there were too many consecutive errors while sending heartbeats since
+ // the proxy was reset last time, try reconnecting.
if (s.IsNetworkError() ||
- consecutive_failed_heartbeats_ >= FLAGS_heartbeat_max_failures_before_backoff) {
+ consecutive_failed_heartbeats_ % num_failures_to_reset_proxy == 0) {
proxy_.reset();
}
string msg;