Posted to reviews@kudu.apache.org by "Alexey Serbin (Code Review)" <ge...@cloudera.org> on 2019/03/02 02:21:44 UTC

[kudu-CR] [TS heartbeater] avoid reconnecting to master too often

Alexey Serbin has uploaded this change for review. ( http://gerrit.cloudera.org:8080/12647 )


Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................

[TS heartbeater] avoid reconnecting to master too often

With this patch, the heartbeater thread in tservers don't reset
its master proxy and reconnect to master (re-negotiating a connection)
every heartbeat if the master is accepting connections and Ping RPC
requests but isn't able to properly respond to TS heartbeats.

E.g., when running RemoteKsckTest.TestClusterWithLocation test scenario
for TSAN builds, I sometimes saw log messages like the following
(the test sets FLAGS_heartbeat_interval_ms = 10):

I0301 20:29:11.932394  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.944639  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.946904  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.960994  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.964995  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.972220  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.974987  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.988946  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.991653  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.003091  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.017015  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.017540  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.031175  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.031175  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.046165  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.059644  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.073026  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.075335  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.077802  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.089138  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.101193  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.102268  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.104634  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.118392  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.132237  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.147235  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.165709  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.171120  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.179481  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.191591  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221

It turned out the counter of consecutively failed heartbeats kept
increasing while the master was responding with ServiceUnavailable
to incoming TS heartbeats.

Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
---
M src/kudu/tserver/heartbeater.cc
1 file changed, 18 insertions(+), 4 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/47/12647/1
-- 
To view, visit http://gerrit.cloudera.org:8080/12647
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>

[kudu-CR] [TS heartbeater] avoid reconnecting to master too often

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/12647 )

Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................


Patch Set 1:

(6 comments)

http://gerrit.cloudera.org:8080/#/c/12647/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12647/1//COMMIT_MSG@9
PS1, Line 9: don't
> "won't" or "doesn't"
Done


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc
File src/kudu/tserver/heartbeater.cc:

http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@592
PS1, Line 592: Pretty arbitrary number to determine
> Can you reword this? "Pretty arbitrary" makes this sound like not a whole l
Done


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@597
PS1, Line 597: (heartbeat_rpc_timeout / heartbeat_interval)
             :       //     time interval
> This isn't a timeout-- it's a unitless quantity, the number of heartbeats e
yep, that should have been heartbeat_rpc_timeout; sorry for the mess.


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@600
PS1, Line 600: num_consecutive_failures_proxy_reset
> Would be nice to describe what this variable is. Something like "the period
Done


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@607
PS1, Line 607: s.IsNetworkError() 
> How did you test this? I thought I saw cases where this actually was a netw
I'm not sure what you want me to test, but I added a corresponding explanation to the commit message.

In my scenario I clearly saw it was not a network error, but ServiceUnavailable on each heartbeat while the master was bootstrapping.  However, the proxy was reset every heartbeat after FLAGS_heartbeat_max_failures_before_backoff ServiceUnavailable errors in a row.  And that's exactly what I'm addressing here -- we don't want to reset the proxy and incur connection negotiation latency plus a 'Ping' RPC after each heartbeat in such a case.
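
To make the above concrete, here is a minimal, purely illustrative C++ sketch of the decision being discussed.  It is not the actual heartbeater.cc change; the function name ShouldResetProxy and its parameters are hypothetical, standing in for the identifiers quoted elsewhere in this review (s.IsNetworkError(), consecutive_failed_heartbeats_, num_consecutive_failures_proxy_reset):

  #include <cstdint>

  // Illustrative only: decide whether the heartbeater should reset its
  // master proxy after a failed heartbeat.  Instead of resetting on every
  // failure past the backoff threshold (the old behavior), reset only on a
  // genuine network error or once per num_consecutive_failures_proxy_reset
  // consecutive failures, so a master that keeps answering with
  // ServiceUnavailable does not trigger a re-connection every heartbeat.
  bool ShouldResetProxy(bool is_network_error,
                        int64_t consecutive_failed_heartbeats,
                        int64_t num_consecutive_failures_proxy_reset) {
    return is_network_error ||
           consecutive_failed_heartbeats % num_consecutive_failures_proxy_reset == 0;
  }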


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@607
PS1, Line 607: consecutive_failed_heartbeats_ %
             :             num_consecutive_failures_proxy_reset == 0
> Could you split the line so each condition goes on a separate line?
Done



-- 
To view, visit http://gerrit.cloudera.org:8080/12647
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Thu, 07 Mar 2019 22:29:59 +0000
Gerrit-HasComments: Yes

[kudu-CR] [TS heartbeater] avoid reconnecting to master too often

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/12647 )

Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................

[TS heartbeater] avoid reconnecting to master too often

With this patch, the heartbeater thread in tservers doesn't reset
its master proxy and reconnect to master (re-negotiating a connection)
every heartbeat under certain conditions.  In particular, that happened
if the master was accepting connections and responding to Ping RPC
requests, but was not able to process TS heartbeats properly because
it was still bootstrapping.

E.g., when running RemoteKsckTest.TestClusterWithLocation test scenario
for TSAN builds, I sometimes saw log messages like the following
(the test sets FLAGS_heartbeat_interval_ms = 10):

I0301 20:29:11.932394  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.944639  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.946904  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.960994  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.964995  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.972220  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.974987  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.988946  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.991653  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.003091  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.017015  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.017540  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.031175  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.031175  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.046165  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.059644  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.073026  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.075335  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.077802  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.089138  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.101193  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.102268  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.104634  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.118392  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.132237  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.147235  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.165709  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.171120  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.179481  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.191591  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221

It turned out the counter of consecutively failed heartbeats kept
increasing because the master was responding with ServiceUnavailable
to incoming TS heartbeats.  The prior version of the code reset the
master proxy on every failed heartbeat once
FLAGS_heartbeat_max_failures_before_backoff consecutive errors had
occurred, and that was the reason behind the frequent re-connections
to the cluster.

For testing, I just verified that the TS heartbeater no longer behaves
as described above under the same scenarios and conditions.

Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Reviewed-on: http://gerrit.cloudera.org:8080/12647
Tested-by: Kudu Jenkins
Reviewed-by: Will Berkeley <wd...@gmail.com>
---
M src/kudu/tserver/heartbeater.cc
1 file changed, 16 insertions(+), 3 deletions(-)

Approvals:
  Kudu Jenkins: Verified
  Will Berkeley: Looks good to me, approved

-- 
To view, visit http://gerrit.cloudera.org:8080/12647
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 3
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>

[kudu-CR] [TS heartbeater] avoid reconnecting to master too often

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12647 )

Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................


Patch Set 1:

(5 comments)

http://gerrit.cloudera.org:8080/#/c/12647/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12647/1//COMMIT_MSG@9
PS1, Line 9: don't
"won't" or "doesn't"


http://gerrit.cloudera.org:8080/#/c/12647/1//COMMIT_MSG@11
PS1, Line 11: is accepting connections and Ping RPC
            : requests but isn't able to properly respond to TS heartbeats
Looking at the code, why does this situation return a NetworkError? Shouldn't it be a ServiceUnavailable error?


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc
File src/kudu/tserver/heartbeater.cc:

http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@595
PS1, Line 595: * At least once per (heartbeat_rpc_timeout ^ 2 / heartbeat_interval)
             :       //     time interval; the worst case is when every HB request times out.
Where does this factor in? What's the idea behind this formula? I don't get it.


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@597
PS1, Line 597: (heartbeat_rpc_timeout / heartbeat_interval)
             :       //     time interval
This isn't a timeout-- it's a unitless quantity, the number of heartbeats expected per timeout interval. What you are saying in the code is that at least the expected number of heartbeats that would happen in one timeout interval should fail consecutively before resetting the proxy.
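
For a concrete feel (using purely illustrative numbers, not necessarily the actual flag defaults): with heartbeat_rpc_timeout = 15000 ms and heartbeat_interval = 1000 ms, the quantity is 15000 / 1000 = 15, i.e. at least 15 heartbeats would have to fail back-to-back before the proxy is reset.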


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@607
PS1, Line 607: consecutive_failed_heartbeats_ %
             :             num_consecutive_failures_proxy_reset == 0
Could you split the line so each condition goes on a separate line?



-- 
To view, visit http://gerrit.cloudera.org:8080/12647
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 05 Mar 2019 17:46:43 +0000
Gerrit-HasComments: Yes

[kudu-CR] [TS heartbeater] avoid reconnecting to master too often

Posted by "Andrew Wong (Code Review)" <ge...@cloudera.org>.
Andrew Wong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12647 )

Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................


Patch Set 1:

(3 comments)

http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc
File src/kudu/tserver/heartbeater.cc:

http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@592
PS1, Line 592: Pretty arbitrary number to determine
Can you reword this? "Pretty arbitrary" makes this sound like not a whole lot of thought was put into this.


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@600
PS1, Line 600: num_consecutive_failures_proxy_reset
Would be nice to describe what this variable is. Something like "the period, in number of failures, that dictates the frequency at which we will reset the proxy" or something?


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@607
PS1, Line 607: s.IsNetworkError() 
How did you test this? I thought I saw cases where this actually was a network error.



-- 
To view, visit http://gerrit.cloudera.org:8080/12647
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 05 Mar 2019 18:35:07 +0000
Gerrit-HasComments: Yes

[kudu-CR] [TS heartbeater] avoid reconnecting to master too often

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Alexey Serbin has posted comments on this change. ( http://gerrit.cloudera.org:8080/12647 )

Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................


Patch Set 1:

(2 comments)

http://gerrit.cloudera.org:8080/#/c/12647/1//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12647/1//COMMIT_MSG@11
PS1, Line 11: is accepting connections and Ping RPC
            : requests but isn't able to properly respond to TS heartbeats
> Looking at the code, why does this situation return a NetworkError? Shouldn
This situation didn't return a NetworkError.  It returned a ServiceUnavailable error.


http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc
File src/kudu/tserver/heartbeater.cc:

http://gerrit.cloudera.org:8080/#/c/12647/1/src/kudu/tserver/heartbeater.cc@595
PS1, Line 595: * At least once per (heartbeat_rpc_timeout ^ 2 / heartbeat_interval)
             :       //     time interval; the worst case is when every HB request times out.
> Where does this factor in? What's the idea behind this formula? I don't get
The most appropriate thing, I thought, was to get rid of that condition altogether and keep only the network-error check.  However, I'm a bit concerned that in some bad cases working with a proxy might time out due to a network problem (without manifesting as a NetworkError as-is), and re-creating the proxy might help to overcome that situation.

Basically, I would opt for having something like 'consecutive errors for more than X seconds' as a second criterion to reset the master proxy.
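
As a rough sketch of what that second, duration-based criterion might look like (hypothetical code, not part of this patch; the names are illustrative):

  #include <chrono>

  // Reset the proxy once heartbeats have been failing continuously for
  // longer than a configured duration, regardless of the error type.
  bool ShouldResetProxyByDuration(
      std::chrono::steady_clock::time_point first_failure_time,
      std::chrono::steady_clock::time_point now,
      std::chrono::seconds max_consecutive_failure_duration) {
    return (now - first_failure_time) > max_consecutive_failure_duration;
  }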



-- 
To view, visit http://gerrit.cloudera.org:8080/12647
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 1
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 05 Mar 2019 21:32:34 +0000
Gerrit-HasComments: Yes

[kudu-CR] [TS heartbeater] avoid reconnecting to master too often

Posted by "Will Berkeley (Code Review)" <ge...@cloudera.org>.
Will Berkeley has posted comments on this change. ( http://gerrit.cloudera.org:8080/12647 )

Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................


Patch Set 2: Code-Review+2


-- 
To view, visit http://gerrit.cloudera.org:8080/12647
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>
Gerrit-Comment-Date: Tue, 02 Apr 2019 16:56:46 +0000
Gerrit-HasComments: No

[kudu-CR] [TS heartbeater] avoid reconnecting to master too often

Posted by "Alexey Serbin (Code Review)" <ge...@cloudera.org>.
Hello Will Berkeley, Kudu Jenkins, Andrew Wong, 

I'd like you to reexamine a change. Please visit

    http://gerrit.cloudera.org:8080/12647

to look at the new patch set (#2).

Change subject: [TS heartbeater] avoid reconnecting to master too often
......................................................................

[TS heartbeater] avoid reconnecting to master too often

With this patch, the heartbeater thread in tservers doesn't reset
its master proxy and reconnect to master (re-negotiating a connection)
every heartbeat under certain conditions.  In particular, that happened
if the master was accepting connections and responding to Ping RPC
requests, but was not able to process TS heartbeats properly because
it was still bootstrapping.

E.g., when running RemoteKsckTest.TestClusterWithLocation test scenario
for TSAN builds, I sometimes saw log messages like the following
(the test sets FLAGS_heartbeat_interval_ms = 10):

I0301 20:29:11.932394  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.944639  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.946904  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.960994  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.964995  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.972220  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.974987  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.988946  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:11.991653  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.003091  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.017015  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.017540  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.031175  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.031175  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.046165  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.059644  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.073026  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.075335  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.077802  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.089138  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.101193  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.102268  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.104634  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.118392  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.132237  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.147235  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.165709  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.171120  3819 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.179481  3746 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221
I0301 20:29:12.191591  3671 heartbeater.cc:345] Connected to a master server at 127.3.75.254:36221

It turned out the counter of consecutively failed heartbeats kept
increasing because the master was responding with ServiceUnavailable
to incoming TS heartbeats.  The prior version of the code reset the
master proxy on every failed heartbeat once
FLAGS_heartbeat_max_failures_before_backoff consecutive errors had
occurred, and that was the reason behind the frequent re-connections
to the cluster.

For testing, I just verified that the TS heartbeater no longer behaves
as described above under the same scenarios and conditions.

Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
---
M src/kudu/tserver/heartbeater.cc
1 file changed, 16 insertions(+), 3 deletions(-)


  git pull ssh://gerrit.cloudera.org:29418/kudu refs/changes/47/12647/2
-- 
To view, visit http://gerrit.cloudera.org:8080/12647
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: kudu
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I961ae453ffd6ce343574ce58cb0e13fdad218078
Gerrit-Change-Number: 12647
Gerrit-PatchSet: 2
Gerrit-Owner: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Alexey Serbin <as...@cloudera.com>
Gerrit-Reviewer: Andrew Wong <aw...@cloudera.com>
Gerrit-Reviewer: Kudu Jenkins (120)
Gerrit-Reviewer: Will Berkeley <wd...@gmail.com>