Posted to issues@kudu.apache.org by "Adar Dembo (JIRA)" <ji...@apache.org> on 2017/03/13 19:06:42 UTC
[jira] [Commented] (KUDU-1934) tservers aggressively try to reconnect to masters

    [ https://issues.apache.org/jira/browse/KUDU-1934?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15922750#comment-15922750 ] 

Adar Dembo commented on KUDU-1934:
----------------------------------

The way it works today: a tserver retries a failed heartbeat request in a tight loop, up to heartbeat_max_failures_before_backoff times (default: 3). If all of those attempts fail, it backs off and retries once every heartbeat_interval_ms (default: 1000 ms).
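
To make the current behavior concrete, here is a minimal sketch of that bimodal policy as a delay function. It is an illustration only, not the code in heartbeater.cc: the function name and the retries_so_far counter are hypothetical, and it assumes "tight loop" means a zero delay and that the retry count restarts after a successful heartbeat. Only the two flag names and their defaults come from the description above.

```python
def bimodal_delay_ms(retries_so_far,
                     max_failures_before_backoff=3,
                     heartbeat_interval_ms=1000):
    """Return how long to wait before the next heartbeat retry.

    retries_so_far is a hypothetical counter: the number of retries
    already attempted since the last successful heartbeat.
    """
    if retries_so_far < 0:
        raise ValueError("retries_so_far must be non-negative")
    # Phase 1: retry immediately (assumed zero delay) for a fixed
    # number of attempts.
    if retries_so_far < max_failures_before_backoff:
        return 0
    # Phase 2: fall back to the regular heartbeat interval, forever.
    return heartbeat_interval_ms


if __name__ == "__main__":
    # Delays before each of the first six retries with the default flags.
    print([bimodal_delay_ms(n) for n in range(6)])
```

With the defaults, the first three retries happen back to back and every later retry waits a full second, which is the "tight, then periodic" shape described above.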

Are you saying that this bimodal backoff (tight for a fixed number of attempts, then periodic forever) should be replaced with something smoother, like a linear or exponential backoff? If so, why?

I'm asking because there's an argument to be made for a tserver to reestablish connectivity with the masters fairly quickly, so that any work destined for that tserver (such as deleting redundant replicas, or adding new replicas) can be handled promptly.
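
For comparison, the smoother alternatives mentioned above could look roughly like the sketch below. Everything here is hypothetical: the function names, the 50 ms base and step, and the choice to cap the delay at 1000 ms (the heartbeat_interval_ms default) are assumptions for illustration, not existing Kudu flags or behavior.

```python
def linear_delay_ms(retries_so_far, step_ms=50, cap_ms=1000):
    """Delay grows by a fixed step per retry, up to cap_ms."""
    return min(cap_ms, step_ms * (retries_so_far + 1))


def exponential_delay_ms(retries_so_far, base_ms=50, cap_ms=1000):
    """Delay doubles with each retry, up to cap_ms."""
    return min(cap_ms, base_ms * (2 ** retries_so_far))


if __name__ == "__main__":
    # Delays before each of the first seven retries under each policy.
    print([linear_delay_ms(n) for n in range(7)])
    print([exponential_delay_ms(n) for n in range(7)])
```

Capping either policy at the heartbeat interval keeps the worst-case reconnect delay the same as today, so the tradeoff is only in how quickly the first few retries are issued.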


> tservers aggressively try to reconnect to masters
> -------------------------------------------------
>
>                 Key: KUDU-1934
>                 URL: https://issues.apache.org/jira/browse/KUDU-1934
>             Project: Kudu
>          Issue Type: Bug
>          Components: tserver
>    Affects Versions: 1.3.0
>            Reporter: Jean-Daniel Cryans
>              Labels: newbie
>
> Related to KUDU-1933, I had mismatched 1.3 snapshots between the master and the tservers which caused them to try to reconnect to the master infinitely. Since they do it as fast as they can, the logs were quickly full of:
> {noformat}
> I0307 23:55:21.228502 70832 heartbeater.cc:291] Connected to a master server at ve0120.halxg.cloudera.com:7051
> I0307 23:55:21.228528 70832 heartbeater.cc:359] Registering TS with master...
> I0307 23:55:21.228865 70832 heartbeater.cc:389] Master ve0120.halxg.cloudera.com:7051 requested a full tablet report, sending...
> W0307 23:55:21.346961 70832 heartbeater.cc:499] Failed to heartbeat to ve0120.halxg.cloudera.com:7051: Remote error: Failed to send heartbeat to master: Not authorized: invalid CSR: CSR did not contain expected username. (CSR: '' RPC: 'kudu')
> I0307 23:55:22.347733 70832 heartbeater.cc:291] Connected to a master server at ve0120.halxg.cloudera.com:7051
> I0307 23:55:22.347757 70832 heartbeater.cc:359] Registering TS with master...
> I0307 23:55:22.348042 70832 heartbeater.cc:389] Master ve0120.halxg.cloudera.com:7051 requested a full tablet report, sending...
> W0307 23:55:22.467021 70832 heartbeater.cc:499] Failed to heartbeat to ve0120.halxg.cloudera.com:7051: Remote error: Failed to send heartbeat to master: Not authorized: invalid CSR: CSR did not contain expected username. (CSR: '' RPC: 'kudu')
> {noformat}
> Sounds like we should do backoff retries.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)