Posted to issues@kudu.apache.org by "Alexey Serbin (JIRA)" <ji...@apache.org> on 2018/03/27 00:54:00 UTC

[jira] [Comment Edited] (KUDU-2354) In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed

    [ https://issues.apache.org/jira/browse/KUDU-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414832#comment-16414832 ] 

Alexey Serbin edited comment on KUDU-2354 at 3/27/18 12:53 AM:
---------------------------------------------------------------

And another issue to look at: do follower masters continue retrying those tasks once they have switched from the leader role to the follower role?


was (Author: aserbin):
And another issue to look at: do follower masters continue to retry those tasks once they switched from the leader to the follower role?

> In case of 3-4-3 scheme and 3 tablet servers, catalog manager endlessly retries operations to add a replacement replica even if replacement is no longer needed
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: KUDU-2354
>                 URL: https://issues.apache.org/jira/browse/KUDU-2354
>             Project: Kudu
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 1.7.0
>         Environment: 3 tservers in the cluster, single master (?)
>            Reporter: Alexey Serbin
>            Priority: Major
>
> In a scenario reported by [~adar], 100 iterations of the following command were run:
> {noformat}
> kudu perf loadgen --keep-auto-table --table-num-buckets=40 --num-rows-per-thread=1 --table-num-replicas=3
> {noformat}
> That took about 10-15 minutes to complete, and for some reason ksck reported UNAVAILABLE tablets for 5-10 minutes afterwards.  Most likely, due to the spike in IO activity, tablet leaders didn't receive heartbeats from some replicas and tried to replace them.  After some time, the cluster stabilized (no problems reported by ksck), but the following messages continued to appear in the master's log:
> {noformat}
> I0315 13:52:00.871310 106157 catalog_manager.cc:3234] Sending ChangeConfig:ADD_PEER:NON_VOTER on tablet 2776eb10c241426e90ddf7354260ee04 (attempt 22)
> I0315 13:52:00.871354 106157 catalog_manager.cc:2700] Scheduling retry of ChangeConfig:ADD_PEER:NON_VOTER RPC for tablet 2776eb10c241426e90ddf7354260ee04 with cas_config_opid_index -1 with a delay of 60018 ms (attempt = 22)
> {noformat}
> Of course, with just 3 tservers in the cluster, no attempt to add a replacement non-voter replica can succeed, but it would make sense to stop retrying those operations once a tablet's OpId index is far ahead of the cas_config_opid_index of the operation being retried.
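
The stop condition proposed above could be sketched roughly as follows. This is only an illustration in Python, not Kudu code: RetryTask, should_stop_retrying and max_index_lag are hypothetical names, and the threshold of 100 is an arbitrary assumption. How a cas_config_opid_index of -1 (as in the log above) should be treated is left open; the sketch simply compares the two numbers.

```python
from dataclasses import dataclass


@dataclass
class RetryTask:
    """Hypothetical stand-in for a pending ChangeConfig retry (not a Kudu type)."""
    tablet_id: str
    cas_config_opid_index: int
    attempt: int


def should_stop_retrying(task, tablet_opid_index, max_index_lag=100):
    """Return True when the tablet's OpId index is far ahead of the task's CAS index.

    'Far ahead' is taken to mean a gap strictly greater than max_index_lag,
    which is an assumed, tunable threshold.
    """
    return tablet_opid_index - task.cas_config_opid_index > max_index_lag


# Example: a retry recorded against index 5 while the tablet is now at index 500.
task = RetryTask("2776eb10c241426e90ddf7354260ee04", cas_config_opid_index=5, attempt=22)
stale = should_stop_retrying(task, tablet_opid_index=500)
```

With a check of this kind, the retry loop would drop the task instead of scheduling attempt 23, on the reasoning that the configuration it was meant to change has long since been superseded.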


