You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@kudu.apache.org by "Alexey Serbin (Jira)" <ji...@apache.org> on 2019/09/13 20:54:00 UTC

[jira] [Updated] (KUDU-2800) Avoid 'unintended' re-replication of long-bootstrapping tablet replicas

     [ https://issues.apache.org/jira/browse/KUDU-2800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexey Serbin updated KUDU-2800:
--------------------------------
    Description: 
As implemented in
https://github.com/apache/kudu/blob/10ea0ce5a636a050a1207f7ab5ecf63d178683f5/src/kudu/consensus/consensus_queue.cc#L576 , the logic for tracking 'health' of tablet replicas cannot differentiate between bootstrapping and failed replicas.

As a result, if a tablet replica is bootstrapping for times longer than the interval specified by {{--follower_unavailable_considered_failed_sec}} run-time flag, the system can start the process of re-replication of the tablet replica elsewhere.

One option might be sending a specific error with {{ConsensusResponsePB}} in response to a Raft message sent by a leader replica, maybe adding extra information on the current progress of the replica bootstrap process.  As soon as such bootstrapping follower replica isn't failing behind leader's WAL GC threshold, the leader replica will not evict it.  But if the bootstrapping follower replica falls behind the WAL GC threshold, leader replica will evict it and the system will start re-replicating it elsewhere.  In cases when the amount of Raft transactions for a tablet is low, this approach would allow for longer bootstrapping times of tablet replicas.  That might be especially beneficial in cases when a tablet server with IO-heavy tablet replicas is being restarted, and there aren't many incoming updates/inserts for tablets hosted by the tablet server.

However, the approach above requires the Raft consensus object for a bootstrapping replica to be at least partially functional, so it entails reading at least some information about a replica from the on-disk consensus metadata prior to proper bootstrapping of a tablet replica by a tablet server.


  was:
As implemented in
https://github.com/apache/kudu/blob/10ea0ce5a636a050a1207f7ab5ecf63d178683f5/src/kudu/consensus/consensus_queue.cc#L576 , the logic for tracking 'health' of tablet replicas cannot differentiate between bootstrapping and failed replicas.

As a result, if a tablet replica is bootstrapping for times longer than the interval specified by {{--follower_unavailable_considered_failed_sec}} run-time flag, the system can start the process of re-replication of the tablet replica elsewhere.

One option might be sending a special {{PeerStatus}} for a bootstrapping replica with a response to a Raft message sent by a leader replica and updating the logic referenced above.  The response might also include additional information on the current progress of the bootstrap process.  Probably, we need add a separate timeout to track a stale bootstrapping replica, so its health would be reported as FAILED after the leader observes the replica being stuck in bootstrapping with no forward progress for a time interval longer than the timeout specified by the new parameter.

However, the approach above requires the Raft consensus object for a bootstrapping replica to be at least partially functional, so it entails reading at least some information about a replica from the on-disk consensus metadata prior to proper bootstrapping of a tablet replica by a tablet server.




> Avoid 'unintended' re-replication of long-bootstrapping tablet replicas
> -----------------------------------------------------------------------
>
>                 Key: KUDU-2800
>                 URL: https://issues.apache.org/jira/browse/KUDU-2800
>             Project: Kudu
>          Issue Type: Improvement
>          Components: consensus, tserver
>    Affects Versions: 1.7.0, 1.8.0, 1.7.1, 1.9.0, 1.9.1, 1.10.0
>            Reporter: Alexey Serbin
>            Assignee: Vladimir Verjovkin
>            Priority: Major
>              Labels: newbie
>
> As implemented in
> https://github.com/apache/kudu/blob/10ea0ce5a636a050a1207f7ab5ecf63d178683f5/src/kudu/consensus/consensus_queue.cc#L576 , the logic for tracking 'health' of tablet replicas cannot differentiate between bootstrapping and failed replicas.
> As a result, if a tablet replica is bootstrapping for times longer than the interval specified by {{--follower_unavailable_considered_failed_sec}} run-time flag, the system can start the process of re-replication of the tablet replica elsewhere.
> One option might be sending a specific error with {{ConsensusResponsePB}} in response to a Raft message sent by a leader replica, maybe adding extra information on the current progress of the replica bootstrap process.  As soon as such bootstrapping follower replica isn't failing behind leader's WAL GC threshold, the leader replica will not evict it.  But if the bootstrapping follower replica falls behind the WAL GC threshold, leader replica will evict it and the system will start re-replicating it elsewhere.  In cases when the amount of Raft transactions for a tablet is low, this approach would allow for longer bootstrapping times of tablet replicas.  That might be especially beneficial in cases when a tablet server with IO-heavy tablet replicas is being restarted, and there aren't many incoming updates/inserts for tablets hosted by the tablet server.
> However, the approach above requires the Raft consensus object for a bootstrapping replica to be at least partially functional, so it entails reading at least some information about a replica from the on-disk consensus metadata prior to proper bootstrapping of a tablet replica by a tablet server.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)