Posted to issues@kudu.apache.org by "Todd Lipcon (JIRA)" <ji...@apache.org> on 2016/05/11 19:53:12 UTC

[jira] [Commented] (KUDU-1408) Adding a replica may never succeed if copying tablet takes longer than the log retention time

    [ https://issues.apache.org/jira/browse/KUDU-1408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15280703#comment-15280703 ] 

Todd Lipcon commented on KUDU-1408:
-----------------------------------

Trying to think through a few ideas for solving this issue:

*Maybe we should change consensus to anchor the log as far back as its farthest-behind follower?*
In the case of a follower that has been added but has not yet successfully replicated anything, we would consider it "infinitely in the past", essentially disabling WAL GC.

Upsides:
- definitely would solve this problem

Downsides:
- we currently see it as a feature that, if one of the replicas isn't keeping up, we evict it and replace it with a new replica. This is good for countering "limplock" issues (http://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf). This change would break that: we'd never evict a replica that was 'limping along'; instead it would just fall farther and farther behind. We could certainly design around this with heuristics to estimate whether a replica is falling behind or catching up, but it's not trivial.
- there's a bit of hidden complexity here: only the leader currently knows how far behind the other replicas are. So if there's a leader re-election, the new leader is likely to have already GCed the old logs, in which case we're back to the same issue. For this to work, the leader would probably need to propagate the anchor along with its heartbeats; a rough sketch follows below.
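
As a rough illustration (the names and structures below are invented for this sketch, not the actual Kudu consensus code), the leader could compute its WAL anchor as the minimum index any voter still needs, treating a peer with nothing replicated as needing everything, and ship that value in its heartbeats:

    // Hypothetical sketch -- invented names, not the real Kudu API.
    #include <algorithm>
    #include <cstdint>
    #include <limits>
    #include <vector>

    struct PeerState {
      int64_t last_replicated_index;  // -1 if nothing replicated yet
    };

    // The leader's WAL may be GCed only up to (not including) this index.
    int64_t ComputeAnchorIndex(const std::vector<PeerState>& peers) {
      int64_t anchor = std::numeric_limits<int64_t>::max();
      for (const PeerState& p : peers) {
        // A freshly added peer counts as "infinitely far in the past":
        // it still needs index 0 onward, which disables GC entirely.
        int64_t needed = p.last_replicated_index < 0
                             ? 0
                             : p.last_replicated_index + 1;
        anchor = std::min(anchor, needed);
      }
      return anchor;  // ship this in heartbeats, not just leader-local state
    }

If the anchor rides along on heartbeats, a newly elected leader starts from the old leader's anchor rather than from its own local view, which addresses the re-election problem above.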

*Would a "grace period" of retention after a remote bootstrap be sufficient?*

For example, after remote bootstrapping a new node, should we just hold the remote bootstrap session's log anchor for some number of minutes before releasing it? This has the same issue noted above: only the remote bootstrap "source" node would hold the anchor, so it wouldn't work if there were a leader election during the catch-up time.
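
For illustration only (the names below are hypothetical, not the existing tablet copy code), the deferred release might look like this:

    // Hypothetical sketch -- not the real Kudu tablet copy code.
    #include <chrono>
    #include <functional>
    #include <thread>

    // Instead of dropping the session's log anchor the moment the copy
    // finishes, release it after a grace period so the new replica has a
    // window in which the source still retains the segments it needs.
    void ReleaseAnchorAfterGracePeriod(std::function<void()> release_anchor,
                                       std::chrono::minutes grace) {
      // A real server would schedule this on an existing timer/reactor
      // thread; a detached thread here just illustrates the deferral.
      std::thread([release_anchor = std::move(release_anchor), grace] {
        std::this_thread::sleep_for(grace);
        release_anchor();
      }).detach();
    }

Even with the grace period, the anchor lives only on the copy source, so a leader change during that window still defeats it.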

*What if "catch up" is impossible? Do we need to actually slow down the majority to let the new follower catch up?*

In the case that the tablet is handling the "max throughput" of writes, it seems quite plausible that a follower trying to catch up can't process the writes any faster than they were originally processed by the two "good" nodes. You might expect the follower to be slightly faster in most cases because it receives the writes in large batches, but in some cases it could also be slower, since its caches will be cold compared to the active replicas.
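
A quick back-of-envelope check (pure illustration, not Kudu code) makes the feasibility condition explicit: catch-up is possible only when the follower's apply rate exceeds the leader's append rate, and the catch-up time is the current lag divided by that difference:

    // Hypothetical sketch -- invented names, not Kudu code.
    #include <cstdint>

    // Returns the estimated seconds until the follower catches up, or a
    // negative value if, at these rates, it never will.
    double EstimateCatchUpSecs(int64_t lag_ops,
                               double leader_append_ops_per_sec,
                               double follower_apply_ops_per_sec) {
      double net = follower_apply_ops_per_sec - leader_append_ops_per_sec;
      if (net <= 0) return -1;  // lag grows or holds steady: never catches up
      return lag_ops / net;
    }

When the difference is zero or negative, no amount of extra log retention helps; the only real levers are throttling the majority or accepting that the replica never rejoins.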

----

It would be interesting to look into how some other systems handle this - both consensus-based systems and async log-shipping replication systems likely face the same problem: a follower may be unable to process operations at the same rate as the leader or the majority.

> Adding a replica may never succeed if copying tablet takes longer than the log retention time
> ---------------------------------------------------------------------------------------------
>
>                 Key: KUDU-1408
>                 URL: https://issues.apache.org/jira/browse/KUDU-1408
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus, tserver
>    Affects Versions: 0.8.0
>            Reporter: Todd Lipcon
>            Priority: Critical
>
> Currently, while a remote bootstrap session is in progress, we anchor the logs from the time at which the session started. However, as soon as the session finishes, we drop the anchor, and the old log segments become eligible for deletion. In the case where the tablet copy itself takes longer than the log retention period, a scenario like the following is likely:
> - TS A starts downloading from TS B. It plans to download segments 1-4 and adds an anchor.
> - TS B handles writes for 20 minutes, rolling the log many times (e.g. up to log segment 20)
> - TS A finishes downloading, and ends the remote bootstrap session
> - TS B no longer has an anchor, so it GCs log segments 1-16.
> - TS A finishes opening the tablet it just copied, but immediately is unable to catch up (because it only has segments 1-4, but the leader only has 17-20)
> - TS B evicts TS A
> This loop will repeat essentially forever, until the write workload on TS B stops.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)