Posted to issues@kudu.apache.org by "Jean-Daniel Cryans (JIRA)" <ji...@apache.org> on 2016/03/25 22:06:25 UTC

[jira] [Updated] (KUDU-1278) Tablets that take >5 minutes to copy will never remote bootstrap

     [ https://issues.apache.org/jira/browse/KUDU-1278?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jean-Daniel Cryans updated KUDU-1278:
-------------------------------------
    Target Version/s: 0.7.0  (was: 0.8.0)

> Tablets that take >5 minutes to copy will never remote bootstrap
> ----------------------------------------------------------------
>
>                 Key: KUDU-1278
>                 URL: https://issues.apache.org/jira/browse/KUDU-1278
>             Project: Kudu
>          Issue Type: Bug
>          Components: consensus
>    Affects Versions: 0.6.0
>            Reporter: Todd Lipcon
>            Assignee: Binglin Chang
>            Priority: Blocker
>             Fix For: 0.7.0
>
>
> [~decster] and I debugged this issue on his cluster. One of the servers had been shut down due to bad RAM, which triggered remote bootstrap of all of its tablets to create new replicas.
> While a new follower is in the process of bootstrapping, the leader replica continues to try to replicate operations to it. Each attempt causes the leader to try to trigger remote bootstrap again, which fails with a "Remote bootstrap already in progress" error. The leader considers this to be an unsuccessful communication with the follower. After 5 minutes of receiving this error, the leader decides that the follower is dead, evicts it, and requests another new replica. When the evicted replica finishes copying, it finds out that it has been evicted and deletes everything it just copied. This cycle repeats forever.
> We need to fix the leader so that, as long as the remote bootstrapping replica is making progress, we don't consider it dead.
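
The cycle described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Kudu's actual code: the names FollowerTracker, simulate, and the "ALREADY_IN_PROGRESS" error string are hypothetical, and the leader is modeled as contacting the follower once per second. It compares the reported behavior (the error counts as a failed communication) with one possible reading of the proposed fix (the error counts as evidence that the follower is alive).

```python
from dataclasses import dataclass
from typing import Optional

# The 5-minute window from the report, in seconds.
EVICTION_TIMEOUT_S = 5 * 60


@dataclass
class FollowerTracker:
    """Hypothetical leader-side bookkeeping for a single follower."""

    last_ok_s: float  # last time the leader counted the follower as alive
    count_bootstrap_as_progress: bool  # False models the reported behavior

    def record_response(self, now_s: float, error: Optional[str]) -> None:
        if error is None:
            self.last_ok_s = now_s
        elif error == "ALREADY_IN_PROGRESS" and self.count_bootstrap_as_progress:
            # Assumption: a follower that answers "already in progress"
            # is alive and busy copying, so refresh its liveness timestamp.
            self.last_ok_s = now_s
        # Any other error leaves last_ok_s untouched.

    def should_evict(self, now_s: float) -> bool:
        return now_s - self.last_ok_s > EVICTION_TIMEOUT_S


def simulate(copy_duration_s: float,
             count_bootstrap_as_progress: bool,
             heartbeat_s: float = 1.0) -> bool:
    """Return True if the follower is evicted before its copy finishes."""
    tracker = FollowerTracker(0.0, count_bootstrap_as_progress)
    now = 0.0
    while now < copy_duration_s:
        now += heartbeat_s
        # Every contact during the copy gets the "already in progress" error.
        tracker.record_response(now, "ALREADY_IN_PROGRESS")
        if tracker.should_evict(now):
            return True
    return False


if __name__ == "__main__":
    print(simulate(600, count_bootstrap_as_progress=False))  # 10-minute copy, reported behavior
    print(simulate(600, count_bootstrap_as_progress=True))   # 10-minute copy, sketched fix
```

Note that this sketch treats any "already in progress" reply as progress, so a bootstrap that hangs would never be evicted. A real fix would presumably need to check that the copy is actually advancing, as the last sentence of the description asks.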



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)