Posted to issues@kudu.apache.org by "Grant Henke (Jira)" <ji...@apache.org> on 2020/09/28 19:44:00 UTC

[jira] [Resolved] (KUDU-3191) Fail tablet replicas that suffer from KUDU-2233 instead of crashing

     [ https://issues.apache.org/jira/browse/KUDU-3191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Henke resolved KUDU-3191.
-------------------------------
    Fix Version/s: 1.14.0
       Resolution: Fixed

> Fail tablet replicas that suffer from KUDU-2233 instead of crashing
> -------------------------------------------------------------------
>
>                 Key: KUDU-3191
>                 URL: https://issues.apache.org/jira/browse/KUDU-3191
>             Project: Kudu
>          Issue Type: Task
>          Components: compaction
>            Reporter: Andrew Wong
>            Assignee: Andrew Wong
>            Priority: Major
>             Fix For: 1.14.0
>
>
> KUDU-2233 results in persisted corruption that breaks an invariant and crashes the server. Recovering from this corruption is arduous, especially when multiple tablet replicas on a given server suffer from it: users typically start the server, see the crash, remove the affected replica manually via tooling, and restart, repeating the cycle until the server comes up healthy.
> Instead, we should consider treating this as we do CFile block-level corruption[1] and fail the tablet replica. At best, we recover from a non-corrupted replica. At worst, we end up with multiple corrupted replicas, which is still better than today's outcome: multiple corrupted replicas plus unavailable servers that lead to excessive re-replication.
> [1] https://github.com/apache/kudu/commit/cf6927cb153f384afb649b664de1d4276bd6d83f



--
This message was sent by Atlassian Jira
(v8.3.4#803005)